Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

An in silico procedure for generating protein-mediated chromatin interaction data and comparison of significant interaction calling methods

  • Shuyuan Lou,

    Roles Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Interdisciplinary Ph.D. Program in Biostatistics, The Ohio State University, Columbus, OH, United States of America

  • Shili Lin

    Roles Conceptualization, Funding acquisition, Methodology, Supervision, Writing – original draft, Writing – review & editing

    shili@stat.osu.edu

    Affiliations Interdisciplinary Ph.D. Program in Biostatistics, The Ohio State University, Columbus, OH, United States of America, Department of Statistics, The Ohio State University, Columbus, OH, United States of America, Translational Data Analytics Institute, The Ohio State University, Columbus, OH, United States of America

Abstract

The ability to simulate high-throughput data with high fidelity to real experimental data is fundamental for benchmarking methods used to detect true long-range chromatin interactions mediated by a specific protein. Yet, such tools are not currently available. To fill this gap, we develop an in silico experimental procedure, ChIA-Sim, which imitates the experimental procedures that produce real ChIA-PET, Hi-ChIP, or PLAC-seq data. We show the fidelity of ChIA-Sim to real data by using guiding characteristics of several real datasets to generate data using the simulation procedure. We also used ChIA-Sim data to demonstrate the use of our in silico procedure in benchmarking methods for significant interactions analysis by evaluating four methods for significant interaction calling (SIC). In particular, we assessed each method’s performance in terms of correct identification of long-range interactions. We further analyzed four experimental datasets from publicly available databases and shew that the trend of the results are consistent with those seen in data generated from ChIA-Sim. This serves as additional evidence that ChIA-Sim closely resembles data produced from the experimental protocols it models after.

Introduction

Chromosomal 3D interactions have been studied to investigate genome structures and functions in the past two decades [15]. Advances in biological technologies have contributed to deepened understanding of the 3D structure of a genome. The first assay for quantifying the frequency of physical contacts was chromosome conformation capture (3C) [6], a low-throughput method for two pre-specified DNA fragments. A number of 3C-derivatives have been developed, including chromosome conformation capture-on-chip (4C) [7], chromosome conformation capture carbon copy (5C) [8], and Hi-C [9], the last being a high-throughput assay that produces genome-wide contact maps. All these 3C-based methods aim to capture chromatin interactions regardless of the associated proteins in the binding complex.

There are also protein-centric methods targeting genome-wide chromatin interactions mediated by a specific protein of interest. One of the earliest such methods is ChIP-3C [10], also known as ChIP-loop, which combines chromatin immunoprecipitation (ChIP) with 3C to facilitate detection of protein-specific chromatin interactions. The protocol of ChIP-3C for preparing proximal ligation is similar to 3C, including crosslinking and restriction enzyme digestion, but the crosslinked chromatin is first ChIP-enriched before ligation. Nevertheless, it is still a 3C-based method that can only capture the interactions between two specified segments. With the introduction of pair-end tags (PET) sequencing technology, Chromatin Interaction Analysis with PET (ChIA-PET) was developed [11], which provides a comprehensive assay for detecting whole-genome chromatin interactions mediated by a specific protein; for example, ERα. ChIA-PET combines ChIP, linker ligation, and PET sequencing, where ChIP is used to identify transcription factor binding sites; linker ligation is used to connect DNA fragments captured by crosslinking; and PET is used to extract linked pairs for sequencing. The protocol of ChIA-PET has been optimized to improve the signal-to-noise ratio [12]. It replaces the restriction enzyme digestion by sonication for chromatin fragmentation as sonication can shake off the non-specific interactions, thereby reducing the background noise level. Recent advances on protein-centric interactions detection follow the principles of in situ Hi-C [13] that performs the ligation step before cell lysis to enhance the signal-to-noise ratio; examples include HiChIP [14] and proximity ligation-assisted ChIP-seq (PLAC-seq) [15]. HiChIP was proposed to improve ChIA-PET by applying ChIP on the in situ Hi-C library. PLAC-seq was introduced in the same year as HiChIP but with slightly different protocols in the library preparation step. Both HiChIP and PLAC-seq are more efficient and accurate than ChIA-PET in that they use fewer cells and output more contact reads.

Current experimental results from ChIA-PET, HiChIP, and PLAC-seq are indispensable resources for understanding micro-scale mechanisms of long-range and spatial gene regulations. It has been found that particular protein signaling and genome structural variation may alter the long-range regulatory mechanism, hence causing certain diseases [16, 17], especially cancer [10, 1820]. Output from these experimental procedures is a list of long-range interaction frequencies (counts) between pairs of anchors (genomic segments), which may or may not be on the same chromosome (Table 1 provides a partial list from a biological experiment as an example).

thumbnail
Table 1. Partial list of output from a ChIA-PET experiment (GSM832464).

https://doi.org/10.1371/journal.pone.0287521.t001

However, despite technological improvements, identified interactions—those listed in Table 1 for instance—are potentially mixed with noise, especially those with small counts. Nevertheless, an interacting pair with a small interaction count is not necessarily indicative of a false pair, as other factors also play a role, making the inference problem challenging. To distill real signals, a number of statistical methods have been developed for analyzing the data to weed out the noise and obtain a curated list of interacting pairs, referred to as significant interactions calling (SIC) [2124]. Attempts have been made to compare different methods, but most of such comparisons rely entirely on real data, for which the “ground-truth” is unknown. For the few comparisons that do use simulated data, the simulation models are typically contrived and unrealistic, such as the use of a Poisson model for generating count data and treat them as long-range interactions [25]. While most of these methods claim to have better performance in detecting true chromosomal interactions than the rest, a fair simulation study is lacking. In particular, there does not exist any benchmarking data sets with known ground truth.

In this paper, we describe ChIA-Sim, an in silico experimental protocol built on a realistic simulation scheme to mimic the biological experimental protocols of ChIA-PET, Hi-ChIP, or PLAC-seq, so that data simulated represent realistic specific-protein-mediated long-range interaction data. Since this in silico procedure will follow the major steps in real biological experimental protocols, we expect data simulated from ChIA-Sim will truly reflect real data and will provide us a tool for carrying out unbiased evaluations of SIC methods.

As an illustration of the usage of ChIA-Sim, we will use data generated from it to compare the performance of four methods for SIC analysis by assessing the pros and cons and their operational characteristics. Since this paper focuses on the ChIA-Sim data generating procedure, our evaluation of the four method is merely serving as an example. There are a number of other existing methods, and methods that become available in the future may also be assessed similarly with known ground truth ChIA-Sim in sicilo data. In addition, we will demonstrate the fidelity of the generated data from ChIA-Sim to real ChIA-PET data using key characteristics from several datasets as input to the simulation procedure. We will also analyze these real datasets to substantiate the observations seen in the simulation study regarding the performance characterstics of the four SIC methods, leading to suggestions on the potential of devising improved methods.

Materials and methods

ChIA-Sim in silico procedure

In this section, we present the ChIA-Sim in silico procedure. The biological experimental assays that are currently available for generating protein-mediated long-range interaction data—ChIA-PET, HiChIP, and PLAC-seq—all consist of five major steps: cross-linking, restriction enzyme digestion, ligation, sonication, and pair-end sequencing. Although the more recent assays, HiChIP [15] and PLAC-seq [14], follow the principles of in situ Hi-C [13] to enhance the signal-to-noise ratio, the five steps described in our in silico procedure are representative of all three biological experimental procedures. Specifically, in ChIA-Sim, each of these five steps is simulated in silico to mimic the steps of the biological experimental procedures (Fig 1), and is described as follows.

thumbnail
Fig 1. CHIA-Sim in silico experimental procedure.

The five steps are described in detailed in the text.

https://doi.org/10.1371/journal.pone.0287521.g001

Step 1: Cross-linking. To imitate the biological experimental step of cross-linking, we first randomly select NE (set denoted as E) and NP (set denoted as P) sites from the known sets of TFBS (enhancer) and TSS (promoter) sites, respectively, where the transcription factor binding sites (TFBS) are determined based on the protein of interest, whereas the transcription start sites (TSS) are obtained from a genome assembly, such as hg19. The reason for this random selection is to reduce the size of the data, if repeated simulations are performed for a large number of times, while maintaining the distribution of TFBS/TSS across chromosomes. However, we note that a user may set NE and NP to be the actual numbers in the genome to scale up their simulation. To induce random collisions (noise, non-promoter-enhancer interactions), we also randomly select (set denoted as ) and (set denoted as ) sites that are neither TFBS nor TSS. Each of the sites in is paired with each of those in , completing the cross-linking step. In particular, an E × P pairing is treated as a “true” promoter-enhancer interacting pair, whereas the other pairings are simply “noise,” representing random collisions.

Step 2: Restriction enzyme digestion. In the biological experimental protocols, the genome may be cut into fragments using a restriction enzyme. In our in silico procedure, we adopt HindIII (AAGCTT) as the default, as it is a frequent choice in real experiments. However, we have designed ChIA-Sim to be flexible so that users may supply the DNA pattern of any desired enzyme. We also note that in the original ChIA-PET protocol, this step was performed via sonication, while more recent assays such as HiChIP and PLAC-seq use restriction enzyme digestion. Therefore, we have also designed ChIA-Sim to provide the “sonication” option by using a Poisson process to randomly break chromosomes into pieces.

Step 3: Ligation. To imitate the step in an biological experiment where ligations between one end of each of two different fragments—fixed by cross-linking—take place to form “loops”, we connect one end of a piece from with a piece from , setting different probabilities, including having different probability for intra-chromosomal pairings (TFBS and TSS sites from the same chromosome) versus inter-chromosomal pairings (TFBS and TSS sites from two different chromosomes) and different probabilities between the E × P pairings versus the other types of pairings. These probabilities can be used to control the noise level (i.e. proportion of random collisions, and the degrees of difficulty in SIC analysis. Parameter values for controlling these probabilities can be set by a user.

Step 4: Sonication. To mimic the sonication (random break) process in ChIA-PET, a Poisson process is applied to each loop formed in the previous step by cutting it into fragments. Specifically, the number of cut sites on a loop is randomly sampled from a Poisson distribution with the mean set to be the length of the loop divided by a typical fragment length—250 bp as the default but it can be changed by a user. Then the locations of the cut sites are sampled independently from a Uniform distribution. The loop is then “cut” into fragments according to the chosen cut sites, and fragments that contain the ligation sites are kept and sequenced in the next step.

Step 5: Paired-end Sequencing. Reads of a certain length from the two ends of each kept fragments are obtained and mapped to the reference genome. The default on the read length is set in ChIA-Sim, but it can also take a user’s input. Anchors (peaks of mapped fragments) are identified, and counts of chromatin interactions are recorded.

At the end of these five steps, ChIA-Sim outputs a list of interacting pairs with counts, like those from a real ChIA-PET experiment (Table 1). Two different formatting options are provided and can be converted to one another depending on the need of the user; detailed can be found in the Vignette of the software package, which is available on GitHub (https://github.com/sl-lin/ChIASim.git).

Methods for SIC analysis

Methods have been developed for SIC analysis of ChIA-PET data. The first such tool is ChIA-PET tool [21] which is a pipeline including a hyper-geometric model for SIC. One shortcoming of ChIA-PET tool is that it only uses interaction counts to make decisions, while other factors that also play critical roles are not taken into account. One such important factor is the genomic distance between the two anchors of an interacting pair, where it has been documented that there is a power-law relationship between interaction frequencies and genomic distances [11, 26]. To address the deficiency of CPT, MICC [22] and Mango [23] were developed to account for the contribution of genomic distance in their decision-making process. On the other hand, DNA segments that are involved in interactions usually involve enhancers, containing TFBS of the protein of interest, and promoters, immediate upstream of TSS. To take such information into consideration, MDM [24] incorporates the distance of the two anchors in a pair to their closest TFBS and TSS. More recently, two pipelines, CID [27] and MACPET [28], were developed for simultaneous peak calling and SIC. The innovations of these two methods focus on the peak calling step, while their SIC analysis make use of existing tools described above: CID uses MICC while MACPET uses Mango. A number of other pipelines have also been developed for processing data generated from HiChIP and PLAC-seq [2931], although we note that the SIC step in these pipelines are also all similar to those proposed for ChIA-PET data, including the direct use of Mango [31]. From the above discussion, it is clear that the SIC step can be used interchangeably for data generated from ChIA-PET, HiChIP, or PLAC-seq.

Our focus of demonstrating the benchmarking utility of data from ChIA-Sim is on SIC analysis tools; therefore, we just consider the four methods discussed above that proposed novel SIC methods originally proposed for ChIA-PET data (Table 2). Taking the four methods selected as a totality, several factors that are highly informative for SIC analysis are included and highlighted in the table for benchmarking and comparing the methods, although none of the methods possess all the characteristics. First, the marginal count of an anchor (Table 2, Column 1)—the number of all pairings the anchor is involved in—is used by all method. Mango [23] does not consider pairings that are inter-chromosomal, although the existence and biological significance of such interactions have been extensively discussed and documented in the literature (e.g. [11, 21, 22, 24, 32]) and the rest of the methods considered do consider them (Table 2, Column 2). Genomic distance between the two anchors of an interacting pair is also an important feature for SIC analysis, as there may exist a power law governing the inverse relationship between genomic distances and interaction counts [9, 11]: the closer the two anchors in the 1D space the higher the interaction count. Thus, the genomic distance between two anchors can be used as a bias-correction measure for SIC analysis; MICC and Mango incorporate this information into their procedures (Table 2, Column 3). On the other hand, prior information on the particular protein of interest, the salient feature of any protein-mediated assays, offers the opportunity to use genomic annotations TFBS and TSS for distilling true promoter-enhancer interacting pairs [24], and MDM incorporates this information (Table 2, Column 4).

thumbnail
Table 2. Characteristics of methods—Ability (✓) to integrate important features in addition to interaction counts data.

https://doi.org/10.1371/journal.pone.0287521.t002

In the modeling front, CPT, the first published pipeline for analyzing ChIA-PET data, models the interaction count between a pair of anchors using a hypergeometric distribution that makes use of marginal counts to complete its specification of the parameters of the distribution [21]. SIC inference is determined by calculating the raw, then FDR-adjusted p-values. MDM assumes that the interaction count of an observed pair follows a mixture of two truncated Poisson distributions, with one component modeling the interaction count for true pairs and the other for false pairs due to random collision [24], where the mean interaction frequency of a true pair is assumed to be larger than that of a false pair. Note that the truncation can accommodate the inclusion of only pairs whose interaction counts are above a certain threshold, such as greater than 1, a commonly used data preprocessing strategy [23, 33]. A hierarchical modeling scheme is used for integrating additional data, including marginal counts and genomic annotation on promoters and enhancers, for setting up prior distributions.

For the SIC analysis in Mango [23], there are two preprocessing steps to remove some observed interacting pairs: first, all inter-chromosomal interactions and those intra-chromosomal interactions with the genetic distance between the pair greater than a user-defined value (default: 1Mb); then, those that are determined to be of self-ligations. Binomial distribution is then used to model the interactions between two anchors for the remaining pairs. The probability of an interaction between a pair of anchors is estimated based on the genomic distance between them and the marginal counts. Based on raw p-values, family-wise error rates are obtained using FDR control. MICC [22] also models the observed counts using a mixture of distributions as in MDM, but further divides false interactions into two components, random collision and random ligation, where a random collision is assumed to be independent, while a random ligation is assumed to be dependent on the genomic distance between the two anchors. For counts from true pairs or random collision, the probability of an observed interaction is assumed to be inversely related to the genetic distance of the anchors, whereas the probability of an observed interaction due to random ligation is assumed to follow a hypergeometric distribution using the marginal counts similar to that of CPT. Model parameters are estimated using the Expectation-Maximization algorithm [34] and FDR control is applied to obtain family-wise error rates.

Results

Fidelity of ChIA-Sim to real experimental data

We demonstrate, in this subsection, how one may use real ChIA-PET data to provide guidance for ChIA-Sim so that the simulated data will resemble key characteristics of the real data it models after. First, three datasets were generated using ChIA-Sim following the characteristics of a K562 Pol2 ChIA-PET dataset (accession #: GSM832464). There are a total of 879, 264 interactions in the K562 Pol2 real data, which was used as the sequencing depth parameter N in ChIA-Sim. Among them, 1,792 pairs were commonly detected as true pairings by all four methods, CPT [21], MDM [24], MICC [22], and Mango [23], which were then taken as representing true interactions. For each pair, we found the nearest pair of promoter and enhancer to the two anchors. That is, for each of these 1,792 pairs, we identified the corresponding TFBS and TSS—for TSS, we used hg19 (GRCh37); for TFBS, we used a ChipSeq dataset (accession #: GSM935475) that contains TFBS targeting Pol2. The aggregate of these enhancers and promoters among all the pairs are then used as NE and NP, and they are 1359 and 1492, respectively. We then randomly selected 100 non-TSS and 100 non-TFBS loci; i.e. We considered two restriction enzymes, Mmel (scenarios A) and HindIII (scenario B), with the latter frequently used in long-range interaction experiments including Hi-C data [9] while the former used in the K562 Pol2 data we are modeling after. For the simulation using Mmel as the restriction enzyme, we further consider a larger set of potential false pairs by setting (scenario C). A summary of these three scenarios are provided in the left block of S1 Table in S1 File.

The result for scenario A, with Mmel as the restriction enzyme, the same as the real data, are presented in Fig 2(a), while those for scenarios B and C are presented as S1 Fig in S1 File. As can be seen from the figures, the distribution of the simulated counts (first boxplot) has similar profile to that of the observed counts (second boxplot) for all scenarios: both distributions are right skewed, with some large counts; the bottom 75% for both distributions cover a small range in the spectrum of counts. We then examined the genomic distance between the two locations of each pair of anchors (gDist), and found that the the middle 50% of the distances for the simulated data are consistent with those for the observed data (two middle boxplots). Resemblance between the simulated and the real data for another characteristic, the annotation distance (aDist) between the locations of each pair of anchors and the nearest pair of enhancer-promoter, is also observed (last two boxplots). Treating each boxplot in Fig 2(a) as representing a population, Wilcoxon rank-sum test was performed on 100 randomly selected observations from each of the populations to compare the counts, gDist, and aDist between the observed and the simulated distributions, with resulting p-values 0.92, 0.44, and 0.06, respectively.

thumbnail
Fig 2. Boxplots comparing several characteristics of data generated from ChIA-Sim with those of the real data based on which the simulation was modeled, for two real datasets.

The y-axis is marked on the log-scale: the first two boxplots denotes the interaction counts; the last four boxes denote two types of genomic distances as described in the paper, gDist and aDist. The labels with.obs indicate observed characteristics in the real data, whereas those with.sim presents the corresponding characteristics in the data generated from ChIA-Sim. (a) K562 Pol 2 Scenario A, (b) GM12878 CFCF; the specific parameter settings are provided in S1 Table in S1 File.

https://doi.org/10.1371/journal.pone.0287521.g002

These results demonstrate the fidelity of the simulated data from ChIA-Sim to the real data based on which the model parameters are set. In addition, the similarity between the results in all scenarios, A, B and C, indicates that setting different numbers of random sites, and , or using different restriction enzymes, HindIII or Mmel, has little influence on the important characteristics of the simulated data.

To further demonstrate the wider applicability of ChIA-Sim, we also generated data modeled after three other datasets: K562 CTCF (accession #: GSM970216), MCF7 CTCF (accession #: GSM970215), and GM12878 CTCF (accession #: GSM1872886). Compared to the K562 Pol2 data, the transcription factor CTCF binds more distally from gene promoters, and it is a highly conserved transcription factor that has significant implications in spatial gene regulation [35]. Taking GM12878, a lymphobastoid cell line, as an example, we generated interaction counts using the parameters provided in S1 Table in S1 File (last column).

As can be seen from the boxplots (Fig 2(b)), the distributions for the simulated counts (first boxplot), the genomic distance gDist (third boxplot), and the the annotation distance aDist (fifth boxplot), all resemble their observed counterparts (second, fourth, and sixth boxplots, respectively). Note that the simulated counts were scaled so that the total counts are the same as the observed data. The Wilcoxon rank-sum test led to p-values of 0.45, 0.11, and 0.38 for the counts, gDist, and aDist, respectively. The boxplots for the other other two datasets, both with the CTCF transcription factor, are provided in S2 Fig in S1 File, which also show similar key characteristics between the simulated and the observed data.

An example usage of ChIA-Sim

In this section, we demonstrate the usage of ChIA-Sim for producing benchmarking data to evaluate and compare four methods for SIC analysis: CPT [21], MDM [24], MICC [22], and Mango [23].

Setting.

We simulated datasets varying several parameters, including “sequencing depth” N, “number of true TSS loci” NP (transcription start sites, mimicking anchors in promoter regions), “number of true TFBS loci” NE (transcription factor binding sites, mimicking enhancer anchors), “number of non-TSS loci” (mimicking anchors involving in false pairings), and “number of non-TFBS loci” (also mimicking anchors involving in false pairings). In particular, N is an important parameter controlling the “sequencing depth,” whereas the other four parameters are involved in controlling the “noise” level: the larger the NP and NE relative to and , the smaller the noise level. The ratio between intra- and inter-chromosomal interactions may also be controlled by several other runtime parameters. Further, proportions of “true” interactions to “false” interactions (involving at least one non-TFBS or non-TSS anchor) may also be controlled. In this simulation, we set all “false” interactions to be between non-TFBS and non-TSS anchors. Among the “true” loops, one may also control the degree of “easiness” so that some loops are easier to be inferred correctly in the SIC analysis while others are more difficult. Specifically, to investigate the effect of sequencing depth, we considered four settings where N is set to be 1 × 105, 3 × 105, 6 × 105, and 1 × 106, labelled as settings 1, 2, 3, and 4, respectively, whereas . We also considered another setting, setting 5, where N = 1 × 106 and to study the effect of a larger number of sites. In addition, we considered one more setting, setting 6, where N = 1 × 106, (S2 Table in S1 File). This last setting was chosen as a representative to reflect the relative proportions of three types of sites for the three datasets with CTCF as the transcription factor: enhancer, promoter, and non-specific binding sites (S3 Table in S1 File), where the ratio of the number of enhancers to that of promoters ranges from 3 to 15 while the ratio of promoters to non-specific binding sites ranges from 0.4 to 49. As an aside, we note that there is a large difference between CTCF and Pol2, not only for the relative proportions of the three types of sites, but also for the types of pairs (S4 Table in S1 File).

We manipulated the ligation probabilities so that a true pairing between P and E is nine times more likely than the false pairing between and (random collisions). For each of the settings, we generated five replicates to gauge the consistency of results within each of the SIC methods being compared. For a thorough investigation, in addition to the originally generated datasets from ChIA-Sim, we also created corresponding datasets which only retain pairs that interact more than once (i.e. pairs with count = 1 in the list produced by ChIA-Sim are removed), as this is a common practice of data preprocessing in a number of SIC methods [21]. For ease of reference, datasets that have interacting counts greater than 1 are call the g1 datasets; for example, a dataset from setting 2g1 refers to one that was simulated under setting 2 but only keeping interacting counts that are greater than 1.

Comparison of four methods.

The results from applying four methods, CPT, MICC, MDM, and Mango, to the simulated data sets retaining counts greater than one (the g1 datasets) are provided in Table 3, where the false positive rate (FPR; identification of false loops due to random pairing) and power (identification of true loops due to long-range interactions) are shown. The results for those with all pairs (i.e. including pairs that only interacted once) are relegated to Supplementary S5 Table in S1 File, as we observe similar trends. We first note that CPT [21] and MICC [22] software packages do not provide a default threshold for decision making. Hence, we used an FDR of 0.05 for MICC as it was also used in their own study [22]; we used an FDR of 0.01 for CPT considering the FDRs from CPT were all very small in our results. On the other hand, both the MDM [24] and Mango [23] packages provide a default decision rule: MDM declares pairs with a posterior probability greater than 0.5 as true interacting; whereas Mango identifies those with FDR less than 0.05 and having interacting counts of at least 2 as true pairs. These default decision rules were used in our analysis. In Table 3 and S5 Table in S1 File, overall FPR and power, as well as their breakdown into intra- and inter-pairs, are provided as the mean and standard error (those in parentheses) over five replicates of our simulated data. As can be seen from the tables, CPT labeled almost all pairs as significant. For MICC, it shows good power, as high as 0.9 (for setting 4g1), but with unacceptably high FPRs, as high as 0.8 (for setting 4). Note that there were no results provided for MICC in settings 5 and 6 due to errors when running the software. Mango, on the other hand, cannot deal with inter-chromosomal pairs; hence, results are available only for intra pairs. We observe that it also has severely inflated FPRs, but generally with lower power than MICC or CPT. On the other hand, MDM has the lowest power but its FPR is very well controlled, albeit a bit conservative. In summary, across all the settings and replicates, CPT, MICC, and Mango all have higher power than MDM, but the higher power is offset by their severely inflated false positive rates. These results are consistent from replicate to replicate.

thumbnail
Table 3. Simulation results for pairs with counts > 1.

The average and standard deviation (within parentheses) across five replicates are provided for each setting and each method.

https://doi.org/10.1371/journal.pone.0287521.t003

For a better visualization of the results, and especially for understanding the effects of N on the outcomes, we plotted the mean power and mean FPR across the five replicates for each setting in Fig 3. First, we see that power generally increases when N increases (from setting 1 to setting 4, with or without including pairs that interact only once). Unfortunately, the FPR also increases with the increase of N, except for MDM where the FPR are always very low. On the other hand, when N is set to be the same but with either increasing or uneven numbers of sites (from setting 4 to setting 5 or setting 6), both CPT and MDM see a decrease in power regardless of whether the one-interacting pairs were included, whereas Mango sees an increase in power. It is interesting to observe that Mango has a steep decrease in FPR for setting 5, which is explainable by the greater occurrences of one-interacting pairs and such pairs, more likely to represent false loops, are already removed in Mango’s preprecessing steps. However, its FPR increases sharply for setting 6, which is in fact higher than its power. On the other hand, MICC was unsuccessful in obtaining results for the data in both settings 5 and 6 due to errors when running the software.

thumbnail
Fig 3. Comparison of methods based on mean power (left axis) and mean FPR (right axis).

Each line is the average power/FPR over the 5 replicates of each setting. (a) Datasets consisting of pairs with interaction counts greater than 1; (b) All pairs. No results were plotted for MICC with settings 5 and 6 due to errors when running the software.

https://doi.org/10.1371/journal.pone.0287521.g003

To provide more informative results and a fairer comparison among the methods, we obtain the power of the methods for each given false positive rate, and the result for replicate 1 for a range of false positive rates (plotted on the x-axis) are shown in Fig 4 and S3 Fig in S1 File for g1 pairs and all pairs, respectively, for the first four settings where results are available for all four methods. We can see that MDM appears to have the best performance overall, with much larger area under the curve (AUC) values for both g1 pairs (Table 4) and all pairs (S6 Table in S1 File) across all replicates. Nevertheless, we note that MICC and Mango can have higher power for very small FPRs for the scenario with g1 pairs (Fig 4). Further, CPT or Mango are seen to be generally inferior to MICC or MDM for the g1 data. However, for the data including all pairs, other than setting 1, MICC’s performance appears to be lagging.

thumbnail
Fig 4. Receiver operating characteristic (ROC) curves for comparing the methods—g1 pairs of replicate 1.

https://doi.org/10.1371/journal.pone.0287521.g004

thumbnail
Table 4. Performance evaluation based on AUC for pairs with counts > 1.

https://doi.org/10.1371/journal.pone.0287521.t004

Results for all replicates are consistent with the above discussion and are provided in S4 and S5 Figs in S1 File, which also include those for settings 5 and 6 for completeness, even though there were no results for MICC.

Analyses of experimental datasets

To further substantiate the observations of the operating characteristics of the four SIC analysis methods, we applied the same four methods to several experimental datasets, K562 Pol2, K562 CTCF, MCF7 CTCF, and GM12878 CTCF, the same four datasets used for our fidelity study in Section 3.1. For K562 Pol2, in addition to the all and g1 pairs as we did in the simulation study, we also consider the g2 pairs (i.e. removing pairs with one and two interaction counts), which may be viewed as more reliable so that we can more thoroughly compare the four methods. For K562 CTCF and MCF7 CTCF, we analyzed the g1 pairs, as these data were already filtered in the original study. For GM12878 CTCF, we analyzed g3 pairs, as pairs with at most three interaction counts were already removed in the available data. Throughout all the analyses, the significance calling thresholds for all the methods are the same as in the simulation study, but we also used the same FDR of 0.05 for CPT as for MICC to illustrate that setting a smaller FDR threshold for CPT did not change the overall conclusion. In the following, we describe, in detail, the results for K562 Pol2. The results for the other datasets show the same pattern and thus are qualitatively the same as those for K562 Pol2; therefore, they will be summarized together, with details given in the Supplementary.

We obtained the K562 Pol2 ChIA-PET data from the ChiaSig package [25]. This dataset contains intra-chromosomal interactions only, which is a subset of the original dataset that also contains inter-chromosomal pairs. We chose to restrict our analysis to this subset that only contains intra-chromosomal interactions since Mango can only handle intra-chromosomal data, and this provided a fairer comparison among the methods. One can see that MICC identifies 25.1, 47.4, and 53% of all, g1, and g2 pairs, respectively, as true interacting pairs, where the increase in the proportion reflects the increase in the reliability of the remaining pairs with a more and more stringent filtering criterion, although the increase from all to g1 is much larger (Table 5). For MDM, 11.3, 18.6, and 17.1% of all, g1, and g2 pairs, respectively, are called as true interacting pairs, with the trend similar to that of MICC—a large increase from all to g1 but almost the same, even a slight decrease, from g1 to g2. Despite the discrepancy in the number and proportions of true pairs identified, MICC and MDM have considerable agreement with each other: they agree on 80.3, 65.4, and 61.8% of callings for all, g1, and g2 pairs (Tables 5 and 6). For CPT, it called 83.7, 98.1, and 99.4% of all, g1, and g2 pairs as truly interacting, obviously an inflation but not as a surprise, since the result is qualitatively the same as that in the simulation study. Finally, Mango identified 1.2, 7.9, and 25% of all, g1, and g2 pairs as true interactions. Although Mango has at least a 68% agreement with MDM and 44% with MICC, we caution that these large agreements are due to the fact that most of the pairs are labelled false pairs by Mango. CPT also has a large agreement with MICC, especially for the g2 pairs. However, note that CPT calls practically 100% of the pairs as true interaction; therefore, these large agreements simply reflect the proportions of true pair callings by MICC.

thumbnail
Table 5. K562 Pol2: Percent of identified true pairs and pairwise agreement over total number of pairs1.

https://doi.org/10.1371/journal.pone.0287521.t005

thumbnail
Table 6. K562 Pol2: Cross-tabulation of the results for four methods1.

https://doi.org/10.1371/journal.pone.0287521.t006

Finally, we see that the results for using an FDR threshold of 0.05 for CPT with the g2 pairs data are practically the same as those using an FDR threshold of 0.01 (Tables 5(c), 5(d) and 6(c), 6(d)), because almost all pairs are labelled as true interacting pairs using both thresholds. Taken together, the large agreement of these results from analyzing a real dataset with those based on the ChIA-Sim data is a further indication that our ChIA-Sim in-silico procedure indeed mimics experimental ChIA-PET protocol.

For the three datasets with CTCF as the transcription factor, the results are given in S7–S9 Tables in S1 File. As extreme as with the K562 Pol2 data, CPT identified all pairs as true interacting pairs. For MDM and MICC, a moderate proportion of pairs are identified as true interactions, with MICC typically identifies a large proportion than MDM, especially for the GM12879 data, consistent with the K562 Pol2 results. MICC and MDM also have a large proportion of agreement among all three datasets. The results for Mango, on the other hand, depart markedly from that for the K562 Pol2 data, in that almost all pairs were identified as true interacting pairs with the CTCF transcription factor. For visual comparison of the performance between the real data and the simulated ones, we created ROC plots for the four experimental data, treating enhancer-promoter and enhancer-enhancer interactions as “true positives” and the rest as “false positives”. We included the enhancer-enhancer pairs as such pairings are biologically meaningful in forming protein complex for regulating gene expressions. It is interesting to see that, for the three datasets with CTCF as the transcription factor, the curves are very similar to those from the simulation study. However, for K562 Pol2, the result is a bit different from the rest, most notable being that the AUC for CPT is less than 0.5 (S6 Fig in S1 File).

Similar results between the simulation study and real data analysis notwithstanding, none of the methods utilize both important features in generating experimental ChIA-PET type data that can affect the performance of the results, namely, the gDist, the genomic distance between each pair of anchors, and the aDist, the genomic distancs from a pair of anchors to the nearest pair of promoter-enhancer. To understand how the results from each of the methods reflect their dependency on these two types of distances, for the K562 Pol2 data, we plotted interaction counts against gDist (Fig 5) and aDist (Fig 6), all on the log-scale, for the analysis results with g1 pairs. The scatter plots are superimposed by support vector machine (SVM) generated decision boundaries: dark grey, light grey, and white regions indicating non-significant, moderate-significant, and significant regions, respectively. That is, the SVM boundaries divide the plane into three regions. The white region contains pairs that are labelled as true interacting pairs (either small false discovery rate or large posterior probability); the dark grey region consists of pairs that have no evidence of being signifying true interactions (either with large false discovery rate or with small posterior probability); the light grey region is composed of the rest of the pairs that have some evidence of being a true pairs (moderately small false discovery rate or moderately large posterior probability). The light grey region is usually difficult to see because it is small (e.g. the MDM plot in Fig 6 shows a visible but small light-grey region); therefore, even though the SVM attempted to separate each plane into three regions, the middle one is typically very small, indicating that, for most pairs, there are strong evidence for them to be either a true interaction pair or not.

thumbnail
Fig 5. Scatterplot of log(gDist) vs. log(count), with the color gradient of the dots indicating FDR (for MICC, Mango, and CTP) or posterior probability (PP, for MDM).

Points in each of the plot are divided into three groups: (1) significant—FDR (0, 0.05) or PP (0.8, 1); (2) moderately-significant—FDR (0.05, 0.2) or PP (0.5, 0.8); (3) non-significant—FDR (0.2, 1) or PP (0, 0.5). The decision boundaries are generated by support vector machine using the R-package e1071 [36], leading to regions shaded in white (Region 1: significant), light-grey (Region 2: moderately significant), and dark grey (Region 3: non-significant), respectively.

https://doi.org/10.1371/journal.pone.0287521.g005

thumbnail
Fig 6. Scatterplot of log(aDist) vs. log(count), with the color gradient of the dots indicating FDR (for MICC, Mango, and CTP) or posterior probability (PP, for MDM).

Points in each of the plot are divided into three groups: (1) significant—FDR (0, 0.05) or PP (0.8, 1); (2) moderately-significant—FDR (0.05, 0.2) or PP (0.5, 0.8); (3) non-significant—FDR (0.2, 1) or PP (0, 0.5). The decision boundaries are generated by support vector machine using the R-package e1071 [36], leading to regions shaded in white (Region 1: significant), light-grey (Region 2: moderately significant), and dark grey (Region 3: non-significant), respectively.

https://doi.org/10.1371/journal.pone.0287521.g006

For MICC and Mango, which take gDist into account for inference, the results are as expected: since pairs with smaller gDist are easier to be randomly ligated in the experiment, they need a greater interaction count to be declared a true pair; likewise, when two anchors are far apart, that is, with a large gDist, we expect pairs with relatively fewer interaction counts to be identified as positive. This is evident in the sharp fall of the decision boundary for the pairs declared to be non-significant (Fig 5(a) and 5(b)). On the other hand, the decision boundaries for MDM are basically flat, which is also expected since it does not use gDist for inference (Fig 5(c)). For CPT, almost all pairs are classified as significance; hence, the decision boundaries are not visually apparent, and thus not gDist dependent, which is also not surprising given the procedure does not make use of gDist either. In contrast, except for MDM, none of the other methods show a systematic trend according to aDist in the decision boundaries (Fig 6(a), 6(b) and 6(d)). The gradual increase in the boundaries of MDM (Fig 6(c)) are anticipated as it assigns higher posterior probabilities to pairs that are closer to the nearest promoters and enhancers for a given interaction count.

Discussion and conclusion

The ability to simulate high throughput data with high fidelity to those from real biological experiments is fundamental for benchmarking methods for significant interaction calling analysis and other investigations related to understanding the 3D genome, and ChIA-Sim is a software tool proposed for such a purpose. It imitates, step-by-step, the experimental procedure that produces real specific-protein mediated chromatin interaction data. We demonstrated that one can extract salient features from a real dataset as input to ChIA-Sim, and produce in-silio protein-mediated chromatin interaction data that has high fidelity to the real data in the important features. Such data now possess the “ground truth,” a feature that is indispensable for fully benchmarking statistical analysis methods. Although many experimental datasets are available and easily accessible in the public domains and databases, the absolute “true positives” and “false positives” are usually unknown. In the literature, when simulation studies for SIC analyses were performed, they typically just drew data from some known distribution, such as Poisson, which is unrealistic. Therefore, we believe ChIA-Sim has the benefit of generating in silico data with high fidelity to real data and having known ground truth. In some way, it bridges the gap between real experimental data and simulation data from a simple distribution.

As an illustration of an usage of data generated from ChIA-Sim, we carried out a comprehensive evaluation of several SIC analysis methods, which yielded a wealth of information on their operational characteristics. Similar evaluation can be performed to benchmark other analysis methods, if one desires. Based on such results, one may identify relatively better performers so that recommendation can be provided to biological researchers. Characteristics underlying the better performers in terms of their utility for biological data may also be ascertained for development of improved methods; this is a particularly important benefit from using ChIA-Sim data based on what we have observed from benchmarking the four SIC analysis methods. Although we can see from the results that MICC and MDM perform considerably better compared to the other two methods, neither is satisfactory: MDM controls the FPR well, but its power is low; on the other hand, MICC has good power but unacceptably large FPR.

Specifically, our further analysis with four experimental dataset from ChIA-PET substantiates the findings about the operational characteristics of the four SIC analysis methods selected for comparison, and cements the utility of the in-silico data from ChIA-Sim for identifying potential issues. In particular, it is clearly seen that genomic distance between the two anchors of an interacting pair and their distance to the closest promoter and enhancer play integral roles in chromatin interaction, and their utilization can lead to better detections of truly interacting pairs that are biologically relevant. Inspired by the results seen in the real data analyses (Figs 5 and 6), we drew similar scatterplots and SVM boundaries for the simulation settings (S7 and S12 Figs in S1 File). One can see that, MICC and Mango, which utilize gDist (the genetic distance between the two anchors) have the tendency of only needing a smaller number of counts for a pair with a larger gDist to declare it truly interacting. On the other hand, MDM, which uses aDist (the annotation distance between the pair of anchors and the nearest pair of promoter-enhancer), requires a larger number of interaction counts to declare a pair with a larger aDist to be a true interacting pair. Since CPT does not utilize either of these two types of distances, the results are not affected at all by aDist or gDist. These results are consistent with those seen from the experimental data analyses, further demonstrating that the various model assumptions of the four methods are reflected in the observed relationships between gDist/aDist and interaction counts.

Nevertheless, none of the methods in the current literature, including those reviewed and compared in this paper, take both of these features into consideration in their SIC procedures. This lack of usage of additional and readily available information may be the most likely reason for the less than satisfactory performance of the methods compared. Finally, although we only focus on demonstrating the usage of ChIA-Sim data for benchmarking SIC analysis methods, data generated from ChIA-Sim can also be used for benchmarking statistical analysis methods for investigating other aspects of 3D chromosomal interactions.

Acknowledgments

The authors would like to thank the three anonymous reviewers for their constructive comments and suggestions.

References

  1. 1. Schoenfelder S, Fraser P. Long-range enhancer–promoter contacts in gene expression control. Nat Rev Genet. 2019;20:437–455. pmid:31086298
  2. 2. Lohia R, Fox N, Gillis J. A global high-density chromatin interaction network reveals functional long-range and trans-chromosomal relationships. Genome biology. 2022;23(1):238. pmid:36352464
  3. 3. Dekker J, Misteli T. Long-Range Chromatin Interactions. Cold Spring Harbor perspectives in biology. 2015;7(10):a019356. pmid:26430217
  4. 4. Fujita Y, Yamashita T. Spatial organization of genome architecture in neuronal development and disease. Neurochemistry international. 2018;119:49–56. pmid:28757389
  5. 5. Krumm A, Duan Z. Understanding the 3D genome: Emerging impacts on human disease. Seminars in cell and developmental biology. 2019;90:62–77. pmid:29990539
  6. 6. Dekker J, Rippe K, Dekker M, Kleckner N. Capturing Chromosome Conformation. Science. 2002;295(5558):1306–1311. pmid:11847345
  7. 7. Zhao Z, Tavoosidana G, Sjölinder M, Göndör A, Mariano P, Wang S, et al. Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nature genetics. 2006;38(11):1341–1347. pmid:17033624
  8. 8. Dostie J, Richmond TA, Arnaout RA, Selzer RR, Lee WL, Honan TA, et al. Chromosome Conformation Capture Carbon Copy (5C): A massively parallel solution for mapping interactions between genomic elements. Genome Research. 2006;16(10):1299–1309. pmid:16954542
  9. 9. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science. 2009;326(5950):289–293. pmid:19815776
  10. 10. Carroll JS, Liu XS, Brodsky AS, Li W, Meyer CA, Szary AJ, et al. Chromosome-Wide Mapping of Estrogen Receptor Binding Reveals Long-Range Regulation Requiring the Forkhead Protein FoxA1. Cell. 2005;122(1):33–43. pmid:16009131
  11. 11. Fullwood MJ, Liu MH, Pan YF, Liu J, Xu H, Mohamed YB, et al. An oestrogen-receptor-alpha-bound human chromatin interactome. Nature. 2009;462(7269):58–64. pmid:19890323
  12. 12. Fullwood M, Ruan Y. ChIP-based methods for the identification of long-range chromatin interactions. J Cell Biochem. 2009; p. 30–39. pmid:19247990
  13. 13. Rao SSP, Huntley MH, Durand NC, Stamenova EK, et al. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell. 2014;159(7):1665–1680. pmid:25497547
  14. 14. Mumbach M, et al. HiChIP: efficient and sensitive analysis of protein-directed genome architecture. Nat Methods. 2016;13:919–922. pmid:27643841
  15. 15. Fang R, Yu M, Li G, et al. Mapping of long-range chromatin interactions by proximity ligation-assisted ChIP-seq. Cell Res. 2016;26:1345–1348. pmid:27886167
  16. 16. Hugo P, Krijger L, de Laat W. Regulation of disease-associated gene expression in the 3D genome. Nature Reviews Molecular Cell Biology. 2016;17:771–782.
  17. 17. Siddappa M, Wani SA, Long MD, et al. Identification of transcription factor co-regulators that drive prostate cancer progression. Sci Rep. 2020; p. 20332. pmid:33230156
  18. 18. Wang Q, Li W, Zhang Y, Yuan X, Xu K, Yu J, et al. Androgen receptor regulates a distinct transcription program in androgen-independent prostate cancer. Cell. 2009;138(2):245–256. pmid:19632176
  19. 19. Yu J, Yu J, Mani RS, Cao Q, Brenner CJ, Cao X, et al. An Integrated Network of Androgen Receptor, Polycomb, and TMPRSS2-ERG Gene Fusions in Prostate Cancer Progression. Cancer Cell. 2010;17(5):443–454. pmid:20478527
  20. 20. Zhou Y, Gerrard DL, Wang J, Li T, Yang Y, Fritz AJ, et al. Temporal dynamic reorganization of 3D chromatin architecture in hormone-induced breast cancer and endocrine resistance. Nature Communications. 2019;10(1):1522. pmid:30944316
  21. 21. Li G, Fullwood MJ, Xu H, Mulawadi FH, Velkov S, Vega V, et al. ChIA-PET tool for comprehensive chromatin interaction analysis with paired-end tag sequencing. Genome Biology. 2010;11(2):R22. pmid:20181287
  22. 22. He C, Zhang MQ, Wang X. MICC: an R package for identifying chromatin interactions from ChIA-PET data. Bioinformatics. 2015;31(23):3832–3834. pmid:26231426
  23. 23. Phanstiel DH, Boyle AP, Heidari N, Snyder MP. Mango: a bias-correcting ChIA-PET analysis pipeline. Bioinformatics. 2015;31(19):3092–3098. pmid:26034063
  24. 24. Niu L, Li G, Lin S. Statistical Models for Detecting Differential Chromatin Interactions Mediated by a Protein. PLoS ONE. 2014;9(5):e97560. pmid:24835279
  25. 25. Paulsen J, Rodland E, Holden L, Holden M, Hovig E. A statistical model of ChIA-PET data for accurate detection of chromatin 3D interactions. Nucleic Acids Research. 2014;42. pmid:25114054
  26. 26. Mirny LA. The fractal globule as a model of chromatin architecture in the cell. Chromosome Research. 2011;19(1):37–51. pmid:21274616
  27. 27. Guo Y, Krismer K, Closser M, Wichterle H, Gifford DK. High resolution discovery of chromatin interactions. bioRxiv. 2018.
  28. 28. Vardaxis I, Drablos F, Rye MB, Lindqvist BH. MACPET: Model-based Analysis for ChIA-PET. bioRxiv. 2018.
  29. 29. Bhattacharyya S, Chandra V, Vijayanand P, Ay F. FitHiChIP: Identification of significant chromatin contacts from HiChIP data. bioRxiv. 2018.
  30. 30. Juric I, Yu M, Abnousi A, Raviram R, Fang R, Zhao Y, et al. MAPS: Model-based analysis of long-range chromatin interactions from PLAC-seq and HiChIP experiments. PLOS Computational Biology. 2019;15(4):1–24. pmid:30986246
  31. 31. Lareau CA, Aryee MJ. hichipper: a preprocessing pipeline for calling DNA loops from HiChIP data. Nature Methods. 2018;15:155–156. pmid:29489746
  32. 32. Miele A, Dekker J. Long-range chromosomal interactions and gene regulation. Mol BioSyst. 2008;4:1046–1057. pmid:18931780
  33. 33. Li G, Ruan X, Auerbach RK, Sandhu KS, Zheng M, Wang P, et al. Extensive Promoter-centered Chromatin Interactions Provide a Topological Basis for Transcription Regulation. Cell. 2012;148(1-2):84–98. pmid:22265404
  34. 34. Dempster AP, Laird NM, Rubin DB. Maximum Likelihood from Incomplete Data Via the EM Algorithm. Journal of the Royal Statistical Society: Series B (Methodological). 1977;39(1):1–22.
  35. 35. Taslim C, Chen Z, Huang K, Huang THM, Wang Q, Lin S. Integrated analysis identifies a class of androgen-responsive genes regulated by short combinatorial long-range mechanism facilitated by CTCF. Nucleic Acids Research. 2012;40(11):4754–4764. pmid:22344698
  36. 36. Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien; 2021. Available from: https://CRAN.R-project.org/package=e1071.