Association Tests Using Copy Number Profile Curves (CONCUR) Enhances Power in Rare Copy Number Variant Analysis

Copy number variants (CNVs) are gains or losses of DNA segments in the genome that can vary in dosage and length. CNVs comprise a large proportion of variation in human genomes and impact health conditions. To detect rare CNV associations, kernel-based methods have been shown to be a powerful tool because of their flexibility in modeling the aggregate CNV effects, their ability to capture effects from different CNV features, and their ability to accommodate effect heterogeneity. To perform a kernel association test, a CNV locus needs to be defined so that locus-specific effects can be retained during aggregation. However, CNV loci are arbitrarily defined, and different locus definitions can lead to different performance depending on the underlying effect patterns. In this work, we develop a new kernel-based test called CONCUR (i.e., Copy Number profile Curve-based association test) that is free from a definition of locus and evaluates CNV-phenotype association by comparing individuals’ copy number profiles across the genomic regions. CONCUR is built on the proposed concepts of “copy number profile curves” to describe the CNV profile of an individual, and the “common area under the curve (cAUC) kernel” to model the multi-feature CNV effects. Compared to existing methods, CONCUR captures the effects of CNV dosage and length, accounts for the continuous nature of copy number values, and accommodates between- and within-locus etiological heterogeneities without the need to define artificial CNV loci as required in current kernel methods. In a variety of simulation settings, CONCUR shows comparable or improved power over existing approaches. Real data analyses suggest that CONCUR is well powered to detect CNV effects in gene pathways associated with phenotypes using data from the Swedish Schizophrenia Study and the Taiwan Biobank.

Author summary

Copy number variants comprise a large proportion of variation in human genomes. Large rare CNVs, especially those disrupting genes or changing the dosages of genes, can carry relatively strong risks for neurodevelopmental and neuropsychiatric disorders. Kernel-based association methods have been developed for the analysis of rare CNVs and shown to be a valuable tool. Kernel methods model the collective effect of rare CNVs using flexible kernel functions that capture the characteristics of CNVs and measure CNV similarity of individual pairs. Typically, kernels are created by summarizing similarity within an artificially defined “CNV locus” and then collapsing across all loci. In this work, we propose a new kernel-based test, CONCUR, that is based on the CNV location information contained in standard processing of the variants and removes the need for any arbitrarily defined CNV loci. CONCUR quantifies similarity between individual pairs as the common area under their copy number profile curves and is designed to detect CNV dosage, length and dosage-length interaction effects. In simulation studies and real data analysis, we demonstrate the ability of the CONCUR test to detect CNV effects under diverse CNV architectures with greater power and robustness than existing methods.

Introduction

[...] the length and dosage features of CNVs. As with SNPs, the effects of CNVs can vary between loci, but CNV collapsing tests must also account for within-locus heterogeneity due to differential dosage effects or length effects within a CNV region.

Similar to SNP collapsing tests, there are also two families of tests for rare CNV analysis: burden-based methods and kernel-based methods. Burden-based tests, e.g., Raychaudhuri et al. [5], summarize the CNV features of an individual via the total CNV counts or average length and model the CNV effects as fixed effects assuming etiological homogeneity of features across multiple CNVs of a targeted region. Kernel-based tests, e.g., CCRET [6] and CKAT [7], aggregate CNV information via genetic similarity based on certain CNV features and model CNV effects as random effects to account for the between-locus etiological heterogeneity. By design, burden tests are optimal when the association signal is driven by homogeneous effects across CNVs, and kernel-based tests are optimal in the presence of etiological heterogeneity. Burden tests often need to subset CNVs by dosage (e.g., deletions only or duplications only) or size (e.g., > 100kb, > 500kb) to increase homogeneity while kernel-based tests do not have such requirements.

In this work, we focus on kernel-based methods because etiological heterogeneity is becoming a more practically encountered scenario as high-resolution CNV detection technologies permit the detection of CNVs with smaller length. In kernel-based association tests, the association between CNVs and the trait is evaluated by examining the correlation between trait similarity and CNV similarity quantified in a kernel.
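To make the burden-style summaries described above concrete, the following minimal Python sketch computes, for each individual, the total CNV count and the average CNV length. The input layout (tuples of individual ID, start, end, dosage) and the function name `burden_summaries` are illustrative assumptions, not the format used by any of the cited methods.

```python
from collections import defaultdict


def burden_summaries(cnv_calls):
    """Summarize CNV calls per individual as (CNV count, average length).

    cnv_calls: iterable of (individual_id, start_bp, end_bp, dosage) tuples.
    This layout is a hypothetical one chosen for illustration.
    """
    counts = defaultdict(int)
    total_length = defaultdict(int)
    for individual, start_bp, end_bp, _dosage in cnv_calls:
        counts[individual] += 1
        total_length[individual] += end_bp - start_bp
    # Average length = total length divided by the number of CNVs carried.
    return {ind: (counts[ind], total_length[ind] / counts[ind]) for ind in counts}


calls = [
    ("A", 100, 300, 1),    # deletion of length 200
    ("A", 1000, 1400, 3),  # duplication of length 400
    ("B", 150, 250, 3),    # duplication of length 100
]
summary = burden_summaries(calls)
print(summary)
```

Summaries of this kind enter a regression as fixed-effect covariates, which is why they work best when the CNVs being pooled have similar effects.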
For kernel construction, we can refer to kernel-based tests for SNPs; since SNPs are evaluated at the same single base-pair position (referred to as a locus) across individuals, it is natural to assess similarity locus-by-locus and aggregate the locus-level similarity over all loci in the target region to obtain an overall SNP similarity. A locus unit for CNVs, however, is not so obvious since CNVs span multiple base pairs and may overlap partially between individuals.

To address this issue, standard CNV kernel-based tests construct CNV regions (CNVRs). For example, the CNV Collapsing Random Effects Test (CCRET) [6] creates CNVRs by clustering CNV segments of different individuals with some arbitrary amount of overlap (e.g., 1 base pair overlap, 50% reciprocal overlap). With CNVRs, the CNV [...]

[...] and is powerful under heterogeneous signals and can adjust for confounders. In this analysis, we use simulation studies to demonstrate the improved power of CONCUR over existing kernel-based methods in a variety of settings and illustrate the practical utility of CONCUR by conducting pathway analysis on the Swedish Schizophrenia Study data and the Taiwan Biobank data.

[...] al. [6]. Briefly, the TwinGene study used a cross-sectional sampling design and included over 6,000 unrelated subjects born between 1911 and 1958 from the Swedish Twin Registry [9,10]. CNV calls were generated using the Illumina OmniExpress beadchip for 72,881 SNP markers and using PennCNV (version June 2011) [11] as the CNV calling algorithm with recommended model parameters. From the full callset, high quality rare CNVs (frequency < 1% and size > 100kb) were extracted to form the simulation pool
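The 50% reciprocal overlap criterion mentioned above for grouping CNV segments into CNVRs can be illustrated with a short Python sketch. It only checks whether two segments satisfy the criterion and does not reproduce the clustering procedure of CCRET or any other package. The function names and the half-open `(start, end)` convention are assumptions made for the example.

```python
def reciprocal_overlap(seg_a, seg_b):
    """Return the smaller of the two overlap fractions between two segments.

    Each segment is a (start_bp, end_bp) pair with end_bp > start_bp.
    """
    shared = max(0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    length_a = seg_a[1] - seg_a[0]
    length_b = seg_b[1] - seg_b[0]
    return min(shared / length_a, shared / length_b)


def meets_reciprocal_overlap(seg_a, seg_b, threshold=0.5):
    """True when each segment covers at least `threshold` of the other."""
    return reciprocal_overlap(seg_a, seg_b) >= threshold


# Two 200 bp segments sharing 100 bp: each covers half of the other.
print(meets_reciprocal_overlap((100, 300), (200, 400)))
# A short segment barely touching a long one fails the criterion.
print(meets_reciprocal_overlap((100, 300), (290, 1000)))
```

Changing `threshold` changes which segments are grouped, which illustrates why the resulting locus definition is arbitrary.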

A case-control phenotype was generated from the logistic model [...] where $Z_{ij}^{\bullet}$ is the $(i, j)$ entry of matrix $Z^{\bullet}$, $i = 1, \dots, N$ indexes individuals, and $j = 1, \dots, R$ indexes CNV segments. A binary covariate $X_i$ was simulated from [...] parameter was set to be 1.
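As a rough illustration of generating a case-control phenotype from a logistic model, the Python sketch below draws binary outcomes from a linear predictor built from a covariate and per-segment CNV indicators. The exact model, effect sizes, and covariate distribution used in the simulations are not reproduced here. The values of `intercept`, `beta_x`, and `beta_cnv`, and the way `Z` and `X` are drawn, are placeholder assumptions.

```python
import math
import random


def simulate_case_control(Z, X, beta_cnv, beta_x, intercept, rng):
    """Draw binary phenotypes from a logistic model (illustrative form only)."""
    phenotypes = []
    for z_row, x_i in zip(Z, X):
        # Linear predictor: intercept + covariate effect + summed segment effects.
        eta = intercept + beta_x * x_i + sum(b * z for b, z in zip(beta_cnv, z_row))
        prob_case = 1.0 / (1.0 + math.exp(-eta))
        phenotypes.append(1 if rng.random() < prob_case else 0)
    return phenotypes


rng = random.Random(2024)
N, R = 200, 5  # individuals and CNV segments (placeholder sizes)
Z = [[rng.choice([0, 0, 0, 1]) for _ in range(R)] for _ in range(N)]  # hypothetical carrier indicators
X = [rng.choice([0, 1]) for _ in range(N)]                            # hypothetical binary covariate
beta_cnv = [math.log(1.5)] * R                                        # hypothetical log odds ratios
Y = simulate_case_control(Z, X, beta_cnv, beta_x=1.0, intercept=-1.0, rng=rng)
print(sum(Y), "cases out of", N)
```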

We examined the methods' performance under two signals: in Scenario I under a dosage×length signal and in Scenario II under a dosage-only signal. We chose these signals to roughly replicate the simulation settings applied to assess CKAT in [7] (dosage×length signal) and to assess CCRET in [6] [...] Davies' method [13] as implemented in the CKAT R package.

Simulation Results

The type I error rates of the three tests were examined at nominal levels of 0.01, 0.05, and 0.1 (Table 1). All methods had type I error rates close to the nominal level.

Table 1. Type I error rates. Type I error rates of three CNV tests evaluated based on 5000 replications.

The proportion of causal deletion sites out of all deletions was 9.5%, and the corresponding proportion for duplications was 6.9%. In addition, the 100 causal duplication segments had higher median [...] and including between-locus heterogeneity due to the mixture of deleterious and protective segments, between-locus heterogeneity due to duplication and deletion causal segments, and within-locus heterogeneity due to duplications and deletions within a segment having opposite effects. We observed that CONCUR has the best power among the three tests across different settings, followed by CCRET and then by CKAT.
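The type I error evaluation described above amounts to counting how often a test rejects when no signal is present. A minimal Python sketch follows, assuming the p-values from null replicates are already available. Here they are replaced by uniform random draws as a stand-in for a well-calibrated test, so the numbers are not the paper's results.

```python
import random


def empirical_type1_error(p_values, alpha):
    """Fraction of null replicates whose p-value falls below alpha."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)


# Stand-in for p-values from 5000 null replicates: a calibrated test
# yields p-values that are uniform on (0, 1) under the null hypothesis.
rng = random.Random(0)
null_p_values = [rng.random() for _ in range(5000)]

rates = {alpha: empirical_type1_error(null_p_values, alpha) for alpha in (0.01, 0.05, 0.1)}
for alpha, rate in rates.items():
    print(f"nominal {alpha}: empirical {rate:.4f}")
```

With 5000 replications, the standard error of the empirical rate at the 0.05 level is about 0.003, which gives a sense of how close to nominal a calibrated test should land.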

Real data application

In real data applications, we first, as a proof of concept, applied the proposed CONCUR test on a previously analyzed CNV dataset from the Swedish Schizophrenia Study. We next conducted a CNV-triglyceride (TG) association analysis using CONCUR on data from the Taiwan Biobank.

CNV analysis on schizophrenia in the Swedish Schizophrenia Study

We conducted pathway-based CNV analysis on data from the Swedish Schizophrenia Study [14]. The Swedish Schizophrenia Study used a case-control sampling design. Genotyping was done in six batches using Affymetrix 5.0 (3.9% of the subjects), [...]

[...] Out of the 15 pathways, ten pathways were identified as significantly associated with TG by CONCUR, nine pathways by CKAT, and one pathway by CCRET (Table 3).

There were a total of 12 pathways found significant by one or more methods, among [...]

[...] Taking p-values < 0.05 as a suggestive "promising" association with TG, we did not observe any CNV associations when all CNVs were analyzed together, but for duplications only, there were promising differences in CNV length (p-value = 0.0063) and weaker differences in dosage (p-value = 0.0255) across TG levels. There was also weak significance in CNV length for deletions (p-value = 0.0423). We were careful not to over-interpret these "promising" associations since this stratified analysis reflected only marginal associations of a CNV feature, and the tests did not account for the effect heterogeneity that motivates the application of kernel-based methods. We also proceeded with testing using CONCUR on duplications and deletions separately, and [...]

To further explore the signal from duplications, we visualized CNVs in the 23 genes in hsa01040 (Fig 6). [...] ELOVL5, HSD17B4, and SCD5 (S1 Table). Notably, BAAT is an amino acid [...] these genes are likely to affect the production and metabolism of TG. For example, one study showed that hepatic steatosis was observed in ELOVL5-knockout mice due to the activation of SREBP-1c and its target genes [30]. HSD17B4 is a dehydrogenase, which is able to inhibit the production of DHEA [26]. A previous study showed that TG levels were inversely correlated with DHEA levels in men with type 2 diabetes [27], suggesting a potential link between CNVs in HSD17B4 and TG levels. SCD5 serves as a critical enzyme providing a double bond to construct complex lipid molecules such as TG [28,29], and thus dysregulation of SCD5 expression may impact TG levels.

[...] $X_i = (X_{i1}, \dots, X_{ir})^T$ be the $r$ covariates. Under the kernel machine regression framework, we model the association between phenotypes and CNVs as follows [...]
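For orientation, a generic kernel machine regression model consistent with the symbols used in this section (a link function $g$, covariates $X_i$, and a CNV effect function $h(\cdot)$ evaluated at the CNV profile $Z_i$) can be written as below. This is the textbook form and not necessarily the exact display intended here. The coefficient vector $\beta$ and the choice to absorb the intercept into $X_i$ are assumptions.

```latex
% Generic kernel machine regression (assumed textbook form):
% \mu_i is the conditional mean of the phenotype Y_i for individual i,
% \beta is an assumed fixed-effect coefficient vector for the covariates,
% h(\cdot) is the unknown function carrying the CNV effect.
\[
  g(\mu_i) = X_i^{T}\beta + h(Z_i), \qquad \mu_i = E(Y_i \mid X_i, Z_i), \qquad i = 1, \dots, n .
\]
```

Under this form, the absence of any CNV effect corresponds to $h(\cdot) = 0$, and for a case-control phenotype $g$ would typically be the logit link.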

We further use $q = 1, \dots, P_{ik}$ to index the CNV features $(DS_q, BP^1_q, BP^2_q)$ occurring on chromosome $k$ of individual $i$ for $k = 1, \dots, 22$. Then we construct duplication and deletion profile curves, respectively describing duplications and deletions on chromosome $k$ for individual $i$, as follows: [...] where $x$ is a location on the genome on the same scale as $BP^1_q$ and $BP^2_q$; $I$ is the [...] Appendix.
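To make the idea of profile curves and their common area more tangible, here is a minimal Python sketch for a single chromosome. It rests on stated assumptions rather than on the paper's exact definitions: each CNV is a `(dosage, bp1, bp2)` tuple, the curve height is the dosage deviation from the normal copy number of 2, intervals are treated as half-open, and similarity is the area under the pointwise minimum of two individuals' curves, summed over the duplication and deletion curves. All function names are hypothetical.

```python
def profile_curve(cnvs, kind):
    """Build a step function as a list of (start, end, height) pieces.

    kind is "dup" (dosage above 2) or "del" (dosage below 2); the height
    is the assumed deviation from the normal copy number of 2.
    """
    pieces = []
    for dosage, bp1, bp2 in cnvs:
        height = dosage - 2 if kind == "dup" else 2 - dosage
        if height > 0:
            pieces.append((bp1, bp2, height))
    return pieces


def curve_value(pieces, x):
    """Evaluate the step function at location x (half-open intervals)."""
    return sum(height for start, end, height in pieces if start <= x < end)


def common_area(pieces_i, pieces_j):
    """Area under the pointwise minimum of two step functions."""
    breakpoints = sorted({b for s, e, _ in pieces_i + pieces_j for b in (s, e)})
    area = 0.0
    for left, right in zip(breakpoints, breakpoints[1:]):
        # Both curves are constant between consecutive breakpoints.
        shared_height = min(curve_value(pieces_i, left), curve_value(pieces_j, left))
        area += shared_height * (right - left)
    return area


def cauc_similarity(cnvs_i, cnvs_j):
    """Sum the common areas of the duplication and deletion curves."""
    return sum(
        common_area(profile_curve(cnvs_i, kind), profile_curve(cnvs_j, kind))
        for kind in ("dup", "del")
    )


ind_a = [(3, 100, 300), (1, 500, 700)]  # one duplication, one deletion
ind_b = [(4, 200, 400), (1, 600, 650)]
print(cauc_similarity(ind_a, ind_b))  # shared dup area 100 + shared del area 50
```

Because the pointwise minimum depends on both overlap length and curve height, a similarity of this kind reflects dosage and length jointly without requiring predefined loci.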

The intuition of the cAUC kernel is to quantify similarity using the length of [...]

The association between phenotype and CNVs is examined by testing the hypothesis $H_0 : h(\cdot) = 0$. To do so, we define the vector of subject-specific CNV effects $H = (h(Z_1), \dots, h(Z_n))$ and treat $H$ as random effects which follow $N(0, \tau K)$, where $\tau \geq 0$ is a variance component and $K$ is an $n \times n$ kernel matrix with its $(i, j)$th entry [...] where $Y$ is an $n \times 1$ vector of responses; $\mu_0 = E(Y)$ under $H_0$; $\phi$ is a dispersion factor parameterizing the variance of $Y$; $\Delta \in \mathbb{R}^{n \times n}$ is a diagonal matrix with its $i$th diagonal element being $\delta_i = 1/g'(\mu_i)$; $W \in \mathbb{R}^{n \times n}$ is a diagonal weight matrix with its $i$th [...] al. [12] derived the corresponding small-sample distribution, which is used to calculate the p-value in this work.

S1 Table. Gene-level CONCUR tests on genes in pathway hsa01040.

S1 Appendix. Proof of symmetry and positive semi-definiteness of cAUC kernel.