Heterozygous gene truncation delineates the human haploinsufficient genome

Sequencing projects have identified large numbers of rare stop-gain and frameshift variants in the human genome. As most of these are observed in the heterozygous state, they test a gene’s tolerance to haploinsufficiency and dominant loss of function. We analyzed the distribution of truncating variants across 16,260 protein coding autosomal genes in 11,546 individuals. We observed 39,893 truncating variants affecting 12,062 genes, which significantly differed from an expectation of 12,916 genes under a model of neutral de novo mutation (p< 10−4). Extrapolating this to increasing numbers of sequenced individuals, we estimate that 10.8% of human genes do not tolerate heterozygous truncating variants. An additional 10 to 15% of truncated genes may be rescued by incomplete penetrance or compensatory mutations, or because the truncating variants are of limited functional impact. The study of protein truncating variants delineates the essential genome and, more generally, identifies rare heterozygous variants as an unexplored source of diversity of phenotypic traits and diseases

To test whether there is a subset of genes that are intolerant to heterozygous truncation, we simulated a model of generation of neutral de novo PTVs for all genes (i.e. assuming viability of affected individuals). By randomly assigning 39,893 hypothetical stop-gain and frameshift variants to genes according to their de novo mutation rate 12 , we observed that 12,916 out of 16,260 genes (95% CI, 12,805-12,991) would be expected to carry at least one stop-gain or frameshift variant. The expected number of genes is significantly greater than the 12,062 truncated genes observed in the study dataset for the same number of PTVs (6.6% depletion, empirical p-value < 1x10 -4 ; Figure 1A). The depletion in number of observed truncated genes was greater when severe PTVs, i.e. those predicted to have the greatest functional impact 13 , were considered (n=10,340 vs. a neutral expectation of 11,821-11,978; 13.1% depletion p < 4 1x10 -4 ). This suggests that a measurable fraction of de novo heterozygous stop-gain and frameshift variants are highly deleterious and hence under strong purifying selection. Hereafter we denote that fraction as the haploinsufficient genome (f hi ).

Characteristics of genes comprising the haploinsufficient genome
We assessed the functional properties of the subset of genes that were not observed to carry PTVs (n=4,198), Table 1. These genes were highly conserved, had fewer paralogs, were more likely to be part of protein complexes and were more connected in protein-protein interaction networks than the rest of the genes. Furthermore, they had characteristics of essentiality and haploinsufficiency, and a higher probability of CRISPR-Cas9 editing compromising cell viability. The set of genes not carrying PTVs was enriched in OMIM genes annotated with 'haploinsufficient' or 'dominant negative' keywords. Non truncated genes were overrepresented in functional categories such as transcription regulation, developmental processes, cell cycle, and nucleic acid metabolism (Supplementary Table 1), in line with earlier characterization of haploinsufficient genes 14 . Together, these results indicate that a number of basic cellular functions depend on the integrity of coding and expression of both alleles of component genes.

Estimating the fraction of genes intolerant to heterozygous stop-gain and frameshift variants.
Genes without PTVs in our analysis may be truly part of the haploinsufficient genome or the result of insufficient sample size to detect rare events. Thus, we next sought to estimate the total haploinsufficient fraction (f hi ) of the genome in the full population by a modeling approach.
Assuming that a fraction f hi of genes do not carry de novo PTVs while the remaining genes do so according to their neutral mutation rates 12 , f hi can be estimated by fitting a model to the observed relative distribution of PTVs (relative to the rest of genes; Online Methods). This analysis estimates a fraction of the haploinsufficient genome of f hi =10.8% (95% CI=9.5-11.7%) of protein coding genes ( Figure 1A).
Some genes may tolerate PTVs because their functional effects are masked by incomplete penetrance 15 , by compensatory variants 16 , or because of a low functional impact of the truncation 13 . In addition, false positive errors in sequencing and variant calling procedures contribute to the distribution of observed variants [17][18][19] . We collectively treated these factors as noise, because they can lead to the observation of a truncated gene in a viable individual without 5 truly probing the general viability of carrying only one functional allele in a given gene.
Therefore, we extended our model to allow for the possibility of observing PTVs in the haploinsufficient fraction of the genome by introducing a second parameter representing the number of variants originating from biological noise (incomplete penetrance, compensatory variants and low impact truncation) or technical noise (sequencing or variant calling errors) in genes otherwise intolerant to truncation (Online Methods). Using these parameters, the estimated fraction of genes intolerant to PTVs increased to 24.4% (95% CI, 18.3-32.1%, Figure   1B).
An important consequence of biological and technical noise is that the apparently truncated fraction of genes does not saturate as a function of the number of observed PTVs, but keeps rising. Our model predicts that after having sequenced 40,000 exomes (representing a sample of approximately 90,000 PTVs) more than 50% of newly identified truncated genes will result from biological and technical noise (Supplementary Figure 1) -an important consideration for ongoing sequencing programs and interpretation of resources, such as that of the Exome Aggregation Consortium (ExAC, http://exac.broadinstitute.org). At the sample size of 40,000 exomes, and with 2 to 6% of all observed truncations due to technical errors 5,6,8 , 400 to 1025 genes intolerant to PTVs will exhibit truncations due to sequencing and variant calling errors. For the same sample size, 2345 to 2549 genes intolerant to PTVs will exhibit truncations due to incomplete penetrance, compensatory variants or low impact truncation.
We next assessed the robustness of these estimates using an alternative approach that models the expected number of PTVs as a function of the observed synonymous coding variants (Online Methods). This model assumes that, in the absence of deleterious consequences, the number of heterozygous PTVs correlates with the number of synonymous variants observed in a gene. This approach resulted in highly similar estimates of f hi (95% CI 19.7-34.1%) compared to the previous model. Leveraging the latter model, we identified 278 genes (Supplementary Table 2) that have higher than 0.99 posterior probability of being intolerant to heterozygous truncation ( Figure 2). However, there is a continuum of tolerance to heterozygous truncation as depicted in Figure 2, with a large number of genes harboring fewer heterozygous PTVs than expected. 6

Discussion
This work suggests that heterozygous protein truncating variants have greater functional consequences than generally considered. This concept is supported by the identification of a substantial proportion of genes that do not tolerate loss of one of the two gene copies, and by the evidence for a gradient of haploinsufficiency across a large proportion of the coding genome.
Heterozygous PTVs are rarely compensated at the gene expression level, as shown in our previous work 13  However, we show clear evidence that over 10% of the genes cannot be compensated, while an additional 10 to 15% of truncated genes may be rescued by incomplete penetrance or compensatory mutations, or because the truncating variants are of limited functional impact.
The importance of these variants has also been observed in model organisms. Studies in mice show that when homozygous knockout mutants are not viable, up to 71.7% of heterozygous PTVs have phenotypic consequences 20 . The systematic phenotyping of knockout mice also demonstrates that haploinsufficiency might be more common than generally suspected 21 .
However, a practical limitation of the above approaches, in particular in animal studies, is that observation of phenotypes resulting from damaging mutations may require exposure to specific triggers or environmental interactions 6,21 . In contrast, in humans, life-long exposures may eventually reveal a phenotypic trait or disease associated with heterozygous gene truncations 8 .
Here, clinical symptoms could be observed later in life, and present sporadically -not necessarily within a pedigree. This is illustrated by a recent report on the consequences of haploinsufficiency of cytotoxic T-lymphocyte-associated protein 4 gene (CTLA-4) presenting as undiagnosed or misdiagnosed sporadic autoimmune disorder in the second to fifth decades of life 22 . Despite the prevalence of rare heterozygous PTVs, there has been more attention to the occurrence of homozygous truncations (human knockouts). We argue that homozygous truncations result from high allele frequency variants that are less likely to carry functional consequences (the exception being recessive disorders in a population).
There are a number of possible limitations to the present study. In the modeling work, we analyzed rare variants (less than <1% allele frequency) to focus on de novo events and for consistency with the de novo mutation rates estimated by Samocha et al. 12 . Nevertheless our 7 estimates held true when the analysis was restricted to singleton variants, or when we analyzed all variants irrespective of allele frequency (Supplementary Figure 2). We did not have primary control on sequencing coverage for some of the exome sequence datasets that could result in ascertainment errors. To correct for this potential bias, we discarded genes where the observed number of synonymous mutations deviated from expectation. The intolerance of genes to de novo truncation was assessed across combined human populations. Therefore, estimations of the haploinsufficient genome account for the fraction of haploinsufficient genes common to all humans. Intolerance to heterozygous PTVs should be regarded as a different concept than gene sequence conservation. PTVs in a conserved gene might have a recessive mode of inheritance and are thus potentially observable in a viable individual. On the other extreme, positively selected genes could be haploinsufficient upon heterozygous truncation. These considerations notwithstanding, we consistently identified a quantifiable fraction of the human genome that is intolerant to heterozygous PTVs, with an estimated lower bound of 9.5%.
The prevalent nature of rare heterozygous PTVs suggests that a map of "essentiality" on the basis of dominant loss of function is within reach. The concept of the essential genome has been explored in analyses of minimal bacterial genomes 23 , mouse knockout studies 24 , studies of transposon or chemical mutagenesis 25 , and in studies that used CRISPR-Cas9 genome-editing technology 26,27 . Here, we propose that mapping the haploinsufficient genome will improve the understanding of the genetic architecture of diseases. In agreement with the recent work of Li et al., 6 we argue that the burden of rare human heterozygous variation is an unexplored source of diversity of phenotypic traits and diseases.

Materials and methods
Exomes. We collected exome data from public and non-public sources (Supplementary Table   3). We considered these individuals as representing the general population. Variants were filtered based on Hardy-Weinberg equilibrium (discarded if p <1x10 -8 ). For public data sets, variants were called at the data source with their respective pipelines. For non-public data sets, sequence reads were aligned using BWA, and called with Haplotypecaller using GATK 3.1. Variants were annotated with SnpEff 3.1 and filtered as described in [28][29][30] . Only transcripts from autosomal protein coding genes reliably annotated by the Consensus Coding Sequence (CCDS, Release 12 8 04/40/2013) project 11 that underwent the full process of CCDS curation ('Public' status in CCDS terminology, n=17,756) were considered. As a reference background throughout all analyses, a total number of 16,521 autosomal protein coding genes was obtained by considering genes with available de novo mutation rate from Samocha et al. 12 and with at least one synonymous, missense, stop-gain or frameshift variant detected in the exome data. We discarded genes where the observed number of synonymous mutations deviated from expectation (see below). For consistency with 12 , we only retained variants mapping within the limits of the reference transcript used to assess the de novo mutation rate per gene. Furthermore, only rare stop-gain and frameshift variants (allele frequency <1%) were considered to assess the deviation from neutral expectations. Throughout the study we considered each rare variant as a single de novo event of mutation, irrespective of the number of individuals in which it was observed. First we evaluated the relative distribution of PTVs across genes (hereafter the model A). This model assumes that genes tolerating heterozygous truncation will be found truncated in the population according to their relative probability of de novo mutation (relative to the rest of genes), while a fraction of genes will not be observed as truncated due to early lethality. Based on the relative distribution of observed PTVs, this approach avoids issues of systematic false negative errors, though is still subject to false positive calls. Alternatively, we assessed a second model (hereafter the model B) in which the absolute number of de novo PTVs in a gene is estimated from the probability of de novo PTVs and the absolute number of observed de novo synonymous coding variants in that gene.

Models
Model A is formulated as follows. In a neutral case we expect that the relative fraction of variants in a given gene is equal to !"#$% !, i.e. the relative distribution of observed variants 9 follows the distribution of de novo mutation rates. As some genes might harbor fewer or more mutations than the expectation, the relative model's expected variant count for a gene g is defined as: where V stands for the total number of observed truncating variants summed over all genes, and ! ! ! accounts for the gene specific deviances from the neutral case.
Assuming two classes of genes (named HI for haploinsufficient and HS for nonhaploinsufficient) with a class-specific ! ! we get the following expectations for a gene g: where ! !! is the fraction of the total number of genes that belongs to the HI class. This model distributes a fixed number of variants to all genes according to their de novo variation rates modulated by haploinsufficiency and the penetrance of haploinsufficient genes, and is equivalent to taking V samples from a multinomial distribution with ! weights.!
To formulate model B, we assume that the expected number of de novo synonymous mutations is where ! ! !"# !is the de novo rate of synonymous mutations in a gene g and M is a constant.
Following 12 we estimate M from the regression of the observed number of synonymous mutations (! ! !"# ) in a gene on ! ! !"# : To avoid genes with low coverage, we disregarded from the analysis those genes whose residual in the above regression is higher than 3 times the standard deviation of all residuals. We note that, in contrast to 12 we omit the intercept term in this regression, because we expect no variants in a gene for which ! ! !"# !equals zero.
Having estimated M, the expected number of PTVs in a gene g is given by: Introducing gene specific differences in the number of observed PTVs we write: where ! ! accounts for both gene specific differences and systematic errors. We do not estimate ! ! for each gene, but assume that genes can be classified into two groups (haploinsufficient and non-haploinsufficient), each having a distinct class specific value of !.
To estimate the fraction of genes intolerant to heterozygous PTVs we use the following mixture model. We define a random variable ! ! as the number of variants in gene g. A latent random variable ! ! can take two values: HI or HS and has the probability density distribution: where the parameter ! !! !represents the fraction of genes intolerant to heterozygous PTVs. The conditional probability distribution of ! ! !given ! ! is defined as: Marginalizing over the values of the latent variable z ! yields the probability density distribution of ! ! !as: The probability that a gene acquiring k variants is: Using the estimated parameters we calculate the posterior probability of haploinsufficiency for gene g as: where ! ! is the observed number of PTVs in the gene g.         Black curve: observed counts, red curve: prediction based on least-squares fit to the cumulative distribution function (see Online Methods), green curve: maximum likelihood estimate, blue curve: least squares fit to the accumulation curve of truncated genes as shown in Figure 1. The CDF method was chosen and maximum likelihood was discarded because its estimates did not fit the observations. Table&S1:"Enrichment"tests"results"against"Reactome"pathways"for"genes"without"PTVs."Only"significant"results"are"shown"as"judged"by"5%"FDR"calculated"using"the"BenjaminiDHochberg"procedure.