Rare coding variants pinpoint genes that control human hematological traits

The identification of rare coding or splice site variants remains the most straightforward strategy to link genes with human phenotypes. Here, we analyzed the association between 137,086 rare (minor allele frequency (MAF) <1%) coding or splice site variants and 15 hematological traits in up to 308,572 participants. We found 56 such rare coding or splice site variants at P<5x10-8, including 31 that are associated with a blood-cell phenotype for the first time. All but one of these 31 new independent variants map to loci previously implicated in hematopoiesis by genome-wide association studies (GWAS). This includes a rare splice acceptor variant (rs146597587, MAF = 0.5%) in interleukin 33 (IL33) associated with reduced eosinophil count (P = 2.4x10-23), and lower risk of asthma (P = 2.6x10-7, odds ratio [95% confidence interval] = 0.56 [0.45–0.70]) and allergic rhinitis (P = 4.2x10-4, odds ratio = 0.55 [0.39–0.76]). The single new locus identified in our study is defined by a rare p.Arg172Gly missense variant (rs145535174, MAF = 0.05%) in plasminogen (PLG) associated with increased platelet count (P = 6.8x10-9), and decreased D-dimer concentration (P = 0.018) and platelet reactivity (P<0.03). Finally, our results indicate that searching for rare coding or splice site variants in very large sample sizes can help prioritize causal genes at many GWAS loci associated with complex human diseases and traits.

Introduction Although genome-wide association studies (GWAS) have identified thousands of associations with the number or distributional characteristics of red blood cells, white blood cells, and platelets [1], the genes responsible for this phenotypic variation remain elusive at most of these loci. For the most part, the difficulty in connecting association signals with genes resides in the fact that most GWAS variants are non-coding and in linkage disequilibrium (LD) with many other variants. This problem is not specific to hematological traits, but rather a general bottleneck in the functional characterization and clinical translation of most GWAS findings for common human diseases. Causal gene identification is important to shed light onto disease pathophysiology, but also to develop new, genetically-guided, therapeutic targets.
The discovery of rare variants that alter amino acid sequence or splicing is a strong indication that the mutated genes are involved in the phenotypes of interest. This approach is the cornerstone of Mendelian genetics. For complex human phenotypes, the discovery of rare coding variants associated with blood lipid levels [2,3], coronary artery disease [4,5], or more recently height [6] has helped prioritize causal genes and identify new drug targets (e.g. PCSK9 for atherosclerosis). For blood-cell related phenotypes, the identification of rare missense variants in the chemokine receptor gene CXCR2 [7] or the sphingosine-1 phosphate signalling gene S1PR4 [8] has highlighted the importance that these biological pathways play in neutrophil count variation in humans.
Although extremely useful tools to pinpoint causal genes, the discovery of rare variants of modest effect sizes have previously been limited because they require very large sample sizes. Here, we took advantage of genotype data at 137,086 rare (minor allele frequency (MAF) <1%) variants in 308,572 participants to find new genes implicated in human hematopoiesis. We identified 56 rare coding variants, including 31 new variants, which highlight specific genes that contribute to inter-individual variation of red blood cell (RBC), white blood cells (WBC), and platelet (PLT) traits.

new rare coding or splice site variants associated with blood-cell traits
In an effort to identify rare coding (defined in this study as missense or nonsense, excluding synonymous) or splice site variants that could implicate new genes in human hematopoiesis, we performed meta-analyses at 137,086 variants for 15 blood-cell traits in 308,572 participants (S1 Fig). These analyses found 81 missense or splice site variants at P<5x10 -8 with a minor allele frequency (MAF) <1% across the tested studies. We excluded 25 variants because they were not present in phase 1 of the Blood-Cell Consortium (BCX1) dataset (N = 2), the direction of effect was discordant across studies (N = 11), or they mapped to the HLA region (N = 15)(S1 Table). As positive controls to validate our approach, 25 rare coding or splice site variants previously associated with hematological traits were genome-wide significant in the current analyses (S2 Table) [1,[8][9][10][11]. In addition, we found 31 new rare coding or splice site variants in 29 different genes associated with blood-cell trait variation ( Table 1 and S3 and S4 Tables). The two variants in ATR are in strong linkage disequilibrium (LD), as are the two variants in EXOC3L4 (D 0 >0. 8).
All 31 variants are novel discoveries because they were not identified in previous large-scale association studies for blood-cell traits. However, although the variants are new, many of them map to known blood-cell trait loci. Indeed, using conditional analyses in the UK Biobank, we could sub-divide these 31 variants into three categories: 16 variants are in LD with markers previously reported to associate with hematological traits (P cond >0.002 in Table 1 and S5 and S6 Tables), 14 variants are not in LD with known blood-cell variants, but are located within 500 kb of one of these previously reported variants, and one variant (PLG-rs145535174) defines a new locus associated with PLT count variation in humans.
A rare missense variant in plasminogen associates with platelet count, D-dimer concentration, and platelet reactivity Plasminogen (PLG) encodes the precursor of plasmin, an important proteolytic enzyme that degrades many plasma proteins and fibrin blood clots. The rare G-allele at this new PLG variant (rs145535174, MAF = 0.05%) is associated with increased PLT count (+20x10 9 platelets/l, P = 6.8x10 -9 ). It replaces an arginine residue at position 172 into a glycine, an amino acid change that is predicted to be probably damaging by Polyphen and deleterious by SIFT [12,13]. This genomic position is highly conserved, as evidenced by a GERP score >200 and a CADD score of 23.8, ranking this particular amino acid change in PLG among the top 1% of the most deleterious substitutions in the human genome [14].
Given the role of plasmin in hemostasis, we tested the association between rs145535174 and several hemostatic and coagulation factors in the FHS. Although we only found a maximum of 16 carriers of the rare allele (varying depending on overlapping phenotypes), we could detect nominal associations between rs145535174-G and decreased D-dimer concentration and platelet reactivity ( Table 2). Because D-dimer is a by-product of fibrin clot degradation by plasmin, this result is consistent with rs145535174-G being a PLG loss-of-function variant. Since plasminogen/plasmin activity also plays a role in thrombotic diseases, we tested if the Gallele at PLG-rs145535174 increases the risk of myocardial infarction (MI), stroke, or venous thromboembolism (VTE) in the UK Biobank (S7 Table). This variant was not associated with MI or stroke risk (P>0.05). However, there was an association with increased VTE risk in the UK Biobank (P = 0.0087, odds ratio = 4.01), although it was not replicated in the Montreal Heart Institute Biobank and the Women's Health Initiative (P>0.05, S7 Table). The absence Rare coding variants impact blood-cell traits of genetic associations between PLG-rs145535174 and thrombotic events could simply reflect our limited statistical power given the rarity of the G-allele.

Blood-cell trait-associated rare variants prioritize genes at GWAS loci
When we considered genes with known hematopoietic functions, we found little evidence that the rare coding or splice site variants found here implicated different potentially causal genes than candidate genes from previous blood-cell trait GWAS (S8 Table). In other words, we have, with one exception, no loci with two strong blood-cell trait candidate genes. On chromosome 11, we found an association between mean platelet volume (MPV) and a rare coding variant in SIRT3, a gene implicated in platelet biology [15]. At the same locus, Gieger et al. had nominated PSMD13 as a candidate platelet gene on the basis of reduction in plasmatocyte numbers in Drosophila [16]. We also noted few examples where rare coding or splice site variants implicated the same genes than those prioritized by GWAS SNPs due to expression quantitative trait loci effects (e.g. SLC12A7, FLT3, TMPRSS6)(S8 Table).
Conditional analyses identified 16 rare variants in LD with previously reported blood-cell variants (S5 and S6 Tables). Despite this dependence due to LD, at some of these loci, we may have identified a rare variant with even greater biologic plausibility for the blood-cell trait association than the previously identified LD surrogate. Indeed, whereas all 16 variants from our study are coding or splice site variants, most LD surrogate variants are non-coding variants (S6 Table). To clarify whether the new ( Table 1) or previously known [1] blood-cell trait variants are more likely causal variants, we performed the reciprocal conditional analyses in which we conditioned the previously known blood-cell trait variants on genotypes at the newly described 31 rare coding or splice site variants (S9 Table). Comparison of the conditional Table 2. Association results between PLG-rs145535174 and hemostatic and coagulation factors in the Framingham Heart Study. The direction of the effects (BETA) is given for the rare G-allele (on forward strand). SE, standard error; ADP, Adenosine diphosphate; PAI-1, plasminogen activator inhibitor type 1; tPA, tissue plasminogen activator; FVII, factor VII; vWF, von Willebrand factor. We highlight in bold nominal P-values <0.05.

Trait
Sample  HIP1R). There were also a few loci where the new rare coding or splice site variants were statistically equivalent to rare non-coding variants previously reported (e.g. IL33, FLT3, BORA). We did not find examples of rare variants completely explaining the association signals at known common variants, as we would expect given limited LD.
On chromosome 1, we found a rare missense variant in CSF3R (rs148916169, MAF = 0.7%) associated with total WBC and neutrophil counts. Although this variant is correlated with common and low-frequency intronic variants (S6 Table), it is the first missense variant to directly implicate CSF3R in WBC count variation in the general population. CSF3R encodes the receptor for CSF3, a cytokine that controls the proliferation and differentiation of granulocytes, and is also mutated in a severe form of congenital neutropenia (MIM# 617014) [17]. Although rare loss-of-function mutations in CSF3R are associated with reduced neutrophil levels, the rare CSF3R missense variant found in our experiment might represent a gain-of-function since its rare allele is associated with higher WBC and neutrophil counts.
Another compelling example is the identification of a rare splice acceptor variant in IL33 (rs146597587, MAF = 0.5%) associated with reduced eosinophil count ( Table 1). This likely functional IL33 variant maps to a locus with three non-coding variants, including a rare intergenic variant, associated with WBC traits (S6 Table) [1]. Interestingly, GWAS have associated IL33 common non-coding variants with risk for asthma [18], allergic rhinitis [19], and endometriosis [20]. In the UK Biobank, we found that the rare C-allele at IL33-rs146597587 associated with reduced eosinophil count is also associated with lower risk of asthma (P = 2.60x10 -7 , odds ratio = 0.56) and allergic rhinitis (P = 4.21x10 -4 , odds ratio = 0.55) ( Table 3). These associations remained significant after accounting for the known GWAS variants previously identified at this locus ( Table 3). Our finding strongly reinforces the clinical importance that eosinophils play in the pathophysiology of these diseases, and highlight the IL33 pathway as a promising therapeutic target. However, in a mediation analysis in the UK Biobank in which we corrected for eosinophil count, we found that the protective effect of IL33-rs146597587 was weaker but remained significant (P mediation = 3.4x10 -5 ). This suggests that IL33 may influence asthma risk in part by controlling blood eosinophil count, but also through currently unknown additional biological pathways. Finally, we note that the same rare IL33 variant was recently shown to associate with reduced eosinophil count and protection from asthma in the Icelandic population [21]. Because there is no sample overlap, our study represents a strong replication of this original report: in a meta-analysis of both studies, association results now reach Table 3. Association between IL33-rs146597587 and asthma, allergic rhinitis, or endometriosis risk in the UK Biobank. All analyses were corrected for age, sex, and the first 10 principal components. The direction of the odds ratio is for the rare allele (rs146597587, C-allele on forward strand). The conditional P-values were calculated in a logistic model accounting for genotypes at common single nucleotide polymorphisms (SNPs) previously associated with asthma, allergic rhinitis, or endometriosis: rs343496, rs7032572, rs72699186, rs1342326, rs2381416, rs928413, rs10975519. Rare coding variants impact blood-cell traits genome-wide significance (in 24,030 asthma cases and 421,096 controls, P = 1.4x10 -10 , odds ratio = 0.54). We found 14 rare coding or splice site variants that appear to be associated with blood-cell traits independently of other common and/or rare variants previously identified ( Table 1 and  S6 Table). Although there are a few well-known blood-cell-related genes, such as SLC12A7 and TMPRSS6, most of the highlighted genes do not have assigned functions in hematopoiesis (e.g. MFSD2B, SDPR, C9orf66, SUSD1, EXOC3L4)( Table 1). We found two correlated rare missense variants in ATR (rs28910273, MAF = 0.6%; rs77208665, MAF = 0.7%), which encodes a checkpoint protein that coordinates cellular responses to replication stress and DNA damage. ATR is mutated in Seckel syndrome 1 (MIM #210600), which is characterized by hematological defects in some patients [22]. We also uncovered a rare splice acceptor variant in the apolipoprotein gene APOC3 (rs138326449, MAF = 0.2%) associated with RDW, a RBC trait that is considered a non-specific inflammatory marker and a predictor of cardiovascular diseases [23]. The same APOC3 variant has previously been associated with lower triglyceride levels and coronary artery disease risk [5,24].

Biological pathway analyses with blood-cell trait genes
We performed three biological pathway-enrichment analyses using 58 genes that harbour a rare coding or splice site variants associated with RBC, WBC, or PLT traits ( Table 1 and S2  Table). In total, we found six, 13, and five biological pathways or terms enriched for genes with variants associated with RBC, WBC, and PLT, respectively, at a false discovery rate <10% ( Table 4). Most of these pathways were highly redundant in terms of gene content. For instance, five of the six RBC-enriched pathways are involved in iron homeostasis. For WBC traits, there were many significant inter-connected biological pathways that implicated six genes: CSF3R, CXCR2, S1PR4, IL17RA, AMICA1/JAML, and FLT3 ( Table 4). These genes highlight the importance that genetic variation plays on cytokine signalling and chemotaxis to modulate circulating numbers of immune cells in the blood.

Discussion
Using a large sample of >300,000 participants, we identified 31 new missense or splice site variants associated with blood-cell traits. These include a rare p.Arg172Gly missense of plasminogen associated with higher platelet count, lower D-dimer concentration and lower platelet reactivity and a rare splice acceptor variant of IL33 associated with lower eosinophil count and lower risk of asthma and allergic rhinitis. At other genomic loci, our findings may prioritize the causal gene(s) that can be pursued with molecular techniques to further dissect and define the mechanism of association. Because phenotypic variance explained is determined by the effect size and allele frequency of the variants, the rare coding and splice site variants identified here do not have a large contribution to the heritability of blood-cell traits. For instance, the rare IL33-rs146597587 variant explains only 0.06% of the variation in eosinophil count.
The chromosome 2p23 region spans several genes and contains six common, non-coding variants associated previously with WBC or RBC traits [1]. We identified a new rare coding variant in this region associated with lower MCV located in MFSD2B, a transporter gene of unknown function that is highly and specifically expressed in blood and bone marrow cells, particularly of the erythroid lineage [25]. As another example, at the LY75-CD302 locus, four common non-coding variants and a low-frequency missense variant have been associated with PLT and WBC traits [10]. A second novel rare LY75-CD302 missense variant was associated with higher PLT count. The LY75-CD302 locus represents a naturally occurring read-through transcript between the lymphocyte antigen 75 (LY75) and CD302 genes, and is expressed in leukocyte and hematopoietic stem and progenitor cells [25]. Alternative splicing results in fusion LY75-CD302 gene products that are expressed during dendritic cell maturation [26] Table 4. Pathway analyses using DAVID with genes that carry rare coding or splice site variants associated with blood-cell traits. We only present biological pathways with a false discovery rate (FDR) <10%. Rare coding variants impact blood-cell traits and Hodgkin's lymphoma cell lines [27]. Another example of a potential read-through transcript involves a gene-rich region on chromosome 15, where several common variants were recently associated with WBC traits [1]. At this locus, we discovered a rare coding variant (rs184575290) located near the ST20 termination codon associated with higher monocyte count. This locus represents a naturally occurring read-through transcript that produces a fusion protein between ST20 and the neighboring MTHFS gene. ST20 is highly expressed in myeloid cells, while the fusion product is expressed at lower levels in blood neutrophils and dendritic cells of the skin [25]. We found a rare missense variant within SDPR, a phosphatidylserine-binding protein originally isolated from human platelets [28], that may be involved in modulating activation of the platelet protein kinase C substrate pleckstrin and mediating the downstream effects of platelet granule secretion [29]. Common missense variants of EXOC3L4, which has no known function, have been previously associated with both platelet and liver enzyme traits [30]. In progenitor blood-cell expression data from the BLUEPRINT Project [31], there is a large increase in EXOC3L4 expression in megakaryocyte progenitors. Several rare, non-coding variants at the chromosome 13q21 region have been associated with RBC traits [1]. We identified a rare coding variant of BORA, an activator of the protein kinase Aurora A involved in centrosome maturation and spindle assembly during mitosis and whose expression pattern during hematopoiesis increases in RBC and PLT precursors [31].

Red blood cell
Matriptase-2, encoded by TMPRSS6, has recently emerged through both complex trait and Mendelian disorders as an important genetic regulator of RBC and iron metabolism [32]. We report here the first rare coding variant of TMPRSS6 (p.V280L/p.V289L) associated with RBC traits in individuals unselected for hematologic disease. Other TMPRSS6 coding variants have been reported in patients with familial iron-refractory iron-deficiency anemia (IRIDA), a rare autosomal-recessive disorder characterized by hypochromic microcytic anemia and impaired iron balance [33].
Plasminogen plays an important role in fibrinolysis as well as wound healing, cell migration, and tissue re-modeling. Congenital plasminogen deficiency is a rare autosomal recessive disorder [34]. Severely affected individuals (type I plasminogen deficiency) can exhibit defective extracellular fibrin clearance during wound healing, leading to "ligneous conjunctivitis'' or thick, pseudomembranes on conjunctival and other mucosal surfaces. The p.Arg172Gly variant of plasminogen associated with higher PLT count is located in the first kringle domain, which mediates binding of plasminogen to fibrin and cell surfaces [35]. The observed association of the PLG p.Arg172Gly variant with lower D-dimer is consistent with reduced fibrinolysis, which might result from reduced circulating plasminogen, plasmin activity, or substrate binding. Overall, we did not observe an association between PLG p.Arg172Gly and risk of thrombotic disease. Similarly, mutations associated with congenital plasminogen deficiency do not appear to be a strong risk factor for VTE [36][37][38].
The reasons for the association of the PLG p.Arg172Gly variant with higher PLT count and lower platelet reactivity are not readily apparent. In vitro, plasmin has been reported to proteolytically inactivate thrombopoietin [39]. Plasmin also is capable of activating platelets through protease-activated receptors (PAR)-1 and -4 [40,41]. Thus, it is possible that the PLG loss-offunction variant may influence thrombopoietin-induced platelet production, or PAR-induced platelet aggregation, respectively. Finally, it is worth noting that moderately reduced platelet count is a feature of a congenital platelet disorder characterized by a gain-of-function fibrinolysis defect due to increased expression and storage of urokinase plasminogen activator (PLAU) during megakaryocyte differentiation [42].
In conclusion, focusing on 137,086 rare coding or splice site variants in 308,572 participants, we discovered 31 new variants associated with blood-cell traits. Thirty of these 31 variants map to loci previously implicated by GWAS for hematological phenotypes. Because we used the ExomeChip or the UK Biobank array, there is an enrichment of GWAS signals among the loci tested, although many of the loci identified here did not harbour blood-cell trait genetic associations at the time the arrays were designed. As discussed, many of these associations prioritize strong candidate genes at these loci. Our study was limited by the content of the genotyping arrays utilized. As we expand our genetic experiments to complete genome sequence in very large cohorts, we anticipate that we will uncover more rare coding variants and, maybe more importantly for gene identification within GWAS loci, additional independent variants in the same candidate genes. Together with a recent report for adult height [6], our findings reinforce the importance that rare variants play in the architecture of complex human phenotypes.

Ethical statement
Written, informed consent was obtained for all participants in accordance with the Declaration of Helsinki. This project was also reviewed and approved by the Montreal Heart Institute Ethics Committee and the different recruiting centers (approval number 2014-1707).

Statistical analyses
Single variant association results considered in this effort were obtained from participants of European ancestry using an additive genetic model. All phenotypes were corrected for confounding factors (see below) and normalized using inverse normal transformation. The final analyses included samples from BCX1 (up to 157,622 participants), AIRWAVE (up to 14,866), and the first release of the UK Biobank (up to 136,084). Genotypes for BCX1 and AIRWAVE were obtained from the Illumina ExomeChip array; the content is available at: http://genome. sph.umich.edu/wiki/Exome_Chip_Design. The ExomeChip includes~250,000 exonic variants obtained from whole-exome sequencing of~12,000 participants. The UK Biobank samples were genotyped on the UK Biobank Axiom array; the content of the arrays is available at: http://www.ukbiobank.ac.uk/scientists-3/uk-biobank-axiom-array/. Whereas the Axiom array targets >800,000 variants, we only analyzed exonic variants that overlap with the content of the ExomeChip.
Phenotypic information, and ExomeChip quality-control steps and association results from cohorts involved in the BCX1 Consortium and AIRWAVE have been described elsewhere [10,11,43]. Briefly, we excluded variants with low genotyping success rate (<95%, except for WHI that used a cutoff <90%). Samples with call rate <95% after joint or zCALL calling and with outlying heterozygosity rate were also excluded. Other exclusions were deviation from Hardy-Weinberg equilibrium (P<1x10 -6 ) and gender mismatch. We performed principal component analysis (PCA) or multidimensional scaling (MDS) and excluded sample outliers from the resulting plots, using populations from the 1000 Genomes Project to anchor these analyses. Prior to performing meta-analyses of the association results provided by each participating study, we ran the EasyQC protocol [44] to check for population allele frequency deviations and proper trait transformation in each cohort. In terms of hematological phenotypes, we excluded individuals with blood cancer, leukemia, lymphoma, bone marrow transplant, congenital or hereditary anemia, HIV, end-stage kidney disease, dialysis, splenectomy, and cirrhosis, and those with extreme measurements of RBC traits. We also excluded individuals on erythropoietin treatment or chemotherapy. Additionally, we excluded pregnant women and individuals with acute medical illness at the time the complete blood count (CBC) was done. For all traits, within each study, we adjusted for age, age-squared, gender, the first 10 principal components, and, where applicable, other study-specific covariates such as study center using a linear regression model. Within each study, we then applied inverse normal transformation on the residuals and tested the phenotypes for association with the ExomeChip variants using either RVtests (version 20140416) [45] or RAREMETALWORKER.0.4.9 [46].
A description of methods and quality-control procedures for the blood-cell data for the first release of the UK Biobank can also be found elsewhere [1]. Description of the exome-component of the genotyping arrays used for the UK Biobank samples can be found at: http://www. ukbiobank.ac.uk/scientists-3/uk-biobank-axiom-array/. In the UK Biobank, we modelled blood-cell traits using the following covariates: age, sex, menopause status for women, assessment centre where the blood samples were collected, machine counter that processed the blood samples, month, day of the week, time inside the day that the samples were collected, self-reported ethnic background of the individuals, smoking status and smoking frequency, alcohol drinker status, alcohol intake frequency, height and weight. In the first release of the UK Biobank, we tested the association between directly genotyped or imputed variants and normalized hematological traits with BOLT-LMM [47].
We meta-analyzed results from BCX1, AIRWAVE and the UK Biobank using inverse-variance weighting as implemented in METAL [48]. We limited our analyses to variants with a mean minor allele frequency (MAF) <1% in the meta-analyses that are annotated as coding (missense or nonsense) or splice site (acceptor or donor) using ENSEMBL's Variant Effect Predictor (VEP). Furthermore, the variants had to be present on the Illumina exome array used by the BCX1 studies. In total, we tested 137,086 variants against 15 blood-cell traits. These phenotypes are divided between the main three cell types found in blood: red blood cells (red blood cell count (RBC count), hemoglobin concentration (HGB), hematocrit (HCT), mean corpuscular hemoglobin (MCH), mean corpuscular volume (MCV), mean corpuscular hemoglobin concentration (MCHC), and RBC distribution width (RDW)), white blood cells (total white blood cell count (WBC count), neutrophil count (Neutro), lymphocyte count (Lympho), monocyte count (Mono), basophil count (Baso), and eosinophil count (Eosin)), and platelets (platelet count (PLT count) and mean platelet volume (MPV)). The meta-analysis results are available at: http://www.mhi-humangenetics.org/en/resources. We used a conservative α = 5x10 -8 to declare statistical significance. At this statistical threshold, our sample size (N = 308,572) provides >99% power to discover variants with MAF = 0.1% that explain >0.03% of the phenotypic variance.

Conditional analyses
To test whether the 31 rare variants newly identified in the meta-analyses are associated with blood-cell traits independently of other known genetic loci, we regressed out the effect of the known blood-cell variants previously identified in the first release of the UK Biobank [1] from the normalized blood-cell phenotypes. All these analyses were done per phenotype; that is we fitted 15 different models for each of the 15 blood-cell phenotypes tested. We then re-tested in the UK Biobank (using linear regression implemented in R) the association between the "residual" blood-cell phenotypes and genotypes at the rare variants identified in the meta-analyses. For instance, for hemoglobin, we conditioned the new rare coding or splice site hemoglobin variants on all variants (across the genome) previously reported to associate with hemoglobin levels. We provide the complete list of variants on which we conditioned in S5 Table per blood-cell trait. If the rare variants were not significantly associated with the residual phenotypes, we then ran pairwise conditional analyses to identify which previously known variants at the locus accounted for the association signal identified in this project. To declare statistical independence from previously reported hematological trait association signals, we used α = 0.002 (Bonferroni correction for 31 variants tested).

Secondary analyses for the rare PLG-rs145535174 variant
We sought association of the PLG variant (rs145535174) with platelet reactivity, as well as hemostatic and coagulation factors in the Framingham Heart Study (FHS) [49]. Genotyping for rs145535174 was conducted using the Illumina Human Exome BeadChip v.1.0 (Illumina, Inc., San Diego, CA) [50]. Multiple measurements were assessed for platelet reactivity [51], including maximum percentage platelet aggregation in response to agonists, i.e. ADP and epinephrine; minimal concentration of each agonist to produce a >50% aggregation response (EC50); and lag time in response to collagen stimulation. Hemostatic factors and coagulation factors, including antigens of plasminogen activator inhibitor-1 (PAI-1), tissue plasminogen activator (tPA), D-dimer, clotting factor VII (FVII) and von Willebrand factor (VWF), were measured using enzyme-linked immunosorbent assays [52][53][54] while fibrinogen levels were assessed using the Clauss method [52]. Association analyses in the FHS were conducted using either RAREMETALWORKER (http://genome.sph.umich.edu/wiki/ RAREMETALWORKER) or seqMeta (http://cran.r-project.org/web/packages/seqMeta/index. html). A linear mixed effects model that accounts for familial correlation was used and adjustments were made for age, sex and principal components. The phenotypes were log-transformed or inverse normal transformed, as needed.
We tested the association between the rare G-allele at PLG-rs145535174 and myocardial infarction (MI) and stroke in the UK Biobank. We used the "Health and Medical History" records to identify MI and stroke cases. For MI, we used the search terms "heart attack", "myocardial infarction", "acute myocardial infarction", "subsequent myocardial infarction" and "old myocardial infarction" to retrieve affected individuals. We used the terms "stroke", "ischaemic stroke" and "cerebral infarction" to define stroke cases. As controls, we excluded UK Biobank participants with MI, stroke or transient ischemic attack, percutaneous coronary intervention, coronary artery bypass graft surgery, peripheral vascular disease, congestive heart failure, and angina. For analysis of venous thromboembolism (VTE), we identified cases in the UK Biobank, the Montreal Heart Institute Biobank, and the Women's Health Initiative as individuals with pulmonary embolism or deep vein thrombosis. We tested the genetic association by logistic regression in PLINK or R, correcting for age, sex and the first ten principal components when available.

Secondary analyses for the rare IL33-rs146597587 variant
We identified asthma, allergic rhinitis (hay fever), and endometriosis cases using the detailed "Health and Medical History" UK Biobank participant records. All other individuals were assigned as controls. We tested the genetic association by logistic regression in R, correcting for age, sex and the first ten principal components. To determine if the rare IL33-rs146597587 variant is independent from the previously reported common SNPs at the locus, we conditioned on genotypes at these variants (rs343496, rs7032572, rs72699186, rs1342326, rs2381416, rs928413, rs10975519) and re-run the logistic regression model.

Pathway analyses
We used the default parameters in DAVID [55] to perform biological term and pathway enrichment analyses. For these bioinformatic analyses, we used as reference set all genes with at least one rare coding or splice site variant tested in the meta-analyses. We retrieved genes associated with RBC, WBC, or PLT traits from Table 1 (new independent variants from our study) and S2 Table (known positive controls), and tested their enrichment in biological pathways in comparison with the reference set. Due to the relatively low number of genes that were used as input for this kind of analysis, we lowered the initial, minimum number of genes in a seeding group to 3 (default = 4) to ensure that the clustering algorithm will include as many genes as possible into functional groups. All other parameters were left at their default values.  Table. Study-level association results for the 31 novel rare or splice site variants. The direction of the BETA is for the A2 allele. There are slight differences between the UK Biobank association results presented here and those in Table 1 because of the model used for conditional analyses in Table 1. Table. Functional annotation of the 31 new independent rare (minor allele frequency <1%) variants identified in this study that are associated with hematological traits. (XLSX) S5 Table. List of variants for condtional analyses. For each blood-cell trait, we conditioned the association of the novel rare or coding variants on genotypes at these known blood-cell trait variants. (XLSX) S6 Table. Conditional results in the UK Biobank (UKBB). When P cond >0.002 (Bonferroni correction for 31 variants tested), we performed pairwise conditional analyses with markers at the locus to identify the variant that account for the association signal. This variant is listed in the Tagging variant column, along with its minor allele frequency and functional annotation. MPV, mean platelet volume; MCH, mean corpuscular hemoglobin; RDW, red blood cell distribution width; PLT, platelet count; WBC, white blood cell count; MCHC, mean corpuscular hemoglobin concentration; Mono, monocyte; Neutro, neutrophil; HGB, hemoglobin; MCV, mean corpuscular volume; Eosin, eosinophil; RBC, red blood cell count; HCT, hematocrit. (DOCX) S7 Table. Association between PLG-rs145535174 and thrombotic events. All analyses were corrected for age and sex. UKBB, UK Biobank; MHIBB, Montreal Heart Institute Biobank; WHI, Women's Health Initiative. The direction of the odds ratios is for the rare G-allele at rs145535174 (allele A2). (DOCX) S8 Table. Prioritization of candidate genes. For each of the 31 novel rare coding or splice site variants presented in this study, we queried the corresponding loci in previous GWAS of blood-cell traits and highlighted previously prioritized candidate genes based on functional annotation (missense, splice site) or regulatory (eQTL) effect in the GTEx database. (DOCX) S9