Genetic factors define CPO and CLO subtypes of nonsyndromicorofacial cleft

Nonsyndromic orofacial cleft (NSOFC) is a severe birth defect that occurs early in embryonic development and includes the subtypes cleft palate only (CPO), cleft lip only (CLO) and cleft lip with cleft palate (CLP). Given a lack of specific genetic factor analysis for CPO and CLO, the present study aimed to dissect the landscape of genetic factors underlying the pathogenesis of these two subtypes using 6,986 cases and 10,165 controls. By combining a genome-wide association study (GWAS) for specific subtypes of CPO and CLO, as well as functional gene network and ontology pathway analysis, we identified 18 genes/loci that surpassed genome-wide significance (P < 5 × 10−8) responsible for NSOFC, including nine for CPO, seven for CLO, two for both conditions and four that contribute to the CLP subtype. Among these 18 genes/loci, 14 are novel and identified in this study and 12 contain developmental transcription factors (TFs), suggesting that TFs are the key factors for the pathogenesis of NSOFC subtypes. Interestingly, we observed an opposite effect of the genetic variants in the IRF6 gene for CPO and CLO. Moreover, the gene expression dosage effect of IRF6 with two different alleles at the same single-nucleotide polymorphism (SNP) plays important roles in driving CPO or CLO. In addition, PAX9 is a key TF for CPO. Our findings define subtypes of NSOFC using genetic factors and their functional ontologies and provide a clue to improve their diagnosis and treatment in the future.


Introduction
Cleft lip and cleft palate are orofacial disruptions of the normal facial structure that can cause problems with feeding, speaking, hearing and social integration among affected individuals [1,2]. Estimates have suggested that orofacial clefts occur in approximately 1 in 700 live births worldwide [2][3][4]. The majority of orofacial clefts lack additional defects in other tissues and are categorized as nonsyndromic cleft lip with or without cleft palate (CL/P) [5], which accounts for 70% of all orofacial clefts cases. CL/P cases include cleft palate only (CPO), cleft lip only (CLO) and cleft lip with cleft palate (CLP) [6]. Asian and Native American ancestry populations generally exhibit the highest birth prevalence rates for nonsyndromic orofacial cleft (NSOFC), whereas European ancestry populations have intermediate prevalence rates, and African ancestry populations have the lowest prevalence rates [7]. The overall prevalence of NSOFC in China is 1.67 per 1,000 newborns, with rates for CPO (2.7), CLO (5.6) and CLP (8.2 per 10,000 newborns) [8].
Both genetic factors and environmental risk factors contribute to the pathogenesis of NSOFC [9]. However, it has been difficult to identify specific etiologic factors for this disorder because the defects arise during early embryological development and because recurrence is both fairly common and unpredictable [1]. Moreover, because cleft lip and cleft palate are highly genetically heterogeneous [10], it is crucial to understand the genetic contributions of facial development in order to improve the clinical care of affected individuals. Genome-wide association studies (GWAS) have led to the discovery of at least 43 genes/loci associated with NSOFC [11][12][13][14][15][16][17][18][19][20], with genetic variants in the region of the IRF6 gene showing the strongest association with nonsyndromic CL/P among different populations [11,16,[21][22][23]. Given that the lip and primary palate have distinct developmental origins from the secondary palate, and that CPO, CLO and CLP have different phenotypes, it seems reasonable to hypothesize that these disorders might also harbor different genetic etiologies. However, most previous genetic studies of NSOFC used mixed samples of different subtypes (CL/P) rather than analyzing CPO, CLO or CLP separately. Only recently, a CLP GWAS identified 14 novel loci and suggested that the CPO, CLO and CLP subtypes harbor different genetic etiologies [24]. Therefore, the genetic association signals for specific CPO or CLO subtypes may have been missed in previous studies because the true signals for the specific subtypes could have been diluted by other subtypes.
On the other hand, the typical GWAS method has limited power to identify the associated genes with disease. For example, some genes that could be genuinely associated with disease status might not reach a stringent genome-wide significance threshold via typical GWAS [25]. The low signal-to-noise ratio, inherent in the majority of large datasets, presents a major difficulty in the analysis of complex biological systems [26]. Therefore, we combined the typical GWAS method with the gene network and ontology analysis methods to explore the genetic contributions to each NSOFC subtype.

Typical GWAS and replications identified nine novel loci responsible for NSOFC
To identify the NSOFC susceptibility genes/loci that are specific to CPO or CLO, we genotyped 935 unrelated CPO patients, 948 unrelated CLO patients and 5,050 unrelated control individuals of Southern Han Chinese ancestry using the Illumina HumanOmniZhongHua-8 BeadChip [27], which has 900,015 single-nucleotide polymorphisms (SNPs) in our discovery stage. Sample collection for the cohorts is shown in Table 1  After standard quality-control filtering for the participants and the SNPs (see Methods) and after excluding samples with poor quality and genetic heterogeneity using population stratification analysis, we obtained genotype data for 930 isolated CPO patients, 945 isolated CLO patients and 5,048 control individuals ( Table 2, the discovery cohort). The principal component analysis (PCA) results indicated that the remaining cases and controls were genetically well matched, without evidence of gross population stratification (S3 Fig). The genomic inflation factor (λ GC ) in the discovery cohort was 1.016 for CPO and 1.031 for CLO, suggesting that the association test statistics were not substantially confounded by the population substructure. We performed GWAS analysis using the logistic test with adjustment for sex and PCs (C1, C2, C3 and C4) using PLINK version 1.9 software. To further increase the genome coverage, we performed an imputation analysis to infer the genotypes of additional common SNPs (see Methods). The quantile-quantile (QQ) plots (using R package "qqman") of the association results are shown in S4 Fig. The Manhattan plot (using R package "qqman") of the P values is shown in Fig 1. CPO did not show a strong association signal in the discovery stage. If we used the normal GWAS significance cutoff, such as P < 5 × 10 −8 , we may omit some true association genes at the first replication. The association of CPO would be weak because the previous CPO GWAS analysis based on trios did not discover many CPO main effect genes/ loci. Previous studies suggested that the genetic model of CPO should be composed of many minor effect genes/loci. Therefore, we used the P < 9 × 10 −7 cutoff in the discovery stage for Table 2. Genes/loci identified for NSOFC by typical GWAS. Results of typical GWAS for SNPs significant at a multiple-testing correction level (P < 5        SNP selection for the first round replication. A total of 22 loci showed evidence of a significant association with CPO or CLO (i.e., surpassing P < 9 × 10 −7 ) at the GWAS discovery stage. All the association results of CPO and CLO adjusted for sex and PCs (PC1, PC2, PC3 and PC4) with P values less than e-5 in the discovery stage (935 CPO patients, 948 CLO patients and 5,050 control individuals) were listed in S1 Data. Additionally, 22 previously reported loci including 32 SNPs responsible for CL/P, CPO or CLP also showed marginal association (P < 0.05) with CPO or CLO (S1 Table). Among these loci, two reported CPO SNPs rs61776460 in 1p36.11 (GRHL3, P = 9.31× 10 −3 ) and rs604328 in 5p13.2 (UGT3A2, P = 3.15× 10 −3 ) are weakly associated with CPO in this study [28]. For the four reported CPO SNPs by Butali A et al. (rs80004662 and rs113691307 in CTNNA2, rs62529857 in SULT2A1 and rs2325377 in DACH1 [29], we could not find these SNPs in our chips. Two reported CLP SNPs rs908822 in 4q28.1 (LOC285419, P = 2.34 × 10 −3 ) and rs13317 in 8p11.23 (BAG4/FGFR1, P = 4.21 × 10 −3 ) [24] are weakly associated with CPO in this study. Two reported CL/P SNPs rs987525 in 8q24.21 (AC068570.1, P = 5.53 × 10 −3 ) [11,13] [13,16] and rs60417080 in 13q31.1 (RP11-501G7.1, P = 8.83 × 10 −3 ) [15,16,18] are also weakly associated with CPO in this study. Reported CLP SNPs rs560426 and rs66515264 in 1p22.1-21.3 (ARHGAP29, P = 1.79 × 10 −3 and P = 4.83 × 10 −3 respectively) [28], as well as rs1034832 in 8q21.3 (DCAF4L2/CTB-118P15.2, P = 2.60 × 10 −5 ) [24] are weakly associated with CLO. In the same locus, rs12543318 which was reported associated with CL/P, is also weakly associated with CLO (P = 3.79 × 10 −5 ) [15,16]. Three more reported CL/P SNPs, rs8049367 in 16p13.3 (RP11-462G12.2, P = 2.95 × 10 −4 ) [19], rs4791774 in 17p13.1 (NTN1, P = 5.87 × 10 −5 ) [19] and rs227731 in 17q22 (NOG, P = 4.35 × 10 −4 ) [15,16,18] showed weakly association with CLO in this study.
In order to replicate the associations that arose from our discovery cohort, we first selected 48 SNPs in the 22 discovered loci mentioned above for replication assays (see Methods, S2 Table) among 724 of the CPO patients, 781 of the CLO patients and 3,265 of the control individuals (the Southern Chinese replication cohort). Thirty-two SNPs in 15 loci maintained statistically significant associations with CPO or CLO (P < 0.05). We then genotyped these 32 SNPs in a secondary replication cohort of Northern Chinese ethnicity, comprising 417 unrelated CPO patients, 492 unrelated CLO patients and 1,832 unrelated normal control individuals.
Furthermore, to assess whether these CPO or CLO associated 13 genes/loci were also associated with CLP, we genotyped the 24 SNPs in the 13 genes/loci in an independent cohort consisting of 2,270 unrelated sporadic cases of CLP and 3,265 control individuals of the Southern Han Chinese population, along with 427 CLP patients and 1,832 control individuals of the Northern Han Chinese population (Table 2). However, we found that only the four genes (IRF6, MYCN, VAX1 and MAFB) that were previously reported to be associated with CL/P were significantly associated with CLP (P < 5 × 10 −8 ). These results demonstrate that the novel genes can only be identified by the specific NSOFC subtypes for association studies, and the genetic factors for CL/P that were previously identified are closer to the genetic factors for CLO and CLP than for CPO.

Expression dosage effect of IRF6 with different alleles associated with CPO or CLO
Consistent with previous findings [11,16,[21][22][23], we also confirmed that the IRF6 gene had the strongest association with both CPO and CLO, supported by the 537 statistically significant SNPs in this region (P < 9 × 10 −7 ) that were identified in the discovery stage (Fig 3A, S3  Table). After the meta-analysis of the results from the discovery and replication cohorts, the IRF6 gene region showed the most significantly associated SNPs with either CPO or CLO (Table 2B, Fig 1). However, the contributions of IRF6 to CPO and CLO were distinctly different based on the following findings: 1) the two alleles of the same associated SNP in the IRF6 showed an opposite direction of association between CPO and CLO, and 2) the SNPs in the IRF6 showed a stronger signal of association with CLO in comparison to CPO. For example, the T allele in rs72741048, had P = 3.07 × 10 −15 and odds ratio (OR) = 1.314 for CPO; while it had P = 8.22 × 10 −40 and OR = 0.575 for CLO (Fig 1, Table 2).
Convergent evidence from CPO and CLO points to the IRF6 gene as being a critical factor for the pathogenesis of NSOFC. Most of the SNPs associated with CPO and CLO are located Regional association for the nine newly discovered loci by typical GWAS. Regional association plots indicate the −log 10 P values of the genotyped SNPs of each locus. The sequence data were aligned to human hg19. The Y axis represents the negative logarithm (base 10) of the SNP P value and the X axis represents the position on the chromosome, with the name and location of genes in the UCSC Genome Browser shown in the bottom panel. The SNP with the lowest meta analyzed P value in the region was marked by a purple star. The colors of the other SNPs indicate the r 2 of these SNPs with the lead SNP. Plots were generated with LocusZoom using hg19/1000 genome build LD for ASN population (2014). https://doi.org/10.1371/journal.pgen.1008357.g002 Genetic factors define CPO and CLO subtypes in the 5'UTR and intronic regions of IRF6, which contain enrichment signals of active transcription start sites, transcriptions, enhancers and ChIP-seq chromatin profiling signals (S4 Table, S5 Fig). This suggests that these regulatory elements might control IRF6 gene expression.
We next sought to determine whether the associated variants for CPO and CLO can affect the levels of IRF6 gene expression in human CPO or CLO disease tissues. IRF6 gene expression in palatine uvula mucosa from 64 patients with CPO and the edge of the upper lip cleft from 49 patients with CLO were assessed, separately, using real-time polymerase chain reaction (PCR). For example, in the SNP rs72741048, T was a risk allele for CPO, but it was a protective allele for CLO. We also found that IRF6 was down-regulated 1.6 times in the TT genotype in comparison to the AA genotype in CPO, but it was down-regulated 4.1 times in the TT genotype in comparison to the AA genotype in CLO (Fig 3B). In addition, IRF6 was down-regulated about 2.0 times in the TT genotype in comparison to the AA genotype in most of the normal tissues (Genotype-Tissue Expression (GTEX) data) (see the list of URLs), suggesting https://doi.org/10.1371/journal.pgen.1008357.g003 that a relatively low expression level of IRF6 is a risk for CPO but is protective for CLO in the affected tissues. The genotype-specific expression patterns were also confirmed at the protein level by immunohistochemistry in the 31 patient-derived tissue samples (Fig 3C). It should be mentioned that the expression of IRF6 was higher in the edge of the upper lip cleft tissue (the main area affected in CLO condition) than in the uvula tissue (the main area affected in CPO condition) in the normal condition (Fig 3C). Although the sample size was limited, these test results suggest that different CPO and CLO phenotypes are partially associated with dosage imbalances in the gene expression of IRF6 in the disease-related tissues.

Five genes/loci were identified by gene network and ontology analysis, and further replications
Our typical GWAS results revealed the importance of developmental transcription factors (TFs) in the regulation of the disease direction (CPO versus CLO), as evidenced by the association signals found near the following TFs: WHSC1, PAX9, FOXC2, IRF6, MYCN, VAX1 and MAFB. Given the limited sample size in the genetic study, some true disease genes could be missed by the typical GWAS analysis due to insufficient power. To explore more potential associated genes for CPO or CLO in our data, we conducted the following five steps for further analysis: 1. We conducted first round candidate of a set of NSOFC candidate genes from published references [1,2,5], the GWAS Catalog, the Human Gene Mutation Database, Phynolizer [30] and Human Phenotype Ontology (see URLs; see Methods).
2. We conducted second round candidate of NSOFC genes using GeneMANIA network analysis [31] and ontology, as well as pathway analysis using Database for Annotation, Visualization and Integrated Discovery (DAVID) [32] with the first round candidate NSOFC genes as queries to obtain more functional related genes. A total of 243 genes were enriched either by interaction with the genetic factors or in the same ontology as the candidate genes for NSOFC. The gene interaction and ontology analysis of the enriched genes are shown in S5 Table, S6 Fig and S7 Fig.   3. We next explored whether these candidate genes were associated with CPO or CLO, using our GWAS datasets. In addition to the identified association genes/loci in our typical GWAS analysis as we described in the first part of our results, we also found an additional 44  5. We conducted the final meta-analysis of the discovery and replications cohorts.
Five novel genes/loci surpassed genome-wide statistical significance (P < 5 × 10 −8 ) in the final meta-analysis of the discovery and replications cohorts (Table 3), with three for CPO and two for CLO. For CPO: rs72688980 in 4q32.1 between CTSO and PDGFC (P = 1.89 × 10 −8 , Table 3. Genes/loci for NSOFC identified by gene network and ontology analysis and further replications. CPO, CLO or CLP associated SNPs selected by gene network and ontology and replication analysis which reached significant at a multiple-testing correction level (P < 5 × 10 −8 ) in CPO or CLO by meta-analysis of discovery and replication results.

Discussion
In this study, we showed the advantage of using GWAS combined with gene network and ontology analysis to identify genetic factors for NSOFC subtypes. We identified 13 genes/loci for NSOFC by using the typical GWAS method, which followed discovery (P < 9 × 10 −7 ) and replication steps to obtain the significant genes/loci (P < 5 × 10 −8 for combined results of discovery and replications). We also identified five additional genes/loci for NSOFC (P < 5 × 10 −8 for the combined discovery and replication cohort results), which might be missed by typical GWAS analysis, by the combined method. In the later method, we applied a combination strategy to mine the possible true disease genes by using network and ontology analysis plus further genotype validation. In all, we identified 11 genes/loci for CPO, 10 of which were novel. We also identified nine genes/loci for CLO, five of which were novel. Among these 18 genes (14 of which were novel), IRF6 and DLK1 (novel) were associated with both CPO and CLO, IRF6 was associated with CLP and MYCN, VAX1 and MAFB were associated with CLO and CLP. By comparing the P values and ORs of CPO, CLO and CLP subtypes, we found that the genetic pattern of CLO is more similar to that of CLP than that of CPO (Fig 4A and 4B). This phenomenon was consistent with a recent study by Carlson JC et al., which showed more significant association signals in CLP vs. CP group rather than CL vs. CLP group [28]. Besides IRF6 locus, CLO also shared VAVX, MAFB and MYCN loci. This finding is consistent with a previous CLP GWAS that CLP and CLO shared more genetic factors [24].
Our results also revealed the importance of developmental TFs in the pathogenesis of these three NSOFC subtypes (Fig 4). We identified 12 genes/loci containing TFs that contribute to CL/P. These TFs include seven families: We suggested that these groups of TFs and their target genes function in a coordinated manner to direct palate and lip tissue specialization during embryonic development and intermittently in response to external signals. Besides these TFs, Carlson JC et al. also identified validated TFs PAX7 (1p36.13) and GRHL3 (1p36.11) were also associated with CLP vs. CP group, and transcriptional corepressor TLE1 (9q21.32) was also associated with CLP vs. CP group [28].
Using gene network analysis, we found that a total of 243 genes were either enriched by interaction with the genetic factors or in the same ontology as the candidate genes for NSOFC with different biological functions. Although we confirmed only five genes that surpassed the genome-wide statistical significance in the final meta-analysis of the discovery and replications cohorts, we could not exclude the rest of the genes as candidates with NSOFC for the following reasons: (1) the power might be insufficient to archive the genome-wide statistical significance level due to the limited number of samples in this study, (2) the genetic contributions responsible for NSOFC were missed in the GWAS because of the limited covered regions in the genome of the chip design and (3) their pathogenesis roles for NSOFC might be at the functional modulation level, not at the genetic variation effect, through interaction with the genetic factors.
Strikingly, the directions of the associated SNPs in the IRF6 gene with CPO and CLO were opposite, suggesting that these two subtypes have different pathogenesis with IRF6, probably by regulating its expression via the associated SNPs. For example, for one of the leading SNP rs72741048 in this region, OR _T for CPO was 1.314 suggesting its risk effect, while the OR _T for CLO was 0.575 suggesting its protective effect. IRF6 belongs to a family of transcription activators that share a highly conserved, helix-turn-helix, DNA-binding domain. The AP-2α enhancer was previously reported to be associated with CLO through binding with the rs642961 site in the intron of IRF6 to regulate its expression [33]. It is likely that the expression of IRF6 is precisely controlled by the coordination of multiple regulatory elements located in the associated SNPs in the gene region [34,35]. The phenotypes of IRF6 mouse models further suggest that the gene dosage balance might be critical for palate and lip development. IRF6null mouse embryos showed oral cavity adhesion [36], implying that the normal palate development was impaired without IRF6. However, approximately 22.0% of IRF6 transgenic mice exhibit an absence of calvaria, but they retain normal palatal shelf fusion. In contrast, 2.7% of these mice had a cleft lip [37], which supports the high expression level of IRF6 as a risk for CLO. Further investigations are needed to dissect the pathogenesis of the dosage effect of IRF6 for the CPO and CLO.
We searched the GWAS Catalog and found five studies based on CPO-related GWAS [12,15,[38][39][40]. These studies did not discover IRF6 locus in CPO perhaps because of 1) a small sample size in the discovery stage and 2) a large degree of population mixture in the discovery stage. From our results, we can see that the power of IRF6 locus in CPO is much lower than that in CLO, so it may be beyond the cutoff threshold of power for GWAS discovery in the small and mixed discovery samples used in the preceding CPO GWAS. However, at least three studies suggested some clues to the association of IRF6 locus with CPO or suggested the opposite OR directions in the SNPs of IRF6 locus in CLO versus CPO. In 2008, Rahimov F. et al. reported that the rs2235371 and rs642961 haplotypes located in the IRF6 region are associated with CL/P. They further validated them in CLO and CPO. These results suggest opposite OR directions in CLO versus CPO in most of the study populations (Norway, Denmark, EURO-CRAN, Europe and Philippines) [33]. In a 2016 Chinese GWAS of nonsyndromic CLP in the discovery stage, the authors identified that IRF6 locus was associated with CLP. Further, they replicated rs861020 in the intron of IRF6 in the replication cohorts of CLP, CLO and CPO. They found that the A allele of this SNP is significantly associated with CLP versus CLO (OR = 0.72, P = 2.05 × 10 −9 ) or CLP versus CPO (OR = 1.51, P = 8.69 × 10 −11 ), suggesting opposite OR directions in these two diseases [24]. Another study reported that rs2235375 in the intron of IRF6 was associated with CPO but not with CLO and CLP in a South Indian population [41].
After the IRF6 gene locus, the PAX9 locus in 14q13.3 was the second strongest associated locus for CPO. In the PAX9 locus, the associated SNPs are located in the intron of the SLA25A gene,~100kb downstream of PAX9. A previous study indicated that mice with a deleted SLA25A gene presented obvious phenotypes of CPO via reduction of PAX9 expression [42]. PAX9 encodes a key TF that was reported to play a role in organs derived from neural crest mesenchyme [43]. PAX9 was required for secondary palate development in mice [44][45][46]. The absence of teeth and the formation of a cleft secondary palate in PAX9-deficient mice have been reported [45]. Furthermore, mutations in PAX9 can cause tooth agenesis in humans [47]. Therefore, it is likely that the CPO associated genetic variants decreased the SLA25A expression and further down-regulated PAX9 expression to associate with the disease. This needs to be addressed in future work.
In a recent GWAS, Yu et al. found in a Chinese population 14 loci based on the nonsyndromic CLP in the GWAS discovery stage, but not CPO or CLO [24]. In the current study, we used CPO and CLO in the discovery stage to find new loci for each group. This is very different at the beginning of the study design from Yu et al.'s study. Because the phenotype CLP is different from CPO or CLO, the genetic factors of these phenotypes may be different; that is the question we raised in this paper. We found that the main effect loci, such as IRF6, MSX1, VAX1 and MAFB, are shared by CLP and CLO. We think the difference between CPO, CLO and CLP are caused by minor effect multi-genetic factors rather than heterogeneity among populations.
In summary, our study advanced current understanding of the genetic architectures of CPO, CLO and CLP. These findings defined the NSOFC subtypes using genetic factors and their functional ontologies. They also provide a clue to improving a diagnosis and treatment of these conditions in the future. However, the current understanding of the biology of these processes in humans remains largely unknown, and it is expected to be complicated. Further functional studies of the genes for NSOFC identified in this study should be conducted to promote drug development and novel therapeutic approaches to treat the disorder.

Subjects
All of the CPO, CLO and CLP cases were nonsyndromic. The CPO cases included the complete cleft palate (the hard and soft cleft palate) and the soft cleft palate. The diagnoses, which were made by professional maxillofacial doctors before surgery, were based on a series of tests, including electrocardiogram, radiography, biochemical test, physical examination, speech evaluation, ultrasonic test and genetic counseling as necessary. Only those controls who had no family history of congenital disease were included in this study. All of these evaluations were done by at least three doctors, including a surgeon, a speech clinician and a geneticist.
A three-stage GWAS for CPO and CLO was conducted, with further replications for CLP. The discovery stage involved 935 CPO patients, 948 CLO patients and 5,050 control individuals (cohort 1). The first replication study was performed among an additional 724 unrelated CPO cases, 781 CLO cases and 3,265 controls (cohort 2). The second replication study was performed among an additional 417 unrelated CPO cases, 492 CLO cases and 1,832 controls (replication cohort: Northern Han Chinese, cohort 3). The first replication for CLP involved 2,270 CLP cases and 3,265 controls (replication cohort: Southern Han Chinese, cohort 4). The second replication study for CLP was performed among an additional 427 unrelated CLP cases and 1,832 controls (replication cohort: Northern Han Chinese, cohort 5). The same controls were used for CPO, CLO and CLP. No obvious geographic areas or genetic differences occurred in this study. In the discovery stage, the CPO and CLO samples were collected by the same team in the same hospital of Southern Han Chinese people (West China Hospital of Stomatology, Chengdu). The control samples were also collected in the same city of Southern Han Chinese people (Sichuan Provincial People's Hospital, Chengdu). The same controls were used for CPO and CLO. For the CPO, CLO and control samples, we used the same chip for genotyping (HumanOmniZhongHua-8 BeadChip, Illumina). In the replication studies, we enrolled the CPO, CLO, CLP and control samples in the same hospitals (CPO, CLO, CLP and controls were enrolled together).

Ethics statement
The study was approved by the institutional ethics committee of West China Hospital of Stomatology of Sichuan University and Sichuan Provincial People's Hospital and was conducted according to the Declaration of Helsinki principles [48]. All controls were healthy individuals without NSOFC or family history of NSOFC (including first-, second-and third-degree relatives). Written informed consent was obtained from all the participants or their guardians. Approximately 4 ml of venous blood was collected from each participant and placed in a tube containing ethylenediaminetetraacetic acid (EDTA) as the anticoagulant. Genomic DNA was extracted from peripheral blood lymphocytes using the standard sodium dodecylsulfate (SDS)-proteinase K-phenol/chloroform method.

Genotyping and quality control in the GWAS
The discovery cohort DNA samples were genotyped by Jinneng Biotech (Shanghai, China) using HumanOmniZhongHua-8 BeadChip (Illumina), according to the manufacturer's protocol, with a starting number of 900,015 SNPs. Any SNPs with call rates of less than 90% were removed from further analysis. SNPs located on the X and Y chromosomes, mitochondrial SNPs, and copy number variant probes were removed from further analysis, in keeping with current GWAS practices. After quality filtering and cleaning, 870,261 SNPs remained in the association analysis for CPO and CLO. Full details of the experimental workflow are provided in S2 Fig. Sex and PCs (PC1, PC2, PC3 and PC4) were adjusted as covariates in the logistic model in PLINK. Close relatives among the participants were calculated using genome-wide IBS/BD among all the samples using PLINK (-genome). We found that z0 was very close to 1 and z1 was very close to 0 in all the samples. Also, there are no related people (IBD pihat > 0.2) within these samples.

Association analysis
After chip genotyping, PCA was performed for both CPO and CLO separately to remove samples with outlying samples from further analysis using the R statistical software package (see URLs). A total of 930 CPO cases, 945 CLO cases and 5,048 control individuals in the discovery cohort passed quality control for the GWAS discovery stage. Next, we examined potential genetic relatedness on the basis of pairwise identity by state for all of the successfully genotyped samples using PLINK version 1.9 software. The genomic inflation estimate (λGC) was calculated for variants with MAF > 1% using only directly genotyped SNPs using PLINK 1.9 (see URLs). Single-marker association analyses were performed using PLINK 1.9 adjusted for sex and MDS (PC1, PC2, PC3 and PC4) as covariates with SNPs showed missing values < 10%, MAF > 1% and HWE P > 10 −6 .

Genotype imputation
Genotypes were converted to PLINK binary format, and SNPs with missing values > 10%, MAF < 1% and HWE P < 10 −6 for phasing were excluded (see URLs). The clean data were then phased using SHAPEIT2 [49] (see URLs). After that, the dataset was imputed with 1000 Genomes phase 1 (version 3) of CHB (Han Chinese) and CHS (Southern Han Chinese) (hg19) with Minimac3 [50] (see URLs) with r 2 > 0.6. The association analysis of the imputed dosage data was calculated using PLINK version 2.0 using sex and MDS (PC1, PC2, PC3 and PC4) as covariates.

SNP selection for replication studies
SNPs showing an association with CPO and CLO exceeding P � 9 × 10 −7 in the GWAS discovery stage were included in the replication stage and analyzed in a similar manner to the discovery stage. In total, 48 SNPs were selected for replication analysis.

Genotyping and quality control in the replication studies
Genotyping of the SNPs selected for the replication studies were conducted using the Sequenom MassARRAY system genotyping as previously described [27]. The association analysis of the replication genotype data was conducted using PLINK 1.9 adjusted for sex.
Meta-analysis PLINK 1.9 was used to perform combined meta-analyses of the GWAS discovery and replication data sets for CPO and CLO. The two CLP replication datasets were also combined using the PLINK 1.9 meta-analysis method.

Manhattan plots, QQ plots and LocusZoom plots
Manhattan plots were generated using QQ, and Manhattan plots for the GWAS Data R package were generated using SNPs with imputed P values less than 0.05 (see URLs). QQ plots were generated using QQ, and Manhattan plots for the GWAS Data R package were generated using all the direct genotyped SNPs (see URLs). LocusZoom plots were generated online from LocusZoom (hg19 November 2014 ASN population) using SNPs with P values less than 0.05 (see URLs).

Epigenetic annotation
Epigenomic annotation of genetic variants for 31 tissues was performed using the Roadmap Epigenome Browser, which was based on the WashU (Washington University) Epigenome Browser and integrates data from both the NIH (National Institutes of Health) Roadmap Epigenomics Consortium and ENCODE (Encyclopedia of DNA Elements) in a visualization [51] (see URLs).

HI-C data browser
Hi-C contact matrices were visualized as heatmaps using the 3D Genome Browser [52] (see URLs). The TAD dataset of normal human epidermal keratinocytes and juvenile foreskin primary cell Hi-C data were used. The hg19 SNP region of the regional association maps was used as the input region.

Gene expression
Human tissue samples were obtained from CPO and CLO patients during surgical cleft repair. Tissues were collected from the rim of the uvula of CPO patients and from the edge of the upper lip cleft of CLO patients. According to the principles of interdisciplinary team care for cleft lip and palate, the patient's age at operation is usually between three and six months for CLO patients and between one and two years for CPO patients. We collected 211 tissue samples (105 palate tissues, 106 lip tissues) for gene expression analysis. Because the patients were young at the time of surgery and the lesion range is relatively limited, the sample size we collected was very small (about 3 mm � 4 mm � 2 mm) for each patient. In almost all cases, only one tissue sample was collected from one patient. Written informed consent was obtained from all guardians on behalf of the patients. All tissues were stored in liquid nitrogen immediately after incision and then transferred to -80˚C for storage.