Genetic Variants Modulating CRIPTO Serum Levels Identified by Genome-Wide Association Study in Cilento Isolates

Cripto, the founding member of the EGF-CFC genes, plays an essential role in embryo development and is involved in cancer progression. Cripto is a GPI-anchored protein that can interact with various components of multiple signaling pathways, such as TGF-β, Wnt and MAPK, driving different processes, among them epithelial-mesenchymal transition, cell proliferation, and stem cell renewal. Cripto protein can also be cleaved and released outside the cell in a soluble and still active form. Cripto is not significantly expressed in adult somatic tissues and its re-expression has been observed in association with pathological conditions, mainly cancer. Accordingly, CRIPTO has been detected at very low levels in the plasma of healthy volunteers, whereas its levels are significantly higher in patients with breast, colon or glioblastoma tumors. These data suggest that CRIPTO levels in human plasma or serum may have clinical significance. However, very little is known about the variability of serum levels of CRIPTO at a population level, and the genetic contribution underlying this variability remains unknown. Here, we report the first genome-wide association study of CRIPTO serum levels in isolated populations (n = 1,054) from the Cilento area in South Italy. The most associated SNPs (p-value < 5×10⁻⁸) were all located on chromosome 3p22.1-3p21.3, in the CRIPTO gene region. Overall, six CRIPTO-associated loci were replicated in an independent sample (n = 535). Pathway analysis identified a main network including two other genes, besides CRIPTO, in the associated regions, involved in cell movement and proliferation. The replicated loci explain more than 87% of the CRIPTO variance, with 85% explained by the most associated SNP. Moreover, the functional analysis of the main associated locus identified a causal variant in the 5’UTR of the CRIPTO gene which is able to strongly modulate CRIPTO expression through an AP-1-mediated transcriptional regulation.


Introduction
Cripto, also known as Teratocarcinoma-derived growth factor 1 (TDGF1), is the original member of the Epidermal Growth Factor-Cripto/Fibroblast growth factor Receptor Ligand 1/Cryptic (EGF-CFC) family of vertebrate proteins involved in embryo development [1][2][3]. Cripto has been isolated in human and mouse [3] and is a GPI-anchored membrane protein [4] that can function in both membrane-anchored and soluble form [5,6]. It is involved in multiple signaling pathways, such as the TGF-β, Wnt and MAPK/ERK pathways; it regulates essential steps in early embryogenesis and is also involved in processes such as cell migration, epithelial-mesenchymal transition (EMT), and stem cell maintenance, all of which are implicated in carcinogenesis [7][8][9][10][11][12][13][14]. Cripto also has a role in angiogenesis: it enhances the proliferation, migration and invasion of human umbilical endothelial cells, stimulates their differentiation into vascular-like structures in Matrigel, and induces tumor neovascularization in vivo [13]. Cripto is expressed at very low levels in different adult tissue types and organs; among these, higher expression is detected in colon, skeletal muscle, heart, cortex of adrenal gland and cerebellum (http://www.biogps.org/ [15], http://www.proteinatlas.org/ [16]). Pathological re-expression is seen in a number of solid cancers. Most studies focused on the role of Cripto in breast and colorectal cancer [17][18][19][20], in inflammatory conditions and also in a macaque model of neuroAIDS [21]. Numerous studies have demonstrated a correlation between high expression levels of CRIPTO and malignant transformation, tumor invasiveness, metastatic spreading and hence poor prognosis [17][22][23][24][25][26][27].
In vitro and in vivo functional studies confirm a strong involvement of Cripto in cancer development and indicate that its effect on tumorigenesis might strictly depend on the cellular context in which it acts [28][29][30][31][32][33].
Moreover, many data indicate Cripto as a promising target for cancer therapy. Adkins et al. demonstrated that blocking Cripto signaling with an anti-CFC domain antibody strongly inhibited tumor cell growth in vivo [34]. Since then, different approaches based on the use of oligonucleotides, vaccines or antibodies have been successfully applied to target Cripto by inhibiting its activity and/or expression in tumors and in neurodegenerative diseases [35]. Cripto inhibition by different approaches always resulted in inhibition of cancer cell proliferation in vitro and of tumor growth in vivo [36].
CRIPTO has also been detected at very low levels in the plasma of healthy controls, whereas significantly higher concentrations have been found in patients with breast, colon or cerebral tumors [19,37]. In both studies, tumor tissues and patient-matched blood samples were analyzed, showing that high CRIPTO levels in the plasma correspond to re-expression of CRIPTO in tumor tissues [19]. These data suggest that CRIPTO, like the carcinoembryonic antigen (CEA) (another GPI-anchored protein and a widely used tumor marker), is able to reach the bloodstream, being potentially released by tumor cells through GPI anchor cleavage. Altogether, these data indicate that CRIPTO represents both a promising biomarker and a valid target for therapeutic intervention in cancer and that blood CRIPTO levels in humans may have clinical significance.
The two studies on the measure of circulating CRIPTO in the plasma published so far were both conducted on a very small group of individuals: the study of Bianco and coworkers analysed 21 healthy donors, 33 patients with colon carcinoma and 75 patients with breast carcinoma or benign breast lesions while the study of Pilgaard and coworkers included 28 Glioblastoma Multiforme (GBM) patients, 4 low-grade glioma patients and 8 healthy controls. In the first case no statistically significant correlations were observed between CRIPTO plasma concentration and various clinicopathologic variables, including tumor size, lymph node involvement and proliferative index and the degree of positivity for CRIPTO in tumor sections [19]. In the second case higher levels of the protein correlated with a shorter overall survival [37].
In the present study, CRIPTO serum levels were measured in a population-based sample from three isolated villages of the Cilento area, South Italy and a very high heritability (>80%) was estimated underscoring the importance of its genetic determinants. We present the first genome-wide association study (GWAS) for CRIPTO protein levels aiming at identifying genetic variants associated with the levels of circulating protein in the serum and we report the functional characterization of the main associated locus.

Results
Characteristics of the study participants are presented in Table 1. No difference was observed in the CRIPTO serum levels between men and women (p-value = 0.77 in the discovery; p-value = 0.65 in the replication). The median level of the protein was higher in the replication sample.

Loci associated with CRIPTO serum levels: discovery and replication
A quantile-quantile plot for the 6,222,455 investigated autosomal SNPs in the discovery GWAS revealed many more SNPs with low observed p-values than expected (S1 Fig.). A total of 455 SNPs associated with CRIPTO serum concentration at p-value < 5×10⁻⁸ in the discovery stage were located on chromosome 3 in a region spanning 6.4 Mb (3p22.11-3p21.31), with the most associated SNP (rs3806702, p-value = 1.03×10⁻¹⁵⁹) located in the CRIPTO gene region (Figs. 1 and S2). The C allele of this SNP was associated with higher levels of CRIPTO (CC = 673.8±29.0 pg/ml; CT = 310.5±6.9 pg/ml; TT = 46.7±4.4 pg/ml) (Table 2). When running a conditional GWAS adjusting for the most associated SNP on chromosome 3, no loci remained associated at the genome-wide significance level, but 700 SNPs across the genome were associated at p-value < 1×10⁻⁴ (S1 Table). Among those, variants were considered linkage disequilibrium (LD)-based independent if their pairwise LD (r²) was less than 0.01 and if they were separated by at least 1 Mb. In this way, 95 LD-based independent loci were identified that included SNPs associated with CRIPTO serum concentration at a p-value < 1×10⁻⁴. For each of the independent loci, the SNP with the lowest p-value was carried forward to replication. Criteria for replication were defined as a p-value < 0.05 in the independent replication sample, an effect in the same direction in discovery and replication, and a p-value in the meta-analysis of discovery and replication samples lower than that obtained in the discovery sample (see Methods).
Table 2 shows the summary results for the main association and the additional 5 replicated loci for CRIPTO serum levels. Regional association plots provide a detailed overview of those loci (S2 Fig.). The proportion of serum CRIPTO variance explained by the replicated loci in the discovery sample was 87.0%, with 84.9% explained by rs3806702 alone and 2.1% explained by the remaining 5 associated SNPs. Similar results were obtained in the replication sample, with 87.1% of CRIPTO variance explained by the 6 associated variants.

Biological pathway analysis
To explore the functional relationship between CRIPTO and the associated genes we used the Ingenuity Pathway Analysis software (IPA). The CRIPTO gene and the genes closest to the replicated SNPs were included in the analysis (see Methods). Those 10 seed genes are reported in S2 Table. The IPA analysis identified a significant (p-value = 1×10⁻⁶) network of 35 molecules, 3 of which corresponded to CRIPTO-associated loci (CRIPTO itself, GAS-7 and TNS-1) (Fig. 2). Function categories assigned to the whole gene network by IPA were: "cell to cell signaling and interaction, hematological system development and function, and immune cell trafficking". In more detail, the function annotation sub-categories significantly enriched in genes involved in the network and including CRIPTO were: cell migration, proliferation of tumor cells, cell differentiation, and blood vessel development (S3 Table).

Functional element analysis in associated loci
To identify functional elements in the associated loci, ENCODE data on chromatin modifications and DNase hypersensitivity elements were analyzed in 3 cell lines (NTERA-2, HepG2, H1-hESC) selected because they express CRIPTO mRNA. Among the replicated SNPs and variants in LD with them (r² > 0.5), 3 variants (rs3791936 in the intronic region of TNS1, rs112481213 in the 5'UTR of CRIPTO, and rs61117007 in the intronic region of NTRK2) were located in promoter or enhancer histone marks, and 1 of those (rs112481213) was also located in DNase hypersensitivity elements in the selected cell lines, suggesting a potential functional role for those associated variants.

Functional characterization of the associated rs112481213 variant
The top 8 associated SNPs in the CRIPTO gene region were all included in a single LD block (S3A-S3B Fig.). Among them, the rs112481213 variant (LD with rs3806702, r² = 0.99; p-value of association = 1.53×10⁻¹⁵⁸), located in a regulatory region according to ENCODE data (see the previous section), was also predicted to create an AP-1 binding site by bioinformatics analysis with the MatInspector program and the TRANSFAC database (S3C Fig.). To explore the possibility that this SNP, located in the 5'UTR of the CRIPTO gene, might influence CRIPTO transcription, we tested the effect of the SNP alleles on transcriptional activity in the NTERA2 teratocarcinoma cell line, which expresses high levels of CRIPTO [38]. Two constructs, both containing a 1,051 bp region upstream of the CRIPTO ATG start codon (including almost all the 5'UTR and also 342 bp upstream of the transcription start site) but differing in the rs112481213 allele (-222A/luc and -222T/luc, respectively), were transfected into NTERA2 cells. Overall, 15 SNPs were included in that region. Because one of those SNPs, rs3806703, was in linkage disequilibrium with rs112481213 (r² = 0.72), two additional constructs (-222T→A/luc and -222A→T/luc) were also produced by site-directed mutagenesis to discriminate the effect of each of the two variants on transcription efficiency (Fig. 3A). The construct containing the rs112481213 A allele (-222A/luc) produced an approximately 5-fold increase (p-value = 9.2×10⁻⁶) in luciferase activity compared to the construct containing the T allele (-222T/luc), indicating that the rs112481213 SNP affects transcription (Fig. 3B). In addition, luciferase activity was restored to high levels in the site-mutated -222T→A/luc construct, demonstrating that the main effect on transcription can be attributed to the rs112481213 variant.
Interestingly, the activity produced by -222T→A/luc was significantly higher than that observed for -222A/luc (p-value = 2.9×10⁻²), and inverse results were obtained for -222A→T/luc and -222T/luc (p-value = 5.8×10⁻⁴), indicating that rs3806703 might also have an effect, although modest, on transcription. A cooperative role of rs112481213 and rs3806703 was also statistically supported by an interaction model tested for the two variants (rs112481213: β = -1.18, SE = 0.04; rs3806703: β = 0.27, SE = 0.06; interaction: β = -0.15, SE = 0.03, p-value = 7.08×10⁻⁸). Therefore, the promoter activity data demonstrated that rs112481213 is a functional regulatory element of CRIPTO transcription whose effect might be modulated by rs3806703. We next verified that the mechanism through which the rs112481213 variant might influence CRIPTO transcription is the creation of an AP-1 consensus binding site, as predicted by bioinformatics analysis. Indeed, an Electrophoretic Mobility Shift Assay (EMSA) using an AP-1-containing PC3 nuclear extract suggested that the oligonucleotide probe containing the A allele robustly bound the AP-1 complex, while the probe containing the T allele bound the complex only weakly (Fig. 4A). The AP-1 binding was confirmed by the addition of specific antibodies against components of the AP-1 complex, which revealed a visibly supershifted band representing a DNA-protein-antibody complex (Fig. 4B). Altogether, these data demonstrate the specificity of the observed DNA-protein interaction as well as a differential interaction of the AP-1 complex with the rs112481213 alleles.

Discussion
Cripto is a typical example of an oncodevelopmental gene having key functions in early embryogenesis, and being re-expressed in the adult during tumorigenesis. Cripto is a GPI-anchored membrane protein, that can also be cleaved and released in the medium and is able to induce cellular proliferation, EMT, migration, and invasion, as well as to stimulate tumor angiogenesis both in vitro and in vivo [36]. Cripto promotes oncogenesis via modulation of TGF-β ligand signaling and through mechanisms that are independent of TGF-β ligands and their signaling receptors [39]. The effect of TGF-β ligands and Cripto on tumorigenesis is also dependent on the cellular context [33,39]. Interestingly, Cripto protein is an obligatory coreceptor for some TGF-β family members such as Nodal, enabling them to bind to Activin receptorial complexes and activate Smad cascade and is also able to antagonize the signaling of other members of the TGF-β family, (i.e., Activins and TGF-β), inhibiting their antioncogenic effect [34,35,40]. Moreover, Cripto acts via separate, non-overlapping mechanisms to enhance the canonical Wnt/β-catenin signaling pathway by binding to low-density lipoprotein receptor-related protein (LRP) 5 and LRP6 [41] and to activate ras/raf/MAPK and PI3K/Akt pathways via c-Src [28]. More recently, novel Cripto-interacting proteins, also involved in cancer, have been identified including the chaperonin glucose regulated protein-78 (Grp78) and Notch1 [39].
In humans, CRIPTO is expressed at very low levels in both normal tissues and plasma, while its expression was found increased in patients with cancer (also in both tumor tissues and plasma), suggesting that CRIPTO blood levels might have a great clinical relevance [19,37]. However in these two articles only a small number of healthy volunteers was analysed and no data on a general population sample were reported.
Our study is the first GWAS of circulating CRIPTO levels. It was undertaken in 1,589 individuals from three population isolates of the Cilento area, South Italy and represents the largest survey of CRIPTO measurement in a population-based sample. In our study, the variability of the circulating protein, according to the heritability estimation, was found to be highly determined by genetic factors. The GWAS identified the strongest association on chromosome 3 at rs3806702 located in the CRIPTO gene region. GWAS conditional to that variant showed that this locus represents the main genetic contribution to the modulation of CRIPTO in the serum. Indeed, 85% of the inherited component of circulating CRIPTO levels is explained by the rs3806702 variant. In accordance with this finding, all individuals with CRIPTO protein levels below the detection threshold (34% of the entire sample) were found homozygous for the T allele of the rs3806702 variant. The lower levels of CRIPTO in the discovery sample compared to the replication sample can also be explained by the difference in allele frequency (for T allele 0.77 and 0.67 respectively).
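The allele-frequency argument can be made concrete with a short calculation. The following is only an illustrative sketch: it assumes Hardy-Weinberg genotype proportions (which may not hold exactly in isolated populations) and assumes that the genotype-specific mean levels reported in the Results apply to both samples. The function name `expected_mean_level` is hypothetical.

```python
# Genotype-specific mean CRIPTO levels (pg/ml) for rs3806702, from the Results.
GENOTYPE_MEANS = {"CC": 673.8, "CT": 310.5, "TT": 46.7}


def expected_mean_level(t_freq, means=GENOTYPE_MEANS):
    """Expected population mean level, ASSUMING Hardy-Weinberg proportions."""
    c_freq = 1.0 - t_freq
    return (
        c_freq ** 2 * means["CC"]            # CC homozygotes
        + 2 * c_freq * t_freq * means["CT"]  # heterozygotes
        + t_freq ** 2 * means["TT"]          # TT homozygotes
    )


discovery = expected_mean_level(0.77)    # T allele frequency in discovery
replication = expected_mean_level(0.67)  # T allele frequency in replication
print(round(discovery, 1), round(replication, 1))
```

Under these assumptions the expected mean is roughly 173 pg/ml at a T allele frequency of 0.77 and roughly 232 pg/ml at 0.67, which goes in the same direction as the difference observed between the two samples.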
An LD block included rs3806702 as well as the top 7 associated SNPs in the CRIPTO gene region. Some of those were reported by bioinformatics analysis as potential candidates affecting transcription factor binding sites. Among those variants, rs112481213, located in the 5'UTR of the CRIPTO gene, was identified by functional experiments as a causal SNP for CRIPTO transcriptional regulation. The regulation of Cripto expression during embryogenesis and tumorigenesis is still incompletely defined. So far, different binding sites have been found in the promoter region of the Cripto gene: for Smad proteins [42], the T-cell factor/lymphoid enhancer factor (Tcf/Lef) [43], the Hypoxia-Inducible Factor 1 (HIF-1) [44], the Nkx2-5 early cardiac transcription factor [45], and the orphan nuclear Liver Receptor Homolog-1 (LRH-1) [46]. HIF-1 and Nkx2-5 are able to transcriptionally activate Cripto during cardiac differentiation; HIF-1 also activates CRIPTO expression in human embryonal carcinoma cells under hypoxic conditions [44,46]. Conversely, CRIPTO is directly repressed by the orphan nuclear receptor germ cell nuclear factor (GCNF), which binds to the promoter during retinoic acid-induced differentiation of human embryonic carcinoma cells, and by the miR-15a/16 cluster, which binds to the 3'UTR of CRIPTO mRNA [47,48]. However, this is the first time that an AP-1 transcriptional activation of CRIPTO has been described. We have indeed demonstrated that CRIPTO expression is regulated by the AP-1 transcription factor and that this regulation depends on the rs112481213 genotype. Transcriptional activity data also suggest that, in addition to rs112481213, rs3806703, another SNP present in that region, may have a role in modulating CRIPTO protein levels, possibly through the involvement of GATA-binding transcription factors. In support of this hypothesis, a statistical interaction between rs112481213 and rs3806703 was also found.
Due to the complexity of Cripto gene regulation and its dependency on the specific biological context, additional regulatory mechanisms might occur in the case of cancer-related cell dysfunction.
Further, five additional loci were associated with CRIPTO serum levels at p-values < 1×10⁻⁴ independently of the main signal. Although these associations did not reach genome-wide significance in the discovery sample, likely because of the limited power of our study, they were replicated in an independent sample and might represent good candidate loci as modulators of circulating CRIPTO. Moreover, three of the replicated variants were located in regions involved in regulatory processes associated with chromatin accessibility.
A single network of 35 molecules including CRIPTO and two other associated loci, the growth-arrest-specific gene7 (GAS-7) and the Tensin1 (TNS-1), was identified by IPA analysis. The network included 30 genes implicated in cell migration, 18 genes involved in tumor cell proliferation, 25 genes in cell differentiation and, interestingly, 17 genes implicated in blood vessel development.
The analysis of CRIPTO-associated loci showed that these are mainly linked to the MAPK/ERK signaling pathway, with ERK1/2 as one of the principal players of the network. Aberrant regulation of MAPK cascades is known to strongly contribute to cancer and other human diseases.
Like CRIPTO, TNS-1 and GAS-7, the other two genes in the associated regions present in the network, are both involved in breast and colon cancer [49][50][51][52]. TNS-1 binds to actin filaments [53] and serves as a link between signal transduction pathways and the actin cytoskeleton by forming a structural platform that regulates the assembly of focal adhesion components, phosphoproteins, and signaling molecules for processes such as cell migration [54]. TNS-1 is expressed in normal tissues [55] while its expression is greatly reduced in human breast, prostate, head and neck squamous cell carcinomas, and melanoma suggesting a role as tumor suppressor [56] as well as in the maintenance of cell polarization, and the suppression of invasion that are involved in metastasis [57]. The TNS-1 phosphotyrosine binding (PTB) 1 domain binds the cytoplasmic tail of beta-integrin, presumed to be the basis for focal adhesion localization. Interestingly, overexpression of Cripto in vitro and in vivo has been associated with increased expression of fibronectin and various integrins and with increased activation of focal adhesion kinase [58].
Similarly to TNS-1, GAS-7 binds actin and participates in cytoskeleton dynamics, executing different functions in different cellular processes, such as vesicle trafficking, cell migration and morphological differentiation [59,60]. GAS-7 hypermethylation has been found in breast and colon cancers whereas increased expression has been detected in medulloblastoma [61]. GAS-7 expression is regulated by ERK signaling pathway [62], in which Cripto is also involved.
The same CRIPTO-associated locus on chromosome 17 also contains, besides GAS-7, the G-protein-coupled receptor (GLP-2R) gene [63][64][65]. Interestingly, GLP-2R activation also induces ERK1/2 MAP kinase activation and is able both to stimulate the expression of the immediate early genes c-Fos, c-Jun, JunB and Egr-1 and to activate AP-1-driven gene transcription in a PKA-dependent manner [66,67].
Two other genes included in the associated loci, the Myosin VA (MYOVA) gene on chromosome 15 and Neurotrophin tyrosine kinase receptor 2 (NTRK2) on chromosome 9, are both overexpressed in cancer. In particular, MYOVA is highly expressed in a number of highly metastatic cancer cell lines and metastatic colorectal cancer tissues and is able to interfere with metastatic capabilities by influencing cell migration. MYOVA expression is upregulated by the transcription factor Snail, one of the molecular switches for the EMT program involved in cancer metastasis [68]. NTRK2 has been found frequently overexpressed in human cancers, including pancreatic and prostate carcinoma, Wilms' tumor and neuroblastomas, particularly those with aggressive behavior and poor prognosis. Like Cripto, NTRK2 activates both phosphatidylinositol-3-kinase (PI3K) and MAPK/ERK signaling.
In summary, our data showed that CRIPTO protein is measurable in the serum of the majority of individuals in a population-based sample. Further, we identified the largest genetic contribution to CRIPTO variability and demonstrated that a functional variant located in the 5'UTR of the CRIPTO gene is able to modulate CRIPTO expression through an AP-1-mediated transcriptional regulation. We also provided support for additional associated loci that will need to be confirmed in larger samples. Nevertheless, many of those associations converge on cancer phenotypes, mainly on cell movement and proliferation functions. As no association has been detected at these CRIPTO-associated loci in large-scale cancer GWAS, further studies looking at CRIPTO variability in serum together with genotyping of the functional variant in specific cohorts of patients, focusing the analysis on specific cancer phenotypes such as metastasis formation, aggressiveness and prognosis, would be useful to better investigate possible associations between variants modulating CRIPTO protein levels and cancer features.

Population samples and CRIPTO measurement
The discovery sample includes 1,054 individuals recruited through a population-based sampling strategy in two small isolated villages of the Cilento region, South Italy (Gioi and Cardile) [69]. In silico replication was performed in an additional 535 subjects from another village (Campora) of the same region [70]. The study design was approved by the ethics committee of Azienda Sanitaria Locale Napoli 1. The study was conducted according to the criteria set by the Declaration of Helsinki and each subject signed an informed consent form before participating in the study.
Blood samples were collected in the morning after the participants had been fasting for at least 12 h. Aliquots of serum were immediately prepared and stored at -80°C, and were subsequently used for the assessment of CRIPTO levels. CRIPTO (pg/ml) was measured using an enzyme-linked immunosorbent assay, according to the manufacturer's instructions (DRG Instruments GmbH, Germany). An intra-assay coefficient of variation of 7.74% for the CRIPTO measure was obtained from 10 repeated measurements of 10 serum samples. Individuals with CRIPTO levels below the detection threshold were included in the study and a value of 0 pg/ml was assigned to them.
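As an illustration of how an intra-assay coefficient of variation can be computed from repeated measurements, a minimal sketch follows. The text does not state how the per-sample values were pooled, so averaging the per-sample CVs is an assumption here, and the function names and readings are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(replicates):
    """CV (%) of repeated measurements of one serum sample."""
    return 100.0 * stdev(replicates) / mean(replicates)


def intra_assay_cv(samples):
    """Mean of the per-sample CVs (ASSUMED pooling rule, not from the paper)."""
    return mean(coefficient_of_variation(reps) for reps in samples)


# Hypothetical readings (pg/ml): two samples, three replicates each.
toy_samples = [[300.0, 310.0, 290.0], [50.0, 54.0, 46.0]]
print(round(intra_assay_cv(toy_samples), 2))
```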
Mann-Whitney U test was used to compare median CRIPTO serum levels among the samples.
A normal quantile transformation was applied to the trait and the transformed trait was used in all statistical analyses. The heritability of CRIPTO serum levels was estimated by SOLAR software [71] using extended genealogies of discovery and replication population samples and adjusting the phenotype for gender and age.
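A normal quantile transformation replaces each value by the standard-normal quantile corresponding to its rank. The sketch below is one common rank-based variant and is not necessarily the exact procedure used here: the rank offset of 0.5 and the use of average ranks for ties (relevant because values below the detection threshold are all set to 0 pg/ml) are assumptions, and the function name is hypothetical.

```python
from statistics import NormalDist


def normal_quantile_transform(values, offset=0.5):
    """Rank-based inverse normal transform with average ranks for ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - offset) / n) for r in ranks]


# Hypothetical serum levels (pg/ml); the two zeros mimic undetectable samples.
levels_pg_ml = [0.0, 0.0, 46.7, 310.5, 673.8]
transformed = normal_quantile_transform(levels_pg_ml)
```

Tied values receive the same transformed score, so a large group of undetectable samples ends up as a single point mass at the lower end of the transformed distribution.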

GWAS and replication study
Genotyping was performed with 370K and Omniexpress Illumina chips; phasing and imputation were conducted separately by platform with the MaCH [72] and minimac (http://genome.sph.umich.edu/wiki/Minimac) software respectively, using 1000G v3 data as reference. SNP allele frequencies in Cilento samples versus 1000 Genomes reference allele frequencies for all genotyped SNPs are reported in S4 Fig. Quality control filters applied before imputation were call rate >95% for SNPs and samples and minor allele frequency (MAF) >0.01. GWAS was carried out through a mixed model linear regression where the variance/covariance matrix is the genomic kinship, to account for relatedness between individuals. Age and gender were used as covariates and an additive genetic model was considered. The analysis was performed with the GenABEL package [73] for genotyped SNPs and ProbABEL [74] for imputed data. SNPs with imputation quality (Rsq in MACH) <0.4 or MAF <0.05 were excluded.
Conditional analysis was carried out in the discovery and replication samples adding the additive effect of rs3806702 as covariate in the association model.
To select linkage disequilibrium (LD)-based independent association signals among the CRIPTO-associated SNPs, we conducted the clumping procedure implemented in PLINK [75] and picked the index SNPs with the most significant association p-value from each clumped association region based on the GWAS. The 1000G v3 genotypes were used as reference panel; the physical threshold for clumping was 1 Mb, and the r² threshold for clumping was 0.01.
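The logic of LD-based clumping can be sketched as a greedy procedure: take the most significant SNP as index, absorb nearby SNPs in LD with it, and repeat with the next unassigned SNP. The code below is a simplified illustration, not PLINK itself. The thresholds mirror those stated above, while the data layout, the `clump` function and the pairwise `r2` lookup are hypothetical.

```python
def clump(snps, r2, p_threshold=1e-4, r2_threshold=0.01, window_bp=1_000_000):
    """Greedy clumping sketch.

    snps: list of dicts with keys 'id', 'chrom', 'pos', 'p'.
    r2:   function (id_a, id_b) -> pairwise LD from a reference panel.
    """
    candidates = sorted(
        (s for s in snps if s["p"] < p_threshold), key=lambda s: s["p"]
    )
    assigned = set()
    clumps = []
    for index_snp in candidates:
        if index_snp["id"] in assigned:
            continue  # already absorbed by a more significant index SNP
        assigned.add(index_snp["id"])
        members = [index_snp["id"]]
        for other in candidates:
            if other["id"] in assigned:
                continue
            nearby = (
                other["chrom"] == index_snp["chrom"]
                and abs(other["pos"] - index_snp["pos"]) <= window_bp
            )
            if nearby and r2(index_snp["id"], other["id"]) >= r2_threshold:
                assigned.add(other["id"])
                members.append(other["id"])
        clumps.append({"index": index_snp["id"], "members": members})
    return clumps


# Hypothetical toy data: 'b' is close to and in LD with 'a'; 'c' is far away.
toy_snps = [
    {"id": "a", "chrom": 3, "pos": 100, "p": 1e-10},
    {"id": "b", "chrom": 3, "pos": 200, "p": 1e-6},
    {"id": "c", "chrom": 3, "pos": 5_000_000, "p": 1e-5},
    {"id": "d", "chrom": 9, "pos": 100, "p": 1e-3},
]
toy_ld = {frozenset(("a", "b")): 0.8}


def lookup_r2(x, y):
    return toy_ld.get(frozenset((x, y)), 0.0)


toy_clumps = clump(toy_snps, lookup_r2)
```

In this toy run the index SNPs are `a` and `c`: `b` is absorbed into the clump of `a`, and `d` never enters because its p-value is above the threshold.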
To assess evidence for replication, test-statistics of discovery and in silico replication samples were meta-analysed using a fixed effect model weighted by inverse variance, using Metal [76]. SNPs were considered replicated if the SNP p-value was <0.05 in the replication sample alone, the effect was in the same direction between discovery and replication, and the p-value in the meta-analysis was lower than in the discovery sample.
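The three replication criteria and the inverse-variance fixed-effect combination can be written compactly. This is a minimal sketch and not the Metal implementation: p-values are derived from a normal approximation of beta/SE (an assumption, since the reported p-values come from mixed models), and all function names and input numbers are hypothetical.

```python
import math
from statistics import NormalDist


def two_sided_p(beta, se):
    """Two-sided p-value from a normal approximation of the Wald statistic."""
    z = abs(beta / se)
    return 2.0 * (1.0 - NormalDist().cdf(z))


def inverse_variance_meta(beta1, se1, beta2, se2):
    """Fixed-effect estimate with each study weighted by 1 / SE^2."""
    w1, w2 = 1.0 / se1 ** 2, 1.0 / se2 ** 2
    beta = (w1 * beta1 + w2 * beta2) / (w1 + w2)
    se = math.sqrt(1.0 / (w1 + w2))
    return beta, se


def is_replicated(beta_disc, se_disc, beta_rep, se_rep, alpha=0.05):
    """Apply the three criteria: replication p, direction, meta p."""
    p_disc = two_sided_p(beta_disc, se_disc)
    p_rep = two_sided_p(beta_rep, se_rep)
    beta_meta, se_meta = inverse_variance_meta(
        beta_disc, se_disc, beta_rep, se_rep
    )
    p_meta = two_sided_p(beta_meta, se_meta)
    same_direction = beta_disc * beta_rep > 0
    return p_rep < alpha and same_direction and p_meta < p_disc


# Hypothetical effect sizes and standard errors for one SNP.
print(is_replicated(0.20, 0.05, 0.18, 0.07))
```

Requiring the meta-analysis p-value to improve on the discovery p-value guards against SNPs that pass the nominal 0.05 threshold in replication but with an effect much weaker than in discovery.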
The percentage of the variance of the CRIPTO levels explained by the replicated SNPs was calculated both in the discovery and replication samples. Three linear mixed effects models were fitted, in which the CRIPTO was regressed, respectively, on: 1) gender and age (basic model); 2) gender, age, additive effect of a single SNP (single SNP model); 3) gender, age, additive effect of each of the replicated SNPs (multiple SNP model). The variance explained by each SNP was calculated as the difference between the variance explained by the single SNP model and that explained by the basic model. Similarly, the variance explained by the replicated SNPs all together was estimated as the difference between the variance explained by the multiple SNP model and that explained by the basic model. The lmekin function (R package), which uses the genomic kinship matrix to correct for relatedness between individuals, was applied.
To test for interaction between rs112481213 and rs3806703, a linear mixed effect model, (including gender, age, additive effect of the two SNPs and the interaction between the two SNPs effects) implemented in the lmekin function (R package) was used. The variance inflation factor (VIF) was checked to be below ten to exclude collinearity problems [77].
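For intuition on the collinearity check, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the others. With only two predictors, R² reduces to their squared correlation. The sketch below uses the reference-panel LD between the two SNPs (r² = 0.72) as a stand-in for that correlation, which is an assumption: the fitted model also contains covariates and the interaction term, so the VIFs actually checked may differ. The function name is hypothetical.

```python
def vif_two_predictors(r_squared):
    """VIF when a predictor is explained by a single other predictor."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)


# LD between rs112481213 and rs3806703 as reported in the Results.
vif = vif_two_predictors(0.72)
print(round(vif, 2))
```

Under this simplification the VIF is about 3.6, below the threshold of ten used in the text.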

Pathway analysis
For each replicated SNP, seed genes were selected as those located within a region extending from 100 kb upstream to 100 kb downstream of the region delimited by SNPs in LD (r² > 0.5) with it. For the main associated locus on chromosome 3, the CRIPTO gene was included as seed. Overall, 12 seed genes were analyzed with Ingenuity Pathway Analysis software (IPA, Ingenuity Systems, www.ingenuity.com) to explore the functional relationship between the proteins encoded by those genes. IPA tests a set of genes for enrichment in defined canonical pathways or functions and generates de novo networks of interacting genes or gene products. IPA computes a p-value, based on a Fisher's exact test, that represents the likelihood of the core genes in a network and biological function being found together due to random chance. Direct and indirect interactions, a high confidence (experimentally observed or highly predicted) and a maximum size of 35 genes/proteins per network were used as parameters in the analysis.
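To give a sense of what a Fisher's exact test measures in this setting, the right-tail probability can be computed from the hypergeometric distribution. This sketch is not IPA's procedure: the reference universe of 20,000 genes is a hypothetical value, the function name is an assumption, and IPA's own gene universe and scoring are not described here.

```python
from math import comb


def fisher_right_tail(overlap, network_size, seed_count, universe_size):
    """P(X >= overlap) when seed_count genes are drawn at random from a
    universe that contains network_size network genes."""
    max_overlap = min(network_size, seed_count)
    total = comb(universe_size, seed_count)
    tail = sum(
        comb(network_size, k)
        * comb(universe_size - network_size, seed_count - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# 3 seed genes in a 35-molecule network, 10 seeds, HYPOTHETICAL universe.
p_value = fisher_right_tail(3, 35, 10, 20_000)
```

A small value indicates that finding this many seed genes in one network would be unlikely if the seeds were a random draw from the universe.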

DNA functional element analysis
The associated loci were investigated for the presence of chromatin histone marks and DNase hypersensitive elements using data from ENCODE included in the Haploreg software (http://www.broadinstitute.org/mammals/haploreg/) [78]. The replicated SNPs and variants in LD with them (r² > 0.5) were analyzed. Cell lines in which CRIPTO mRNA was reported to be expressed in the ENCODE database (www.genome.gov/encode/) were selected for the analysis.
NTERA2 cells, cultured in Dulbecco's modified Eagle's medium F-12 (Gibco-Invitrogen) supplemented with 10% FBS at 37°C and 5% CO2, were transiently transfected using JetPRIME transfection reagent (PolyPlus Transfection) following the manufacturer's protocol. Briefly, 0.5 μg of pGL3-Basic vector, -222T/luc, -222A/luc, -222T→A/luc or -222A→T/luc was cotransfected together with 10 ng of Renilla luciferase reporter plasmid (Promega). Luciferase activity was assayed at 48 h using the Dual-Luciferase Reporter Assay System (Promega) according to the manufacturer's protocol. Firefly luciferase activity was normalized to the activity of the Renilla luciferase. Each construct was tested in triplicate in at least 3 independent experiments.
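The normalization step of a dual-luciferase readout amounts to dividing each firefly signal by the Renilla signal from the same well, and then comparing constructs by the ratio of their mean normalized activities. The sketch below uses hypothetical readings and hypothetical function names; it illustrates the arithmetic only and does not reproduce the reported data.

```python
from statistics import mean


def relative_activity(firefly, renilla):
    """Per-well firefly signal normalized to the co-transfected Renilla."""
    return [f / r for f, r in zip(firefly, renilla)]


def fold_change(construct_activity, reference_activity):
    """Ratio of mean normalized activities between two constructs."""
    return mean(construct_activity) / mean(reference_activity)


# HYPOTHETICAL triplicate readings (arbitrary light units).
a_allele = relative_activity([5000.0, 5200.0, 4800.0], [100.0, 104.0, 96.0])
t_allele = relative_activity([1000.0, 1100.0, 900.0], [100.0, 110.0, 90.0])
print(fold_change(a_allele, t_allele))
```

Normalizing within each well before averaging removes differences in transfection efficiency between wells, which is the purpose of the Renilla control.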

Electrophoretic mobility shift and antibody supershift assays
All reactions included a double-stranded, biotin-labeled oligonucleotide probe at 40 fmol. EMSAs were performed using the LightShift Chemiluminescent EMSA kit (Pierce Biotechnology) according to the manufacturer's protocol. The PC3 cell line was used because of its high expression of AP-1 components [79]. Nuclear extracts were prepared using the Subcellular Protein Fractionation Kit for Cultured Cells (Pierce Biotechnology) according to the manufacturer's protocol. Nuclear extracts (5 μg) were incubated at room temperature for 20 min either with the biotin-labeled probe alone or with the biotin-labeled probe and a 50- or 200-fold molar excess of unlabeled competitor probe, before being loaded on a 4% nondenaturing acrylamide gel and subjected to autoradiography. The following double-stranded biotin-labeled oligonucleotides were used as probes: AP-1 control 5'-biotin-CGCTTGATGACTCAGCCGGAA-3'; -222T probe 5'-biotin-GAAAGGCTGAGTCTCCAGCTC-3'; -222A probe 5'-biotin-GAAAGGCTGAGTCACCAGCTC-3'. A scrambled oligonucleotide (5'-GAAAGGCTTGACGACCAGCTC-3') was also used for competition at 200-fold molar excess.
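A quick way to see why the two allele-specific probes are expected to behave differently is to scan them for the canonical AP-1 response element, 5'-TGA(C/G)TCA-3', on either strand. The following sketch uses an exact-match pattern, which is much cruder than the matrix-based scoring of MatInspector, and its function names are hypothetical.

```python
import re

# Canonical AP-1 (TPA response element) consensus: TGA(C/G)TCA.
AP1_CONSENSUS = re.compile(r"TGA[CG]TCA")

# Probe sequences as listed in the text (biotin label omitted).
PROBES = {
    "AP-1 control": "CGCTTGATGACTCAGCCGGAA",
    "-222T": "GAAAGGCTGAGTCTCCAGCTC",
    "-222A": "GAAAGGCTGAGTCACCAGCTC",
    "scrambled": "GAAAGGCTTGACGACCAGCTC",
}


def reverse_complement(sequence):
    """Reverse complement of an upper-case DNA sequence."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_ap1_site(sequence):
    """True if the consensus occurs on the given or the opposite strand."""
    return bool(
        AP1_CONSENSUS.search(sequence)
        or AP1_CONSENSUS.search(reverse_complement(sequence))
    )


site_by_probe = {name: has_ap1_site(seq) for name, seq in PROBES.items()}
print(site_by_probe)
```

With this pattern, the AP-1 control and the -222A probe contain a consensus site, whereas the -222T and scrambled probes do not; the T allele changes the last base of the TGAGTCA heptamer.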
Supershift assays were performed identically except for the addition of 3 μg of antibody for 3 h on ice before the addition of the -222A probe. Antibodies used were anti-c-Jun, anti-JunB, anti-JunD, anti-Fra1 and anti-Fra2 [79] from Santa Cruz Biotechnology. An HA-probe antibody against the influenza hemagglutinin (HA) protein (Santa Cruz Biotechnology) was used as negative control.

For rs3806702 the p-value of the discovery GWAS is reported; for the other SNPs, p-values of the discovery conditional GWAS are reported. Each SNP is colored according to its correlation with the replicated SNP within the region, as specified in the color scheme. Correlation structures correspond to hg19/1000 Genomes EUR Mar 2012. Plots were generated with LocusZoom [80]. The CRIPTO gene is reported as TDGF1.

(A) The genomic localization of the top 8 associated SNPs (red) with respect to the TDGF1 gene is shown using the GRCh38 primary assembly as reference sequence. The transcript (NM_003212.03) and the coding (CCDS2742.1) sequence are also shown. The functional SNP rs112481213 is underlined and reported in italic. The black rectangle indicates the region of 1,052 bp, upstream of the ATG start codon, cloned in the luciferase reporter pGL3-basic vector. The rs3806703 SNP (orange and with a dashed line) is present in the cloned region but is not included in the top 8 associated SNPs (LD < 0.8 with rs112481213). (B) LD between the top 8 associated SNPs is reported. The analysis was carried out on the 1000 Genomes data (European panel) using the Haploview software. The distance in kb from the ATG start codon is shown in the left column. (C) MatInspector analysis. The results are reported only for the SNPs for which: 1) the SNP alleles differently affect the TF binding site (in the schema, "Change" of the allele is -1 if the Alt allele disrupts the TF binding site and 1 if the Alt allele creates a new TF binding site); 2) the value of the matrix similarity is higher than 0.9; 3) the position of the allele in the sequence logo shows an information content > 0.2 (y axis of the sequence logo). The black star represents the position of the SNP in the TF consensus binding site.