Whole exome sequencing reveals HSPA1L as a genetic risk factor for spontaneous preterm birth

Preterm birth is a leading cause of morbidity and mortality in infants. Genetic and environmental factors play a role in the susceptibility to preterm birth, but despite many investigations, the genetic basis for preterm birth remain largely unknown. Our objective was to identify rare, possibly damaging, nucleotide variants in mothers from families with recurrent spontaneous preterm births (SPTB). DNA samples from 17 Finnish mothers who delivered at least one infant preterm were subjected to whole exome sequencing. All mothers were of northern Finnish origin and were from seven multiplex families. Additional replication samples of European origin consisted of 93 Danish sister pairs (and two sister triads), all with a history of a preterm delivery. Rare exonic variants (frequency <1%) were analyzed to identify genes and pathways likely to affect SPTB susceptibility. We identified rare, possibly damaging, variants in genes that were common to multiple affected individuals. The glucocorticoid receptor signaling pathway was the most significant (p<1.7e-8) with genes containing these variants in a subgroup of ten Finnish mothers, each having had 2–4 SPTBs. This pathway was replicated among the Danish sister pairs. A gene in this pathway, heat shock protein family A (Hsp70) member 1 like (HSPA1L), contains two likely damaging missense alleles that were found in four different Finnish families. One of the variants (rs34620296) had a higher frequency in cases compared to controls (0.0025 vs. 0.0010, p = 0.002) in a large preterm birth genome-wide association study (GWAS) consisting of mothers of general European ancestry. Sister pairs in replication samples also shared rare, likely damaging HSPA1L variants. Furthermore, in silico analysis predicted an additional phosphorylation site generated by rs34620296 that could potentially affect chaperone activity or HSPA1L protein stability. Finally, in vitro functional experiment showed a link between HSPA1L activity and decidualization. In conclusion, rare, likely damaging, variants in HSPA1L were observed in multiple families with recurrent SPTB.

Introduction Preterm birth (PTB), defined as birth before 37 completed weeks of gestation, is a major global public health concern. Worldwide, over 15 million infants (more than one in ten babies) are born preterm and of those, more than one million die from complications related to preterm birth each year [1]. Preterm birth and its complications are the leading cause of neonatal deaths and have become the major cause of death among children under five years old [2]. Moreover, preterm infants are at increased risk, not only of short-term complications but also of life-long disabilities, such as respiratory and cognitive disorders [1]. Preterm birth also increases the risk of adult-onset disorders, such as obesity, diabetes and cardiovascular diseases [3,4]. Currently, there is no generally effective method for prevention of preterm delivery.
The majority (~70%) of preterm births occur after spontaneous onset of labor, with or without preterm prelabor rupture of the membranes (PPROM) [5]. Most spontaneous preterm births (SPTBs) are idiopathic [1,5]; however, recurrence of preterm birth among mothers and within families indicates that genetic factors may be important. Genetic factors are estimated to account for 25-40% of the variation in birth timing [6], with the maternal genome playing chappell@cchmc.org). Some restrictions may apply for the protection of privacy. Any use of the Danish Replication WES data must be approved by the Danish National IRB and the Danish Data Protection Agency through an application facilitated by the Danish Principal Investigator (kchristensen@health.sdu.dk) at the University of Southern Denmark. Some restrictions may apply for the protection of privacy. For the validation GWAS datasets: the summary statistical outcomes of the top 10,000 SNPs for the 23andMe data have been deposited in the GeneStation repository (www.genestation.org/analysis/gwas/Zhang_2017/ discovery), and summary statistics of the complete data set are available on request from 23andMe. Access to the DNBC individual-level data can be obtained through dbGaP Authorized Access portal (https://dbgap.ncbi.nlm.nih.gov/dbgap/aa). The Norwegian cohort data underlying this study was obtained from a third party and is a subject to some legal restrictions. The data originates from the Norwegian Mother Child cohort (MoBa), which is controlled by the MoBa Scientific Management Group. The research data can be accessed via electronic application forms at http://www.fhi.no/ en/online-publications/data-access-fromhealthregistries-health-studies-and-biobanks/datafromlarge-health-studies/research-and-dataaccessfrom-the-n/. Data can be requested of all interested researchers qualifying for the requirements established by the MoBa Scientific Management Group. For more information please contact datatilgang@fhi.no. the major, but not only, role in predisposition to preterm birth [7][8][9][10][11]. Despite many studies of the genetics of SPTB [6,12,13], only a few variants have been robustly associated with this outcome [14], and their functional implications are unclear.
Previous genome-wide association studies (GWAS) of SPTB have involved common variants, but they explain only a small portion of the genetic risk. The role of rare variants in SPTB has been essentially unexplored. Whole exome sequencing (WES) in families offers a comprehensive method to identify rare variant associations with disease, including almost complete coverage of the protein coding regions of the genome. Even though studies of rare variants underlying Mendelian disorders have revealed novel genes [15,16], using WES to study complex multifactorial syndromes remains a challenge [17]. Previous sequencing studies of PTB [18] or PPROM [19,20] have focused only on a set of candidate gene regions and, consequently, have missed the majority of the coding regions of the genome. In contrast to whole genome sequencing, WES is more cost effective and has the advantage of providing more easily interpreted results.
We performed a WES study using families under the hypothesis that familial recurrence is influenced by rare variants with large individual effects on SPTB susceptibility. Such an approach has the potential of identifying genes containing rare variants shared in these multiplex families, as well as genes in pathways common across families. This method applies a hypothesis-free testing approach to identify potentially novel candidate genes for SPTB.

Ingenuity variant analysis and pathway analysis
Seventeen mothers from seven northern Finnish multiplex families (Discovery cohort) and an additional 192 mothers from 95 Danish families (Replication cohort) were sequenced using WES. The pedigrees of the multiplex Finnish families are shown in S1 Fig. For the Discovery cohort, all samples (except one that was excluded from subsequent analyses) passed the quality control parameters used for the clinical exome sequencing at the CMH; quality control cutoffs were 85% reads aligned, 80% aligned with alignment quality of 20 or greater. For these samples, the mean and median heterozygous/homozygous variant ratios were 0.765 and 0.748, respectively. Prior to variant filtering in Ingenuity, the mean/median numbers of nucleotide variant calls per individual were 318,767/326,474 (Discovery cohort) and 221,682/184,381 (Replication cohort). This difference in variant calls between the Discovery and Replication populations is likely due the fact that populations were sequenced using different Next Generation Sequencing platforms, Illumina for Discovery cohort and Complete Genomics for Replication cohort, and their respective primary quality control measures and variant calling methods were thus different. Mean transition/transversion (Ti/Tv) ratios were 2.2 and 2.0 for Discovery and Replication exomes, respectively. An overview of the WES workflow is presented in Fig 1. Discovery population. An initial nucleotide variant analysis (Ingenuity) was performed in ten mothers (out of 17) with at least two recurrent spontaneous preterm deliveries. These mothers were selected for the first analysis because they represent the most severe cases, i.e., multiple preterm deliveries. We further required that at least three of the cases (30%) had a rare variant, that passed the quality control and prioritizing filters (e.g. likely damaging heterozygous variants), in the same gene. For our purposes, rare variants are defined as minor allele frequency (MAF) <1% in 1000 Genomes, ExAC or in NHLBI ESP. Requiring that at least 30% of the cases had a variant in the same gene, ensured that identified genes are present in cases from at least two different families. This analysis yielded 1,510 different variants in 406 genes. Pathway analysis of these variants in the Ingenuity Variant Analysis identified 64 pathways with p<0.01 under a maternal dominant genetic model (S1 Table). The three most significant pathways (p 9.80e−7), observed in all ten mothers, were the glucocorticoid receptor signaling, the estrogen receptor signaling and the AMPK signaling pathways. The genes mapping to these pathways are shown in Table 1; many of these genes were present in more than one pathway. Furthermore, when we included the 13 mothers with spontaneous singleton preterm deliveries, and raised the requirement for candidate genes (harboring rare variants) to be shared by at least four mothers, thereby requiring representation in at least two different families, the same three pathways remained within the top five most significant pathways (for glucocorticoid receptor signaling, estrogen receptor signaling and AMPK signaling pathways, the p-values were 3.51e−8, 2.75e−6 and 2.99e−8, respectively).
We used a family-based analysis with a dominant genetic model with at least two affected individuals within a family (five families) for additional Ingenuity Variant Analysis. Within a family, affected mothers shared on average 444 variants (range 278-691) in 243 genes (range 173-381). Individual pathway analysis of each family revealed the glucocorticoid receptor signaling pathway as common to all families and the estrogen receptor signaling pathway in three families with p<0.01 (S2 Table). Several genes in the glucocorticoid receptor signaling pathways (AR, HSPA1L, NCOA3 and NCOR2) and in the estrogen receptor signaling pathways (NCOA3 and NCOR2) were common to more than one family. These genes all had rare variants likely to be disease causative [21], and were present at a relatively low frequency in the CMH database that was used as internal control data. This analysis only identified two families with p<0.01 for the AMPK signaling pathway, and none of the genes in this pathway were common to more than one family or passed further requirements relative to the control dataset.
Replication population. The Ingenuity Variant Analysis of the replication samples (93 sister pairs) had an average of 807 (range 504-1553) rare nucleotide variants in 593 genes (range 357-1112) shared by the two sisters within each family and assumed to be dominantly acting. The glucocorticoid receptor signaling and estrogen receptor signaling pathways were observed in 75 and 79 families with combined median p = 0.0003 and p<0.0001, respectively (S3 Table). Within these two pathways, the genes from the Discovery analyses (AR, HSPA1L, NCOA3 and NCOR2) were found in between 3 and 12 families from the Replication dataset (Table 2). In an additional two Danish families with affected sister triads, there were 39 common pathways to the two families; the estrogen receptor signaling pathway had the second lowest average p-value (p = 0.0003) and the glucocorticoid receptor signaling pathway had the fifth lowest average p-value (p = 0.0009).

Comparison of data obtained from different WES software tools and comparison of rare variants shared within families in Discovery and Replication populations
Three software programs (Ingenuity Variant Analysis, Varseq and the CMH Variant Warehouse) were used to assess common shared (by affected mothers per family) rare variants.
Only those variants that passed the prioritizing steps with at least two of the annotating software tools were considered valid and are described below (summarized in S4 Table). The benefits of comparing data obtained from multiple software is that it minimized the possibility of picking up falsely called variants that passed quality control filters only by one software. This approach resulted in a total of 844 variants in the Discovery population. For the Replication population, we combined and compared the shared rare variants passing the annotation and prioritizing steps of Ingenuity Variant Analysis and Varseq; a total of 8431 variants passed the filters of both software tools. The CMH Variant Warehouse was not available for the Replication set. For both populations, variants were categorized as loss of function, moderate, or other, according to their predicted consequences, i.e. pathogenicity (S5 Table). We further compared the list of variants resulting from the family-based analyses (as described above) between the Discovery and Replication populations. Numbers of common genes and variants for both populations are shown in S2 Fig. There were 72 rare variants that were found in both populations in 72 genes (S6 Table).

HSPA1L variants associate with preterm birth in GWAS data of 40,000 mothers
Rare single nucleotide variants from HSPA1L [heat shock protein family A (Hsp70) member 1 like], identified by the Discovery Ingenuity pathway analysis, were further investigated using imputed GWAS data that also included variants with MAF <1%. In the Discovery set, variants in AR, NCOA3 and NCOR2 were either CAG repeat length polymorphisms, in-frame deletions or insertions, respectively, and were, therefore, not investigated in the GWAS datasets. Three independent GWAS datasets were used, one of general European ancestry containing more than 40,000 mothers of live births (23andMe dataset) and two from Northern Europe containing 4,600 and 600 mothers (Nordic and northern Finnish datasets, respectively). In the large 23andMe preterm birth GWAS dataset, the minor allele of rs34620296 in HSPA1L, which is in the glucocorticoid receptor signaling pathway, was found to be more common in cases than in controls (case frequency 0.0025 vs. control frequency 0.0010, p = 0.002; Table 3). This association was also significant for gestational age as a continuous trait (gestational age as weeks; p = 0.0016, effect -0.8238, standard error 0.2608). The HSPA1L variants from the Discovery (rs34620296 and rs150472288) and the Replication (rs482145, rs139193421) analyses are listed in detail in Table 3. In the two smaller GWAS datasets, however, these four HSPA1L variants were absent or not significant. Lack of significance may be due to smaller numbers of individuals, especially in cases.

Sanger sequencing to validate HSPA1L variants and to genotype additional family members
Sanger sequencing confirmed the genotypes of the two rare HSPA1L missense variants (rs34620296 and rs150472288) in the samples from the Discovery cohort. The rare HSPA1L variants were observed in a total of six mothers from four unrelated families. Additional family members with available DNA were sequenced for these variants. Interestingly, in two of the families, female carriers of the maternally inherited rs34620296 minor T-allele were born preterm, whereas in the other two unrelated families the male carriers of maternally inherited rs150472288 minor T-allele were born preterm. However, numbers of minor allele carriers are too small for any definite gender related conclusions.

Assessing the potential functional impact of the rare HSPA1L variants
Pathogenicity predictions for rs34620296 and rs150472288 derived from the Discovery cohort as well as for rs482145 and rs139193421 from the Replication cohort were assessed using in silico tools SIFT and PolyPhen-2, and all these variants were predicted as damaging and probably/possibly damaging, respectively (Table 4). In addition, MutationTaster and MutationAssessor predicted all four variants as disease causing and predicted functional (high), respectively. According to the Combined Annotation Dependent Depletion (CADD) score (>20), all of these variants, except for rs139193421, are among the top 1% of deleterious variants in human genome (Table 4). To assess potential consequences of these variants on transcriptional activity, we evaluated them for evidence of histone modification or DNase I hypersensitivity. In silico tools HaploReg 4.1 and/or RegulomeDB showed that all four variants were in regions that had histone marks, as well as strong transcriptional regulatory signatures in various cells of the immune system, especially in T lymphocytes from peripheral blood ( Table 4). Evidence of an active transcription start site was predicted in the HeLa-S3 Cervical Carcinoma Cell Line for rs34620296 and in foreskin fibroblast primary cells for rs482145 (Table 4). Further evidence of active DNA accessibility (DNAse) was found in ovarian tissue for rs150472288, and in psoas muscle tissue for rs139193421 (Table 4). There was also evidence of a transcriptional effect of rs34620296 and rs150472288 (Discovery) in ovary and fetal adrenal gland. Together these results from HaploReg 4.1 and RegulomeDB provide evidence for the potential involvement of HSPA1L variants in the endocrine system, as well as in the adaptive immune cells. These variants could, therefore, have a role in the etiology of SPTB.
We further investigated putative effects of HSPA1L rs34620296 on protein structure. This variant was selected due to its association with SPTB in the large 23andMe GWAS dataset. This variant causes an amino acid change from Alanine to Threonine at position 268 (Ala268Thr). According to the NetPhos 3.1 in silico prediction, Ala268Thr generates an additional phosphorylation site next to an existing phosphorylation site (T267-p) (S3 Fig). Furthermore, Ala268Thr is near an adenosine triphosphate (ATP) nucleotide-binding site located downstream at position 270−277 (Fig 2A). Gain of phosphorylation may cause changes in binding energy, modulate physio-chemical properties or stability kinetics and dynamics of the protein functions such as strength of protein-protein interactions [22].
To investigate the possible effects of the missense variant on protein structure, the reference HSPA1L protein structure and a structure including the Ala268Thr variant were compared simultaneously using UCSF Chimera. There was not a visible change in the overlaid protein structures (Fig 2B). Instead, there was a slight change in the chemical bond lengths (!0.002Å) of the adenosine diphosphate (ADP)-ligand binding amino acid side chains at positions Glu270, Arg274 and Asp368, shown in the 3D model of the HSPA1L (Fig 2C). This may be due to the change from a small size, and hydrophobic, (Ala) to medium size, and polar, (Thr) residue. Such a change in the amino acid side chains could affect the binding efficiency of the ADP molecule.

Functional consequences of the HSPA1L Ala268Thr variant
To determine whether the HSPA1L Ala268Thr (rs34620296) variant alters activity of the GR signaling pathway, we analyzed the consequences of glucocorticoid exposure during decidualization. Human endometrial stromal fibroblasts were transfected with plasmids containing either WT or Ala268Thr cDNA, or with empty vector serving as control. The cells were treated with decidualization media for 72h in a presence of glucocorticoids (100nM dexamethasone) as a surrogate of stress. Protein levels of HSPA1L and GR, as well as mRNA levels of Wnt Family Member 4 (WNT4) were measured. Cells transfected with the WT HSPA1L-pcDNA 3.1 trended to greater increases in cytosolic HSPA1L protein content than those transfected with the Ala268Thr HSPA1L-pcDNA3.1 (mean ± SEM; 1.272 ± 0.142 vs. 0.893 ± 0.146, respectively, p = 0.09) (Fig 3). Furthermore, the Western blot analysis showed that the relative cytosolic protein levels of GR differed significantly between the WT and Ala268Thr groups with more GR present in the WT group than in the Ala268Thr group (mean ± SEM; 1.309 ± 0.099 vs. 0.993 ± 0.096, respectively, p = 0.04) (Fig 3; numerical data available in S7 Table).
Next, we determined the relative gene expression of WNT4 by qPCR. WNT4 is a critical decidualization target found in the recent GWA study [14] associated with gestational length. Increased expression of WNT4 was observed in the WT group, whereas, the Ala268Thr group was less able to activate the WNT4-signaling pathway leading to a lower expression of WNT4 (p = 0.04). HSPA1L-pcDNA3.1 constructs or with empty pcDNA3.1 vector (control). Cells were treated with decidualization media supplemented with 100nM dexamethasone (glucocorticoids) for 72h. Both cytosolic and nuclear protein were extracted, and HSPA1L and GR protein levels were measured by Western blot. Band intensity of HSPA1L or GR was normalized to band intensity of the corresponding β-actin. Cytosolic (A) and nuclear (B) HSPA1L levels as well as cytosolic (C) and nuclear (D) GR levels are shown for control (empty vector), WT and Ala268Thr sample groups. Each experiment was performed as triplicates in three different passages (n = 9 each group, except n = 8 for nuclear control group) and bars represent mean + SEM. Significant p-value <0.05 is presented with an asterisk. Cytosolic GR levels were significantly higher (p = 0.04) in the WT compared to the Ala268Thr group as well as in the WT compared to the control group (p = 0.04). https://doi.org/10.1371/journal.pgen.1007394.g003

Discussion
To move beyond traditional case-control GWAS and family-based linkage studies, we performed a case-only whole exome sequencing study designed to investigate the burden of rare variants in families with recurrent SPTB. Whole exome sequencing enables the discovery of rare, putatively functional variants associated with the etiology of complex disease on a geneby-gene or a pathway-by-pathway basis, and enrichment in multiplex families provides a means to filter large-scale sequencing data.
Comparisons of mothers with recurrent preterm deliveries identified the glucocorticoid receptor signaling pathway as a candidate for mediating the risk of SPTB. Specifically, within this pathway, likely pathogenic missense variations in HSPA1L were found among four unrelated Finnish families (rs34620296 and rs150472288), and within Danish sister pairs (rs482145 and rs139193421). Notably, the rs34620296 minor allele variant was observed at a higher frequency in cases than controls in a very large 23andMe GWAS set. These variants were also identified via bioinformatics analyses as likely affecting either protein function or expression. Further functional evidence linked HSPA1L activity and decidualization.
HSPA1L is a member of the Hsp70 superfamily and is near HSPA1A and HSPA1B within the major histocompatibility complex class III region on chromosome 6. The HSPA1L protein (also known as Hsp70-hom) is~90% identical to HSPA1A and HSPA1B, also known as Hsp70-1 and Hsp70-2, respectively [23,24]. Heat shock proteins (HSPs) are highly conserved cellular defense mechanisms for cell survival and are present in all cell types in all organisms. Some HSPs are expressed constitutively, while others are stress-induced (e.g. heat, hypoxia, oxidative stress, infection and inflammation) [25,26]. Intracellular HSPs act as molecular chaperones and, together with co-chaperones, stabilize existing proteins against aggregation, mediate folding of newly translated proteins, and assist in protein translocation across intracellular membranes [25,27]. HSPs are categorized into families according to their approximate molecular weight; of which Hsp70 (a group of proteins sized approximately 70 kDa) is the best characterized. Potential involvement of stress-induced HSPA1A in adverse pregnancy outcomes, including preeclampsia and PTB, has previously been suggested [28,29]. Although, studies of the role of HSPA1L and HSPA1L in pregnancy are lacking, there is some evidence of involvement in adverse pregnancy outcomes such as preeclampsia [30].
The rare HSPA1L missense variants observed in our study, are in the nucleotide-binding domain (NBD), except the rs482145, which is in the substrate-binding domain (SBD) (Fig 2). ATP binds to the NBD, which is followed by the exchange from low-binding affinity ATP state to high-binding affinity ADP state [29,31,32]. We showed that the non-synonymous variant rs34620296 (Ala268Thr) generates an additional phosphorylation site near the nucleotidebinding site. It showed a modest change in the binding efficiency at this site, which could affect the interaction with ADP or HSPA1L stability itself, as suggested by our transfection studies. In agreement with our findings, a previous study of Caucasian patients with inflammatory bowel disease found that rare mutations in HSPA1L were significantly enriched in patients but absent in healthy controls [33]. Interestingly, one of the associated rare variants was Ala268Thr, and further in vitro biochemical assays of the recombinant HSPA1L showed reduced chaperone activity with this variant [33]. There is also evidence that possibly connects inflammatory bowel disease to adverse perinatal outcomes [34]. Additionally, a previous SPTB study in African Americans found a common nonsynonymous HSPA1L variant, rs2075800, to associate with SPTB [35]. Furthermore, a meta-analysis of previously PTB associated genes linked HSPA1L and SPTB using Ingenuity Pathway Analysis [36].
Due to a very low incidence of the rare HSPA1L variants associating with SPTB in our study, the anticipated attributable risk in the population level is probably small. However, the identification of the damaging alleles may facilitate the identification of causative pathways. For instance, interaction between Hsp70 and Hsp90 chaperones as well as their co-chaperones is essential in the maturation and inactivation of nuclear hormone receptors (e.g. glucocorticoid, androgen, estrogen and progesterone receptors) [37,38]. In the absence of its ligand, glucocorticoid receptor (GR) is bound to a complex constituting of Hsp40, Hsp70 and Hsp90 chaperones; this complex keeps the GR in a ligand-receptive conformation but remaining transcriptionally inactive until ligand binding [38]. As shown previously [33], rare HSPA1L variants can cause partial loss of HSPA1L chaperone activity, and therefore, altered function or expression. Altered function of the chaperones can compromise the stability of the GR complex, leading to an accumulation of partially unfolded proteins that are prone for aggregation and degradation events [37].
Glucocorticoids, steroid hormones that mainly signal through the GR, have anti-inflammatory and immunosuppressive actions. Glucocorticoid signaling communicates with estrogen signaling pathways to tightly regulate the pro-and anti-inflammatory milieu in reproductive tissues [39], and progesterone signaling, via nuclear GR, mediates anti-inflammatory and immunosuppressive effects in genital tract during pregnancy [40,41]. Sustaining a pregnancy is a complex interplay and balance between the innate and adaptive immune cells in the reproductive tissues and at the maternal-fetal interface. Imbalance between the inflammatory cells can cause a breakdown of maternal-fetal tolerance leading to activation of labor (both term and preterm). An untimely stimulus (e.g. stress, infection or inflammation) together with impairments in the glucocorticoid receptor signaling pathway could impose an inadequate response against inflammation or stress. This can elicit a shift from an anti-inflammatory to pro-inflammatory microenvironment, causing a premature activation of labor initiating signals resulting in preterm birth [42,43].
Possible limitations of our study are that the Discovery and Replication populations were sequenced using different Next Generation Sequencing platforms, and primary quality control measures and variant calling methods were thus different. In addition, Next Generation Sequencing generates an enormous amount of data, which could lead to many sequencing artifacts that may be misidentified as variants. We attempted to minimize these artifacts by applying a variety of quality control filters and using a large internal control population to detect potential sequencing or annotation errors. We also compared the results of variant annotation and prioritizing filters from three different software tools to ensure reproducible results. Furthermore, reported variants were confirmed by Sanger sequencing. Another possible limitation was that our study did not include unrelated control samples. This limitation has been partly overcome with the use of additional large GWAS datasets including control samples.
In conclusion, whole exome sequencing of families with recurrent occurrence of SPTB enables identification of rare alleles influencing the predisposition to SPTB. Among the individual genes, two minor alleles of HSPA1L had a strong association to SPTB in multiplex Finnish families and the association of a specific minor allele was confirmed in a large GWAS set. Furthermore, this variant was associated with altered modification and function of the protein.
Overall, our data suggest the need for precise regulation of steroid signaling in mediating birth timing.

Ethics statement
Written informed consent was obtained from all individuals participating in this study, and the study was approved by the Ethics committees of the participating centers: Oulu University Hospital (78/2003, 73/2013), University of Southern Denmark (NVK#1302824), and University of Iowa (IRB#200608748). Individuals in the large European American GWAS were research participants of 23andMe, Inc., a personal genetics company. All 23andMe participants provided informed consent and participated in the research online, under a protocol approved by the external AAHRPP-accredited IRB, Ethical & Independent Review Services (E&I Review).

Study populations and inclusion criteria for SPTB cases
Discovery population: Northern Finnish multiplex families. A total of 17 mothers from seven northern Finnish families were subjected to whole exome sequencing. This included 13 mothers with spontaneous preterm deliveries, of whom 10 had recurrent (2-4) spontaneous preterm deliveries at less than 36 weeks of gestation. The remaining mothers were grandmothers (mothers of the affected index mothers) with pregnancy history including extremely preterm twin pregnancy or late preterm pregnancy. Pedigrees of the large families highlighting the mothers selected for this study are presented in S1 Fig. The majority (five out of seven) of these families have been included in our previous linkage studies [44,45].
Unrelated families with history of recurrent preterm births were selected retrospectively from the birth diaries of Oulu University Hospital from 1973-2003, and prospectively from 2003-2005. Descriptions of inclusion and exclusion criteria have previously been reported in detail [44,45]. A genealogical study was performed according to published criteria [46], and to reveal close consanguinity or common residence, families were followed up to 17 th century using the provincial and national archives of Finland. All families were of the northern Finnish origin. Furthermore, families with apparent maternal inheritance of SPTB were selected for this study. Spontaneous preterm birth was defined as birth before 36 completed weeks of gestation (excluding the borderline 36th week), with delivery occurring with intact membranes or following PPROM. Pregnancies with known risk factors for SPTB were excluded from the primary analyses.
Replication population: Danish sister pairs. This population set, recruited from Denmark, included 192 women from 95 families. This cohort consisted of 93 affected sister pairs (both sisters have given birth preterm) and two sister triads with preterm deliveries. Preterm delivery was set to occur before 37 completed weeks of gestation, and all women had experienced at least one PTB. In the majority (83%) of the sister pairs, both sisters had experienced a spontaneous PTB. All individuals were of the European ancestry.

Whole exome data generation
DNA samples of the 17 individuals from Discovery population were extracted from whole blood and saliva samples using standard methods [45]. Although, using DNA from both blood and saliva samples, there were no major difference in the overall sequencing metrics (alignment metrics or total number of variants) between the sample types. DNA samples were subjected to exon specific next generation sequencing performed at the Center for Pediatric Genomic Medicine, Children's Mercy Hospital (CMH; Kansas City, MO). Exome samples were prepared with the Illumina Nextera Rapid Capture Exome kit according to the manufacturer's protocols as described previously [47]. Sequencing was performed on Illumina HiSeq 2500 instruments utilizing v4 chemistry with 2 x 125 nucleotide sequences. Sequence data were generated with Illumina RTA 1.18.64.0 and bcl2fastq-1.8.4, and aligned against the reference human genome (GRCh37.p5) using bwa-mem [48], and variant calls were made using the Genome Analysis Toolkit (GATK) [49] version 3.2-2 using previously described methods [50]. Duplicate reads were identified and flagged with the Picard MarkDuplicates tool. Realignment of reads around known indels was performed with the RealignerTargetCreator and IndelRealigner, and variants were called on individual samples using the HaplotypeCaller modules of the GATK.
In addition, whole exome sequencing was performed on 192 affected individuals from 95 Danish families (Replication set). Exome capture of the samples were carried out with the BGI Exon Kit following manufacturer's protocols (BGI, Shenzhen, China). DNA libraries were generated using combinatorial Probe Anchor Ligation (cPAL) technology, and 35 base paired end reads were generated from 500 bp genomic fragments. Whole exome sequencing was performed using the Complete Genomics platform (BGI) and using the manufacturer's pipeline. Reads were aligned against the National Center for Biotechnology Information (NCBI) build 37 reference human genome.

Variant annotation, filtering and prioritizing
The variant call files (VCF), containing the variant call results, generated by CMH and BGI were analyzed using Ingenuity Variant Analysis Software (Qiagen, Germany) and Golden Helix VarSeq Software v.1.2.1 (Bozeman, MT) for both the Discovery and Replication population sets. Variants were filtered based on variant quality control measurements, frequency and predicted pathogenicity, as well as a dominant inheritance model.
In the Ingenuity Variant Analysis, low quality variants (read depth <15 and call quality <20) were removed. Furthermore, we only included rare variants (i.e. MAF <1% in the 1000 Genomes Project, ExAC or in European American population in NHLBI ESP exomes) and variants that would likely have functional effect (i.e. variants that are predicted by SIFT or PolyPhen-2 as damaging or likely damaging, listed in Human Gene Mutation Database, or associated with gain or loss of function of a gene).
In VarSeq, variants with read depth <15 and genotype quality score <20 were excluded. Only rare variants in Europeans (MAF <1% or absent; 1kG Phase 3: Variant frequencies 5, GHI Jan 2015) and missense or loss-of-function variants were included for analyses. Further filtering was applied to the data obtained from VarSeq. Variant quality was enhanced by applying range criteria (0.3-0.85) for alternative allele frequency (i.e. ratio of alternate allele read depth / alternate allele read depth + reference allele read depth); variants outside this range were excluded. Allele frequencies were searched from the Sequencing Initiative Suomi (SISu) database (www.sisuproject.fi) for variants originating from the Finnish mother data, and from the Exome Aggregation Consortium (ExAC) database (http://exac. broadinstitute.org/) for Danish sister pair data. For these variants, a MAF cut-of value <1% in Finnish general population (SISu) or European "non-Finnish" (ExAC) population was used for Finnish or Danish mothers, respectively. The SISu database was used for Finnish mothers to exclude rare variants that are enriched in Finnish general population compared to the rest of the Europeans.
For the Discovery samples, we also used allele frequency calculations derived from Center for Pediatric Genomic Medicine's CMH Variant Warehouse database (http://warehouse.cmh. edu) including~3900 individuals previously sequenced at the center [50]. Pathogenicity was categorized according to the American College of Medical Genetics [21] as 1; previously reported to be disease-causing, 2; expected to be pathogenic (loss of initiation, premature stop codon, disruption of stop codon, whole-gene deletion, frame shifting indel, and disruption of splicing), and 3; unknown significance but potentially disease-causing (nonsynonymous substitution, in-frame indel, disruption of polypyrimidine tract, overlap with 5' exonic, 5' flank, or 3' exonic splice contexts). Only variants that fit one of these criteria (1−3) were included for analyses. Rare and novel variants with relatively high frequency in this internal control population were also excluded as they were thought to be technical artifacts.

Model of inheritance, gene and variant sharing filters
In Ingenuity Variant Analysis, a dominant inheritance model (including gain of function variants, and all heterozygous, compound heterozygous, haploinsufficient, hemizygous, and hetambiguous variants) was used to investigate predisposing variants that are inherited in the families. When analyzing affected mothers as a group, rare variants in genes that were common for a proportion of all cases were investigated. Whereas in family specific analyses, only variants that were shared by the affected individuals within each family were included.

Pathway analysis
Ingenuity Variant Analysis provides a list of most significant pathways calculated specifically for each filtering output. P-values were calculated according to Fisher's exact test assessing overlap enrichment of dataset-variant genes relative to known phenotype-implicated genes.
Here, only pathways with p<0.01 were included for further analyses.

Investigation of rare variants in genome-wide association study data
To investigate rare variants in genes arising from the whole exome data in a larger population setting including controls, we used available sources of preterm birth GWAS data. GWAS data from a large cohort, identified among 23andMe's research participants, included 43,568 mothers of general European ancestry [14] and meta-analysis data including 4,632 mothers from three independent Nordic (Finnish, Danish, and Norwegian) birth cohorts of European ancestry [51]. In addition, a set of GWAS data from a total of 608 mothers passing quality control measures was available. This set included mothers with spontaneous preterm deliveries and mothers with term deliveries originating exclusively from northern Finland. Genotyping was performed with Illumina Human CoreExome chip, followed by prephasing and imputation procedures with ShapeIT2 [52] and IMPUTE2 [53]. Association analysis was performed using SNPTEST v. 2.5.2 [54].

Resequencing the candidate variants
Since WES methodologies are associated with significant false positive rates, the presence of interesting variant findings from WES analyses was confirmed using Sanger sequencing. Samples were sequenced using capillary electrophoresis with ABI3500xL Genetic Analyzer (Applied Biosystems, CA) in Biocenter Oulu Sequencing Center, University of Oulu, Oulu, Finland. Details of the PCR primers and reaction conditions are available upon request.

Assessing the functional impact of the rare HSPA1L variants using in silico modeling
The possible functional effect of the rare HSPA1L variants (rs34620296 and rs150472288 from Discovery analyses as well as rs482145 and rs139193421 from Replication analyses) were investigated using in silico prediction tools such as SIFT, PolyPhen-2, MutationTaster and Mutatio-nAssessor. These pathogenicity predictions were annotated using Varseq, whereas CADD scores to identify pathogenic and deleterious variants were obtained from Ingenuity. Variants with CADD score >20 are amongst top 1% of deleterious variants in human genome [55]. RegulomeDB (http://www.regulomedb.org/) was used to investigate variant locations for e.g. chromatin state activity. In addition, we used HaploReg v4.1 (http://archive.broadinstitute. org/mammals/haploreg/haploreg.php) to assess whether variants are located within regions that show evidence for promoter or enhancer activity (i.e. presence of histone modification marks H3K4me1 and H3K27ac that are associated with enhancer regions, or H3K4me3 and H3K9ca that are associated with promoter regions), as well as for DNase I hypersensitivity in human tissues and cell line samples.
We further investigated the potential effects that missense variation rs34620296 (Ala268Thr) could have on protein sequence or structure. We used NetPhos 3.1 [56] to investigate possible changes in phosphorylation events in HSPA1L sequence. NetPhos 3.1 predicts serine, threonine or tyrosine phosphorylation sites in amino acid sequences of eukaryotic proteins. Evidence of being a phosphorylation site is given when the score is above the threshold (0.5). To investigate the possible effects of missense variant in protein structure, the reference protein structure and the modified protein structure, including the missense variant, were compared. Original (UniProtKB: P34931) and modified (Ala268Thr) amino acid sequences were submitted to SWISS-MODEL (https://swissmodel.expasy.org/) for protein modeling. Resulting protein models were compared simultaneously using UCSF Chimera. Molecular graphics and analyses were performed with the Chimera-1.11.2. package (https://www.cgl. ucsf.edu/chimera/). Chimera was developed by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco (supported by NIGMS P41-GM103311) [57].

In vitro functional analysis using human endometrial stromal fibroblasts
Construction of pcDNA 3.1(+) plasmid with HSPA1L inserts. pGEX-6P-1 vector including HSPA1L WT or Ala268Thr inserts were generously provided by Dr. Michael Snyder and Dr. Shinichi Takahashi from Stanford University. These vectors were generated as described in their previous study [33]. To obtain a more suitable vector for human derived cells, the HSPA1L inserts were cut from the pGEX-6P-1 vector using BamHI-HF (New England Biolabs) and NotI (Invitrogen) restriction enzymes. The HSPA1L sequences were subcloned into pcDNA3.1(+) vector (Invitrogen) by ligating into the BamHI/NotI sites using Quick ligase (New England Biolabs). The recombinant plasmids (WT or Ala268Thr) were sequenced to confirm the successful construction before continuing into further functional studies.
Cell culture and transfection. Two primary human endometrial stromal fibroblast (ESF) lines (HsESC_217S and HsESC_218S) were obtained from the Hugh Taylor/Clare Flannery labs in the Department of Obstetrics, Gynecology, and Reproductive Sciences at Yale University. HsESC_217S and HsESC_218S were derived from two separate patients each undergoing a polypectomy surgical procedure on days 3 and 9 of the menstrual cycle, respectively. ESFs were cultured in DMEM (GIBCO), supplemented with 10% fetal bovine serum (FBS), 200mM L-glutamine and 1% penicillin-streptomycin (GIBCO) in a humidified incubator at 37˚C with 5% CO 2 .
Western blotting. From the harvested cell pellet, protein was extracted using NE-PER Nuclear and Cytoplasmic Extraction Kit (Thermo Fisher Scientific) according to the manufacturer's instructions and stored at -80˚C until used. Protein content was assessed using Pierce BSA Protein Assay (Thermo Fisher Scientific) according to the manufacturer's instructions. Denaturated protein samples (20 μg for cytoplasmic or 10 μg for nuclear proteins) were loaded onto NuPAGE 4-12% Bis-Tris gels (Invitrogen) and separated by electrophoresis in 1x NuPAGE MOPS SDS running buffer (Thermo Fisher Scientific). Protein bands were electroblotted onto a nitrocellulose membrane (GE Healthcare) using NUPAGE transfer buffer (Thermo Fisher Scientific) and 20% methanol. After transferring the bands, membranes were blocked in 5% nonfat dry milk in Tris buffered saline (TBS; Fisher Scientific) containing 0.1% Tween 20 (Sigma) for 1h in RT. Membranes were incubated overnight in +4˚C with primary monoclonal rabbit antibodies against HSPA1L (dilution 1:1000; GTX111045; GeneTex), GR (dilution 1:1000; #3660S; Cell Signaling Technologies) or, as control, β-actin (dilution 1: 20,000; A5060; Sigma) in 5% nonfat dry milk TBS-0.1%Tween. In the following day, membranes were washed with TBS-0.1%Tween and incubated with horseradish peroxidase conjugated goat anti-rabbit IgG secondary antibody (dilution 1: 10,000; sc-2004; Santa Cruz Biotechnology) for one hour in RT. Membranes were washed with TBS-0.1%Tween, and for luminol based band detection, treated with Super Signal West Pico or Dura Luminol (Thermo Fisher Scientific). Protein band intensities of HSPA1L and GR were quantified relative to corresponding β-actin band (i.e. protein of interest: β-actin ratio) using Image J. The data was tested for normality, and one-way ANOVA in GraphPad Prism v.7.04 (GraphPad Software, Inc.) was used to compare the protein levels. Statistical significance was defined as a p<0.05.
RNA extraction, cDNA synthesis and real-time quantitative PCR. Total RNA was extracted using the RNeasy Mini Kit (Qiagen). cDNA synthesis from total RNA was performed using Quantitect Reverse Transcription Kit (Qiagen) according to manufacturer's instructions. Expression levels of mRNA were determined by real-time quantitative PCR (qPCR) on Step One Plus real-time PCR system (Applied Biosystems). Amplification reactions were performed as duplicate in 20μl final volume containing 25ng cDNA, WNT4 and GAPDH TaqMan gene expression assays (Hs01573505_m1 and Hs02758991_g1, respectively, Thermo Fisher Scientific), and 2x TaqMan Gene Expression Master Mix (Thermo Fisher Scientific). GAPDH was used as a housekeeping gene. Relative quantification of WNT4 was performed using the ΔΔCt method. One-tailed Student's t test was used to compare gene expression levels.
Supporting information S1 Fig. Pedigrees of the seven northern Finnish multiplex families with recurrent spontaneous preterm births. Circles represent females and squares males; symbols with a line intersecting indicate that individual is deceased. Diamonds denote an unspecified number of infants born at term, i.e. gestational age (GA) !37 weeks. Letters above symbols indicate individuals for whom WES data was gathered and an asterisk next to the letter denotes the individual as a recurrent mother. Letters inside brackets indicate grandmothers with term, twin or non-spontaneous deliveries, and were not included in the primary analyses. Pedigrees were created using Progeny Pedigree tool (https://pedigree.progenygenetics.com/). (PDF) . GWAS data for the preterm birth project DNBC samples were generated within the GENEVA consortium with funding provided through the NIH Genes, Environment, and Health Initiative (GEI, preterm birth: U01HG004423, dbGaP accession number phs000103.v1.p1). The GENEVA Coordinating Center (U01HG004446) provided assistance with genotype cleaning and general study coordination.