Exome sequencing in large, multiplex bipolar disorder families from Cuba

Bipolar disorder (BD) is a major psychiatric illness affecting around 1% of the global population. BD is characterized by recurrent manic and depressive episodes, and has an estimated heritability of around 70%. Research has identified the first BD susceptibility genes. However, the underlying pathways and regulatory networks remain largely unknown. Research suggests that the cumulative impact of common alleles with small effects explains only around 25–38% of the phenotypic variance for BD. A plausible hypothesis therefore is that rare, high penetrance variants may contribute to BD risk. The present study investigated the role of rare, nonsynonymous, and potentially functional variants via whole exome sequencing in 15 BD cases from two large, multiply affected families from Cuba. The high prevalence of BD in these pedigrees renders them promising in terms of the identification of genetic risk variants with large effect sizes. In addition, SNP array data were used to calculate polygenic risk scores for affected and unaffected family members. After correction for multiple testing, no significant increase in polygenic risk scores for common, BD-associated genetic variants was found in BD cases compared to healthy relatives. Exome sequencing identified a total of 17 rare and potentially damaging variants in 17 genes. The identified variants were shared by all investigated BD cases in the respective pedigree. The most promising variant was located in the gene SERPING1 (p.L349F), which has been reported previously as a genome-wide significant risk gene for schizophrenia. The present data suggest novel candidate genes for BD susceptibility, and may facilitate the discovery of disease-relevant pathways and regulatory networks.


Introduction
Bipolar disorder (BD) is a severe neuropsychiatric disorder with an estimated heritability of around 70% [1]. BD is characterized by recurrent episodes of depression and mania or hypomania. The lifetime prevalence of BD in the general population is around 1% [2], and the World Health Organization ranks BD among the top 10 leading causes of disease burden for the age group 15-to-44 years (The global burden of disease: 2004 update) [3,4].
Molecular genetic research indicates that BD is a multifactorial disorder, in which both genetic and environmental factors impact disease susceptibility [5]. Recent genome-wide association studies (GWAS) have identified the first BD susceptibility genes [3,[6][7][8][9][10]. However, the underlying pathways and regulatory networks remain largely unknown [11]. Research has shown that the cumulative impact of common alleles with small effects explains only around 25-38% of the phenotypic variance for BD [12,13]. A plausible hypothesis therefore is that rare, high penetrance variants may contribute to BD susceptibility. One approach to the evaluation of this hypothesis is the investigation of large, multigenerational pedigrees with a high prevalence of BD. In such pedigrees, the existence of a genetic variant of strong effect-as inherited from a common ancestor-may be more likely [14].
Results from initial whole exome and genome sequencing studies of BD suggest that the application of next generation sequencing is a promising method for the identification of rare variants in BD candidate genes and disease-related pathways. Recent investigations have comprised pedigree and trio analyses, as well as case-control studies [15][16][17]. Preliminary results suggest an enrichment of rare variants in gene-sets previously implicated in BD, such as mitogen-activated protein kinase (MAPK) signaling; axon guidance; calcium signaling; cyclic adenosine monophosphate response element binding protein (CREB) signaling; and potassium channels [18][19][20][21][22][23][24]. These results also implicate gene-sets identified in de novo studies of autism and schizophrenia in BD etiology [17]. A recent trio-based exome sequencing study suggested that de novo mutations may play a role in BD etiology, particularly in patients with early disease onset [15]. However, the overlap in implicated candidate genes between studies was limited, suggesting that the analysis of further samples and pedigrees is warranted before more definitive conclusions can be drawn.
The aim of the present study was to investigate the role of rare, nonsynonymous, and potentially functional variants in BD susceptibility by performing whole exome sequencing in two large, multiply affected BD families from Cuba. The high lifetime prevalence of BD in these pedigrees renders them promising in terms of the identification of high penetrance genetic risk variants. In addition, polygenic risk score (PRS) based analysis, that uses cumulated information contained in many small association signals throughout the genome, was performed to elucidate the underlying genetic architecture.

Phenotypic assessment and DNA collection
The present study was performed within the context of the Cuban Danish project on BD and schizophrenia, and was approved by the ethics committee of the National Center of Medical Genetics in Havana, Cuba. Written informed consent was obtained from all participants prior to inclusion.
All participants were recruited in the southeastern region of Cuba and underwent initial psychiatric assessment in 1997 and follow-up assessment in 2016. Each participant was examined at home by a research psychiatrist in cooperation with a local psychiatrist.
Best estimate diagnoses were assigned on the basis of: the Structured Clinical Interview for the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) (SCID); structured information obtained from other family members; and information from the responsible psychiatrist. DNA was obtained from whole blood by salting-out with saturated sodium chloride solution [25].

Sample description
The first family comprised six generations and 32 individuals (41% male), including 17 individuals with a known psychiatric disorder. DNA was available for a total of 11 individuals ( Fig  1A) with the following diagnoses: BD type I (BD I, n = 3); BD type II (BD II, n = 1); BD not otherwise specified (BD NOS, n = 1); alcohol abuse (n = 1); and no current or lifetime psychiatric disorder (healthy subjects, n = 5). The second family (Fig 1B) was bilineal. This family comprised four generations and 71 individuals (54% male), including 32 individuals with a known DSM-IV psychiatric disorder. DNA was available for 20 individuals with the following diagnoses: BD I (n = 4); BD II (n = 9); major depressive disorder (MDD, n = 1); alcohol abuse (n = 2); and healthy subjects (n = 4). To facilitate the analyses, family 2 was divided into subpedigrees 2, 3, and 4. For the 15 individuals selected for whole exome sequencing, additional clinical information is provided in S1 Table.

Polygenic risk scores (PRS)
PRS were calculated for all members of the two Cuban families with available SNP array genotype data. The calculations were performed using the summary statistics of the largest GWAS of BD to date. This GWAS was performed by the PGC, and comprised 20,352 cases and 31,358 controls [31]. After clumping of summary statistics based on best-guess genotype data, PRS were calculated using imputed dosage data in R v3.3 [32]. In all PRS computations, the allele with the positive effect direction was considered. The PRS therefore represent weighted, cumulative, additive risk. Cumulative PRS (sum of risk load) were calculated using 10 different pvalue thresholds: <5×10 −8 , <1×10 −7 , <1×10 −6 , <1×10 -5 , <1×10 −4 , <0.001, <0.01, <0.05, <0.1, and <0.2. PRS were scaled to represent relative risk load (minimum possible cumulative risk load = 0, maximum = 1). PRS analyses were conducted in R v3.3, using the function polygenic of the R package GenABEL, which implements a linear mixed model that takes family structure into account [33]. Gender was used as a covariate in the analysis. In a first step, residuals were calculated in the polygenic model using the formula phenotype~covariates, plus the genetic relationship matrix as a random effect. Residuals from this model were then used in a second linear model, with the formula residuals~PRS. In the second step, association test statistics with 95% confidence intervals (CI) were calculated using bootstrapping (R package boot, non-parametric bootstrapping using ordinary resampling with 2,000 replications) [34,35]. In accordance with the hypothesis that family members with BD have increased PRS for BD compared to healthy relatives, one-sided p-values were calculated for all PRS-based analyses. Since the residuals were not normally distributed, p-values were confirmed using permutation analyses (10,000 permutations) (S2 Table).

Exome sequencing and data analysis procedure
The exome sequencing step included a total of 15 BD cases (8 females, 7 males) from the two families. The three remaining BD cases were excluded. Two of these three individuals (individual 11, family 1; and individual 19, family 2) were excluded as they had been diagnosed with MDD in the initial 1997 assessment, and assigned a diagnosis of BD II in the reassessment in 2016, when exome sequencing in individuals with a BD diagnosis had already been performed. Individual 17 of family 2 was excluded on the grounds of being related by marriage only. The initial diagnoses of the BD cases, as well as the respective subtypes (BD I, II, or NOS), remained stable, and were confirmed during the second assessment. The SureSelect XT Human All Exon V5 capture library from Agilent Technologies (Santa Clara, California, USA) was used for target enrichment. Exome sequencing was performed on an Illumina HiSeq2500 v4 system (San Diego, California, USA) using 2 x 125 bp paired-end sequencing. A brief technical report is provided in Table 1. The generated exome sequencing data are available at the European Genome-phenome Archive (EGA; https://ega-archive.org) under accession number EGAS 00001003085.
For the analysis of pedigree data, the Varbank pipeline of the Cologne Center of Genomics was used (https://varbank.ccg.uni-koeln.de). This pipeline provides a user-friendly interface and incorporates a set of standard filters and annotations. The first filter was set to detect heterozygous variants with an allele read frequency of 25-75% relative to the reference. Variants with a minimum coverage of 10x and a quality score of 10 were included. The Varbank workflow includes the Genome Analysis Toolkit (GATK), and the GATK Best Practices workflow was applied. To decrease the number of specific pipeline artifacts, the default settings of the Varbank in-house database filter (comprising: epilepsy data, n = 511; and mixed case and control structural data, n = 611) were applied. The analysis of exome sequencing data focused on single-nucleotide variants (SNVs), as well as small insertions and deletions (InDels). The analysis was restricted to nonsynonymous variants with an effect on primary protein structure or a strong splice site effect. Focus was placed on variants that were shared by all BD cases within the respective pedigree.
To identify rare alleles, filtering for variants with a minor allele frequency (MAF) of < 0.1% was performed using the data of the Exome Aggregation Consortium (ExAC; http://exac. broadinstitute.org, version 0. 3,2015). For each variant, information concerning in silico predicted effects on protein structure was obtained from the database for nonsynonymous SNPs' functional predictions (dbNSFP). This database provides functional prediction scores from various algorithms, and annotations of nonsynonymous SNVs [36,37]. In accordance with Purcell et al., the analyses involved the commonly applied predictions of SIFT; Polyphen-2 (HumDiv-and HumVar-trained models); LRT; and MutationTaster [38]. Only variants that were predicted to have a potentially damaging (damaging/probably damaging/possibly damaging/deleterious/disease-causing) impact on protein structure by at least four of the five algorithms were included in subsequent analyses. Predictions for InDels were obtained from the tools MutationTaster and SIFT/Provean [39,40]. InDels predicted to have a possible deleterious effect by at least one of the three tools were included in the subsequent analyses.
To verify the exome sequencing results, Sanger sequencing was performed using standard polymerase chain reaction conditions and a 3130xl Genetic Analyzer (Applied Biosystems, Foster City, California, USA). Primer sequences are available upon request.
To estimate possible population effects, Sanger sequencing of the identified variants was performed in all individuals with available DNA from the two unrelated Cuban families, including individuals related by marriage only. These sequencing data were also used to follow up the segregation of variants in each family.

Brain-expression
To determine whether the identified candidate genes are expressed in the human brain, the Genotype-Tissue Expression (GTEx) Project portal (http://www.gtexportal.org/, on 05/11/

Non-parametric linkage analysis
The SNP array data were used to perform non-parametric linkage (NPL) analyses. NPL analysis is the method of choice for pedigree studies of complex genetic traits, since it is robust to uncertainty concerning the mode of inheritance [41]. NPL investigates whether affected individuals share alleles identity-by-descent more often than would be expected under a random segregation [41]. First, a QC of the SNP array data was performed using PLINK 1.9 (https:// www.cog-genomics.org/plink2) [42]. SNPs were filtered for a call rate of at least 95% and an MAF of � 5%. After linkage disequilibrium pruning with a window size of 500 kb and an r 2 threshold of 0.01, a set of 9690 SNPs remained. These were used for subsequent NPL analyses of pedigree 1 and subpedigrees 2 and 3. In addition, we also performed NPL analyses for the 17 rare candidate variants identified in the exome sequencing step. The two NPL Z-score statistics NPL all and NPL pairs were computed using the MERLIN software version 1.1.2 [43,44].
Both the standard linear model and the exponential likelihood model [45] implemented in MERLIN were applied to generate model-free LOD scores. Here, a strict phenotype definition was applied, whereby subjects with a diagnosis of BD I, BD II or BD NOS were classified as "affected"; healthy individuals were classified as "unaffected" and individuals with other psychiatric phenotypes were classified as "unknown".

Population substructure analysis
For the population substructure analysis, post-QC pre-imputation SNP array data were used. The analysis was carried out using PLINK 1.9 [42] and R. First, genetic data from the 1000 Genomes Project Phase 3 reference panel [29,30] were converted to the PLINK binary format and merged with the genotype data of the Cuban pedigrees. Second, the combined dataset was subjected to the following three variant filtering steps: (i) removal of variants with an MAF of < 0.05 or a HWE p-value of < 10 -3 ; (ii) removal of variants mapping to the extended MHC region (chromosome 6, 25-35 Mb) or to a typical inversion site on chromosome 8 (7-13 Mb); and (iii) LD pruning (command-indep-pairwise 200 100 0.2). Next, the pairwise identity-bystate matrix of all individuals was calculated using the command-genome on the filtered genotype data. A multidimensional scaling (MDS) analysis was then performed on the identity-by-state matrix using the eigendecomposition-based algorithm in PLINK.
Populations from 1000 Genomes were selected for the analysis using backward elimination, starting with all populations and retaining only the ones to which Cuban family members showed close clustering. The final population selection was: African ancestry in Barbados and the USA (ACB, ASW); Puerto Rican (PUR); Spanish (IBS); and Tuscan Italian (TSI). In an MDS analysis, high relatedness between family members leads to artifacts. To avoid such artifacts, only one representative subject of each highly related cluster was included in the analysis. Therefore, data on only 19 of the 31 family members are displayed (S1 Fig).

Relatedness and polygenic risk scores based on SNP genotype data
SNP array genotyping data confirmed the pedigree relationships depicted in Fig 1. After correction for multiple testing, PRS based on BD-associated genetic variants were not significantly higher in BD cases than in family members with no lifetime history of psychiatric illness (S2 Fig). When PRS calculation was restricted to BD variants with genome-wide significance (p-value threshold 5×10 −8 ), the PRS of BD cases from the two Cuban families showed a nominally significant increase compared to the PRS of family members with no psychiatric disorder or a non-BD psychiatric phenotype (p = 0.027, S2B Fig). Since affected family members showed no significant increase in the burden of common variants, an analysis of rare variants in these patients was considered a promising approach.

Pedigree information and technical summary
To identify rare variants with a potential contribution to BD susceptibility, a total of 15 individuals with BD I, BD II, or BD NOS were selected from the two Cuban families for whole exome sequencing. These comprised four subjects from family 1 (Fig 1A); and 11 subjects from family 2 (Fig 1B). Mean coverage ranged from 75x to 116x. On average, 91.5% of the target regions had a minimum coverage of 30x (Table 1). Family 2 was divided into subpedigrees 2, 3, and 4. Filtering and rare variant identification was performed separately for each subpedigree.

Analysis of exome sequencing data and rare variant identification
Application of the Varbank pipeline of the Cologne Center for Genomics (https://varbank.ccg. uni-koeln.de) led to the identification of an average of 1,019 variants with an MAF of < 1% per exome. A total of 20, 30, and 126 variants were shared by BD cases in pedigrees 1, 2, and 3, respectively. To identify variants of potentially higher penetrance, the analysis was restricted to very rare variants with an MAF of < 0.1%. This reduced the number of shared variants in the aforementioned pedigrees to 9, 14, and 59, respectively. Pedigrees 2 and 3 are unrelated, and gave rise to bilineal pedigree 4. Thus, none of the variants were shared by all BD cases from pedigree 4.
To prioritize the identified variants, filtering was performed for alleles that were potentially functional and predicted to be potentially damaging by at least four of five (SNVs) or one of three (InDels) of the applied prediction tools. This reduced the number of rare variants to 2, 3, and 12 variants in the respective pedigrees. These 17 candidate variants comprised 14 missense variants; and 3 small InDels, which all lead to a frameshift ( Table 2).
A broader variant list is provided in the S3 Table. This was generated using relaxed filter criteria (MAF � 5% according to ExAC and predicted to be potentially damaging by at least one of the five prediction tools).
Variants identified in pedigree 1 and subpedigrees 2 and 3. Bold formatting indicates alleles with a predicted potentially damaging effect on protein structure or function.

Segregation analysis
In the segregation analysis, testing for all 17 identified rare variants with potentially damaging effects on protein function was performed in all family members with available DNA from both families (Tables 3-5).
In pedigree 1 (Table 3), one rare variant located in the gene TMEM220 segregated with all five BD cases across four generations (Fig 1A; individuals 4, 5, 7, 9, and 11). This variant was also identified in two healthy relatives (Fig 1A; individuals 1 and 3), and one relative with alcohol abuse who had two children with a history of MDD (Fig 1A, individual 10).
The second candidate variant (rs141075266) was located in the SERPING1 gene. This variant segregated with BD in the four BD cases who had undergone exome sequencing, but was not present in the BD-affected offspring in the fifth generation (Fig 1A, individual 11). The variant was also identified in a healthy parent in the third generation, who was an obligate carrier (Fig 1A, individual 3).
In pedigree 2 (Table 4), three rare variants were identified in all BD cases: RCCD1; DNAH7 (rs201185180); and ABCA4 (rs61748549). Further evaluation of segregation in pedigree 2 was precluded by the absence of additional individuals with available DNA.
In pedigree 3 ( Table 5), none of the candidate variants identified in the exome-sequenced BD cases showed complete penetrance, since they were also detected in at least one of three  Table 3. Results of the segregation analysis of sequence variants in pedigree 1 as well as brain expression of the implicated genes and data from the NPL analyses. N, total number of individuals within each phenotype group. Numbers indicate how often the variants were found in the tested family members, as categorized according to phenotype. BD, Bipolar disorder types: BD I; BD II; and BD not otherwise specified. Other psychiatric disorders: recurrent major depressive disorder healthy relatives (Fig 1B; individuals 7, 8, and 20). Interestingly, the variants identified in EPS8L3 (rs148185176), MYH7, CRX (rs139340178), and SELENOO, were also present in two relatives with alcohol abuse (Fig 1B, individuals 6 and 18). For candidate variants identified in pedigrees 2 and 3, segregation was also investigated in four affected siblings from bilineal pedigree 4. In three of these four siblings, two missense variants in the genes COL3A1 and THYN1 were detected (Fig 1B; individuals 11, 12, 14, and 15). Individual 14 had transmitted the COL3A1 variant to their BD II-affected offspring (Fig 1B;  individual 19). On average, the remaining variants were present in 50% of the siblings and their offspring individual 19, in accordance with Mendelian expectation. One exception was the variant located in the gene SELENOO, which was not transmitted to any of the siblings. Tables 3-5 provide both a summary of these analyses, and details of the prioritized variants in the two multigenerational Cuban families. Information on genotypes at variant sites for the individuals in subpedigree 2 and the bilineal offspring in subpedigree 4. Individuals related by marriage only were excluded.

Pedigree 1 NPL
N, total number of individuals within each phenotype group. Numbers indicate how often the variants were found in the tested family members, as categorized according to phenotype. BD, Bipolar disorder types: BD I; BD II; and BD not otherwise specified. NPL, non-parametric linkage using the NPL all Z-score statistics; exLOD, logarithm of the odds (exponential model); p-value, probability of exLOD. Bold formatting indicates Reads Per Kilobase Million (RPKM) � 0.1. https://doi.org/10.1371/journal.pone.0205895.t004 All candidate variants identified in one pedigree were then sequenced in the other pedigrees and in relatives-by-marriage. None of the variants were found in the other family, or in subjects related by marriage only.

Non-parametric linkage analysis
Linkage analysis did not reveal any suggestive linkage signal exceeding the threshold of LOD > 2.2 [22] in any of the four applied approaches (NPL pairs and NPL all , linear and exponential likelihood model). This was attributable to the fact that the maximum achievable LOD scores were 1.47 in pedigree 1, 1.75 in subpedigree 2, and 0.60 in subpedigree 3 (NPL all , exponential likelihood model). For each rare, segregating candidate variant identified in the exome sequencing step, we calculated the LOD score of the NPL all exponential likelihood model (exLOD) and the corresponding p-value (Tables 3-5). In addition, exLOD scores and corresponding p-values of the markers closest to the variants with an MAF of � 5% and predicted to be potentially damaging by at least one of the five prediction tools are shown in S3 Table. Population substructure analysis Pedigree 1 mainly clustered with admixed Americans (PUR). However, two individuals also showed a degree of African admixture (ACB and ASW). Subpedigrees 2, 3, and 4 clustered with both Europeans (TSI and IBS) and admixed Americans. Married-in individuals mainly clustered with admixed Americans, with two married-in subjects showing differing degrees of African admixture (S1 Fig).

Discussion
In the present cohort, the PRS of BD cases, representing the genetic risk load for common BDassociated variants, were not significantly higher than the PRS of their unaffected relatives. This may suggest that BD susceptibility in the investigated Cuban pedigrees may have been conferred by highly penetrant genetic factors, rather than through polygenic effects. However, the results do not exclude the possibility that the high prevalence of BD in these families was attributable to the collective contribution of common variants, since the present approach had three limitations. First, no optimal p-value threshold has yet been established for the calculation of PRS on the basis of the-as yet unpublished-PGC GWAS of BD. The present analyses therefore applied ten different p-value thresholds for the inclusion of variants, and the significance threshold was adapted for ten tests. PRS calculated on the basis of variants with genomewide significance showed a nominally significant increase in BD cases. However, this result did not withstand correction for multiple testing. Second, the sample size for the PRS analysis was small (17 BD cases and 6 unaffected individuals), which resulted in limited statistical power. Third, if a polygenic contribution is present in families, unaffected family members may also show an increased PRS, while not having developed the disease due to the absence of other risk factors. To investigate this hypothesis, a comparison between unaffected family members from the present cohort and unrelated controls from the southeastern Cuban population is warranted.
The present whole exome sequencing approach identified 17 nonsynonymous, potentially functional, and rare variants with a potential contribution to BD development. All of these rare alleles showed incomplete penetrance for BD. This finding is unsurprising, since even large copy number variants-which represent the strongest known genetic risk factors for psychiatric disorders at the time of writing-show reduced penetrance [46].
The most promising finding of the present study was the rare coding variant in SERPING1, which was identified in pedigree 1. This gene is brain expressed and encodes the serpin peptidase inhibitor clade G member 1, a highly glycosylated plasma protein implicated in the complement cascade. The protein inhibits C1r and C1s of the first complement component, and thus regulates complement activation [47]. BD has shown a substantial genetic correlation of around 68% with schizophrenia and, interestingly, SERPING1 was among the genome-wide significant risk loci for schizophrenia identified in the largest GWAS of schizophrenia to date (rs9420, p = 2.24×10 −9 ) [12,48]. Furthermore, the complement cascade has recently been identified as an important pathway through which rare CNVs could modify risk for schizophrenia by influencing the efficiency of synaptic pruning [49]. The variant identified in the present study leads to a leucine to phenylalanine (p.L349F) substitution within a beta strand (http:// www.uniprot.org), and was predicted to have a deleterious effect on protein function by four of the five applied algorithms. SERPING1 is also a risk gene for hereditary angioedema (HAE). However, the present variant is not listed as a causal HAE mutation in established clinical databases [50]. In addition, no clinical features of HAE were reported in pedigree 1. Interestingly, a recent study has suggested a possible link between HAE and depression [51]. This association might be attributable to the chronic and life-threatening nature of HAE. Alternatively, it might suggest a potential pleiotropic effect secondary to the impact of rare variants.
The second variant in pedigree 1 was identified in TMEM220. At the time of writing, this gene is poorly characterized, and has no known role in any neuropsychiatric disorder.
In family 2, rare, potentially functional variants implicated a total of 15 candidate genes. Of these, 11 show brain-expression.
No overlap was found between the 17 potential candidate genes in the present study and findings from recent next generation sequencing studies of BD [15,17,19,20,[22][23][24]52]. This may reflect the pronounced genetic heterogeneity of BD; the variety of implicated pathways; ethnic differences as well as methodological differences between the respective studies, which applied differing variant prioritization approaches.
To generate further evidence for the role of the identified candidate variants in BD etiology, comprehensive analyses of data from further family-and replication studies are necessary. This could be achieved by combining exome sequencing data from international consortia, such as the Bipolar Sequencing Consortium [53]. In addition, the investigation of families with distinct subphenotypes may reduce genetic heterogeneity, and generate further insights into the observed differences in clinical presentation [20,54].
In the present analyses, data from the ExAC (http://exac.broadinstitute.org, version 0.3, 2015) were used to determine the MAF of all identified variants. ExAC compiles and harmonizes exome sequencing data from a variety of projects and cohorts, and comprises the exomes of 60,706 unrelated individuals. ExAC represents the largest collection of human exome data to date. The majority of ExAC data originate from Europe (60%). A lower proportion of the data originates from Africa (9%) and Latin America (10%). A recent admixture analysis demonstrated that the genetic ancestry of the Cuban population is 72% European, 20% African, and 8% Native American [55]. This is in line with the results of the present population substructure analysis, which indicated that the investigated pedigrees mainly clustered with Southern European and admixed American ethnicities. The ExAC data were therefore considered appropriate in terms of estimating the MAF of variants identified in the present cohort.
However, future studies are warranted to determine common and rare variant profiles in the Cuban population, a step which was beyond the scope of the present study.
All candidate variants identified in one pedigree were then tested in the other pedigrees and in individuals related by marriage only. According to the present sequencing data, none of the variants were found in both families or in subjects related by marriage. This provides further support for the hypothesis that the identified variants are indeed rare in the Cuban population. However, in view of the small sample size, the possibility that the deviations in MAF may have been attributable to population stratification cannot be excluded.
Candidate gene selection in the present study was based on a filtering and prioritization procedure. Although the applied prediction algorithms are widely used, their assessment of the potential impact of a given variant on protein structure and function has certain limitations. The applied approaches attempt to prioritize variants on the basis of their population allele frequencies and measures of conservation at the phylogenetic level, and/or in terms of amino acid properties [40,[56][57][58]. However, functional studies of the identified rare variants are required before more definite conclusions on their functional consequences can be drawn.
The focus of the present study was the identification of rare variants (MAF < 0.1%) with high penetrance. Since rare and low-frequency variants with higher MAFs (0.1-5%) and moderate effects may also contribute to BD development, a second filtering step was performed using relaxed criteria. The broader variant list generated in this step is presented in S3 Table  including linkage evidence from the respective regions. To assess their contribution to BD etiology, these variants require evaluation in future studies.
In conclusion, rare, potentially functional variants implicated a total of 17 genes in two multiply affected BD families from Cuba. These genes may thus play a role in BD etiology. The most promising variant was located in the gene SERPING1, which was reported as a genomewide significant risk gene for schizophrenia in previous research. The present data therefore suggest novel candidate genes for BD susceptibility, and may facilitate the discovery of diseaserelevant pathways and regulatory networks.
Supporting information S1 Table. Additional clinical information for individuals selected for whole exome sequencing. Overview of bipolar disorder (BD) subtype and psychosis-status of the investigated individuals. Age-at-onset is reported in 5-year intervals. (XLSX) S2 Table. Analysis of association between polygenic risk scores (PRS) and bipolar disorder (BD). Association test statistics including percentile-based 95% bootstrap confidence intervals (CI) for each of the ten p-value thresholds. Because of deviations from normality, one-sided pvalues were confirmed using permutation-based analyses (10,000 permutations). R 2 = coefficient of determination, SE = standard error. (XLSX) S3 Table. Overview of variants with an MAF � 5% that were predicted to be potentially damaging by at least one of the five tools. Genetic variants with an MAF of � 5% that were predicted to be potentially damaging by at least one of the five applied algorithms are shown in S3 Table. The identified variants were shared by all investigated BD cases in the respective pedigree.