Novel candidates of pathogenic variants of the BRCA1 and BRCA2 genes from a dataset of 3,552 Japanese whole genomes (3.5KJPNv2)

Identification of the population frequencies of definitely pathogenic germline variants in two major hereditary breast and ovarian cancer syndrome (HBOC) genes, BRCA1/2, is essential to estimate the number of HBOC patients. In addition, the identification of moderately penetrant HBOC gene variants that contribute to increasing the risk of breast and ovarian cancers in a population is critical to establish personalized health care. A prospective cohort subjected to genome analysis can provide both sets of information. Computational scoring and prospective cohort studies may help to identify such likely pathogenic variants in the general population. Using InterVar software, we annotated the variants in the BRCA1 and BRCA2 genes from a dataset of 3,552 whole-genome sequences obtained from members of the prospective cohorts with genome data in the Tohoku Medical Megabank Project (TMM). Computational impact scores (CADD_phred and Eigen_raw) and minor allele frequencies (MAFs) of pathogenic (P) and likely pathogenic (LP) variants in ClinVar were used as filtration criteria. Familial predispositions to cancers among the 35,000 TMM genome cohort participants were analyzed to verify the identified pathogenicity. Seven potentially pathogenic variants were newly identified. According to self-reported cancer histories, the sisters of carriers of these moderately deleterious variants and of definite P and LP variants among members of the TMM prospective cohort showed a statistically significant preponderance of cancer onset. Filtering by computational scoring and MAF is useful to identify potentially pathogenic variants in BRCA genes in the Japanese population. These results should help in following up the carriers of variants of uncertain significance in the HBOC genes in the longitudinal prospective cohort study.


Introduction
Since the precision medicine initiative was launched in 2015 by the US government [1], prediction of the disease risks of individuals by using their genomic information has become plausible in a clinical setting. In Japan, gene profiling assays for cancer tissues and companion diagnostic tests for cancer-predisposing genes are now covered by the national health insurance system. These gene profiling tests can examine variations in most of the genes conferring susceptibility to two major adult-onset hereditary cancer-predisposing syndromes, hereditary breast and ovarian cancer syndrome (HBOC) and Lynch syndrome. Nowadays, the clinical significance of variants of these genes is important for patient care and the health of their relatives at the bedside.
The correct judgment of the pathogenicity of germline variants in these cancer-predisposing genes is critical for the physicians who manage such patients and undertake gene profiling analyses for cancer treatment. For example, in carriers of disease-causing mutations of HBOC, prophylactic surgery is beneficial [2,3]. Testing of BRCA genes may help carriers' decision-making regarding prophylactic salpingectomy or salpingo-oophorectomy because, in patients with high-grade serous carcinoma arising from the fallopian tube, germline BRCA mutations are more prevalent in Japanese women than in other ethnic groups [4,5]. Synthetic lethal drugs for cancers associated with homologous recombination defects are available for patients carrying disease-causing mutations of HBOC [6,7]. In this context, variants of uncertain significance (VUSs) are a major source of problems for clinicians. Kurian et al. reported that inexperienced breast surgeons tend to manage patients with VUSs in the BRCA1 or BRCA2 gene as pathogenic HBOC mutation carriers [8]. This means that the lack of comprehensive annotation methods for variants might cause overdiagnosis or overtreatment in patients with BRCA mutations that are uncharacterized but actually benign.
To overcome these difficulties, studies have been conducted at several levels (single organization, single nation, and worldwide). As a single-organization study, Sugano et al. reported the BRCA1 and BRCA2 germline variants in 135 HBOC patients and identified 28 pathogenic ones [9]. As a nationwide study, Arai et al. examined 830 Japanese HBOC pedigrees collected by the Japanese HBOC consortium and identified 49 different pathogenic variants among them [10]. Similarly, a nationwide multicenter study revealed that germline BRCA1/2 mutations were present in 14.7% of 634 Japanese women with ovarian cancer [5]. Lee et al. also examined the variants in the BRCA1 and BRCA2 genes in breast and ovarian cancer patients' germline genomic DNA and calculated posterior probabilities for the disease-causing mutations; they identified five previously unreported variants as candidate pathogenic ones [11]. Finally, as an international study, the BRCA Challenge project established an open access database, BRCA Exchange, for providing reliable and easily accessible variant data for better clinical treatments of HBOC [12]. As of October 2020, the BRCA Exchange database has collected more than 40,000 variants in the BRCA1/2 genes from major clinical databases and estimated their pathogenicity under expert peer review in collaboration with the ENIGMA consortium [13]. The purposes of this comprehensive database are to provide reliable and easily accessible variant data interpreted for the high-penetrance phenotype of HBOC and to develop a model database for the utilization and sharing of public data to provide better clinical treatments of hereditary disease. In this database, there are more than 4,900 variants annotated as "pathogenic" by the ENIGMA consortium.
Recently, a large-scale Japanese project involving the sequencing of HBOC patients' germline genomic DNA for 11 breast cancer-predisposing genes revealed 134 pathogenic germline variants concentrated in cancer patients in the BRCA1 and BRCA2 genes [14]. Patient-based studies for identifying germline pathogenic variants are very effective for identifying potential variants of this kind, but cannot estimate the frequencies of those alleles in the general population, which is critical for estimating the number of HBOC patients in a community. In addition, moderately deleterious HBOC gene variants contribute to increasing the risk of breast and ovarian cancers in a population, so identifying them is critical for establishing personalized health care. The carriers of moderately deleterious HBOC variants would not undergo drastic prophylactic modalities, but frequent examination would be advisable for earlier detection of the cancers. A prospective cohort subjected to genome analysis would provide both sets of information.
Only analyses of prospective cohorts of the general population can confirm the causality of VUSs via the collection of follow-up data and using the precise minor allele frequencies. However, in the case of follow-up surveys in prospective cohorts, it is critical to focus on the participants who need to be carefully followed up because of the limitation of available resources [15]. An appropriate method to select participants for detailed follow-up studies is critical for analyzing the causalities of germline VUSs in cancer-predisposing genes.
Here, we describe the levels of known and potentially disease-causing variants in the BRCA genes among the general Japanese population, by analyzing a whole-genome reference panel for the Japanese characterized by the Tohoku Medical Megabank (TMM) Project. The TMM Project involves a combination of prospective cohort, biobanking, and genome-omics analysis (for reviews, see [16-19]). The dataset collected so far includes more than 3,500 independent whole-human-genome sequences (3.5KJPNv2) [20] with self-reported individual and family history data. The main benefit of the whole-genome sequencing of the dataset is that it provides more comprehensive information on the structure of the two HBOC genes than the exome-based approach. We also refer to these data to test whether computational annotation can identify any variants that might cause HBOC with high penetrance.

Ethics approval and consent to participate
This study was approved by the ethics committee of Tohoku Medical Megabank Organization at Tohoku University (registration number: 2018-4-003). All participants in the present study were recruited by Tohoku Medical Megabank Organization at Tohoku University and provided written informed consent to participate in the cohort study.

Dataset
Subjects were obtained from the TMM Community-Based Cohort (TMM CommCohort) Study established by Tohoku Medical Megabank Project [21], in which more than 120,000 adults participate. The whole-genome sequences of some of the participants have been obtained; the criteria for selecting WGS samples are described elsewhere [19,22]. In brief, the samples for development of the Japanese whole-genome sequencing dataset were selected based on the SNP array data of the samples. Only one sample was picked up from a kinship group to obtain the precise allele frequencies. The whole-genome sequencing was performed with HiSeq 2500 sequencers (Illumina, Inc., San Diego, CA) with a PCR-free protocol from the genomic DNA extracted from whole blood.
The BRCA variants in 3.5KJPNv2 were annotated with the InterVar [27] command line package (default options), which depends on ANNOVAR [28]. InterVar is an analytical package to estimate the clinical impact of gene variants based on guidelines for variant interpretation in clinical sequencing, namely, those of the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology [29]. InterVar includes annotations of ClinVar (version from December 1, 2015) [30] and predicts pathogenicity using indices such as Combined Annotation Dependent Depletion (CADD) [31], DANN [32], and Eigen [33]. The candidate pathogenic variants found in the Korean population [11] were reported as cDNA positions. To analyze these data with InterVar, the TransVar annotation program [34] was used to obtain the genomic positions of the variants, followed by the InterVar annotation described above. To compare the variant frequencies in 3.5KJPNv2 and in the gnomAD database for the BRCA1/2 variants, we downloaded gnomAD data [35] from the associated webpage (https://gnomad.broadinstitute.org/; downloaded on February 23, 2020). The selected variants were visualized with the mutation mapper at cBioPortal [36,37].
The RIKEN 2000 genome allele frequency data [38] were downloaded from the Japanese Encyclopedia of Genetic Associations (http://jenger.riken.jp/data) and TCGA germline variant data were as described previously [39].

Obtaining individual and family histories
TMM prospective cohort project data are stored in a supercomputer system with secure data access [40]. The TMM database is a relational database consisting of several separate datasets. Participant IDs serve as the key that links the information stored in the different tables. The individual and family histories were extracted from a large data matrix consisting of self-reported findings from a paper-based questionnaire given to the members of the cohort. The dataset consists of 35,199 participants in the TMM CommCohort and the data were frozen for distribution to the Japanese scientific community in 2017, as a provisional version. For most of the participants, whole-genome sequencing data are not available. The detailed method for obtaining the participants' past and family histories, which consist of 269 entries for malignant neoplasms and 1,271 for other diseases, is described elsewhere [21]. We did not use TMM Project Birth and Three-Generation Cohort data because the participants are expected to be relatively young and their family members may not be old enough to obtain positive cases [41].
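The linkage by participant ID described above can be pictured with a small in-memory relational example. This is only an illustrative sketch: the table names, column names, and participant IDs are hypothetical and do not reflect the actual TMM database schema.

```python
import sqlite3

# Hypothetical schema: one table for genotype-derived carrier status and one
# for questionnaire-derived family history, both keyed by participant ID.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE genotype (participant_id TEXT PRIMARY KEY, carrier INTEGER);
    CREATE TABLE history  (participant_id TEXT PRIMARY KEY, sister_cancer INTEGER);
    """
)
conn.executemany(
    "INSERT INTO genotype VALUES (?, ?)",
    [("P001", 1), ("P002", 0), ("P003", 0)],
)
conn.executemany(
    "INSERT INTO history VALUES (?, ?)",
    [("P001", 1), ("P002", 0), ("P003", 1)],
)

# Join the two tables on the shared key and count participants in each
# (carrier status, sister history) combination.
rows = conn.execute(
    """
    SELECT g.carrier, h.sister_cancer, COUNT(*)
    FROM genotype AS g
    JOIN history AS h USING (participant_id)
    GROUP BY g.carrier, h.sister_cancer
    ORDER BY g.carrier, h.sister_cancer
    """
).fetchall()
conn.close()
```

The grouped counts are the cells of a contingency table of carrier status against family history, which is the form needed for the comparisons between carriers and other participants.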
The self-reported questionnaire data were filtered to remove participants who checked more than 50 items for past and family histories of malignant neoplasms. Most of the participants who checked more than 50 items showed contradictory histories, such as a personal history of ovarian cancer recorded by male participants. Therefore, we decided to remove such records and obtained 35,136 records as a result. In the statistical analysis comparing carriers of candidate BRCA pathogenic variants and other TMM CommCohort participants regarding self-reported individual and family histories, we employed the binomial distribution to calculate the p-value. Then, we calculated the accumulation of past and family histories only for the items of malignant neoplasms. The questionnaire asked only about the presence or absence of such histories, which could be represented as "0" or "1" for each item. This made it impossible to give a weight to the numbers of affected siblings or offspring.
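The record filter and the binomial calculation can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not specify whether the binomial p-value was one-sided or two-sided, so an upper-tail probability is assumed here, and all records and counts in the example are hypothetical rather than taken from the study.

```python
from math import comb

MAX_CHECKED_ITEMS = 50  # records with more than 50 checked cancer items are removed


def keep_record(cancer_items):
    """Return True when a 0/1 questionnaire record has at most 50 checked items."""
    return sum(cancer_items) <= MAX_CHECKED_ITEMS


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    Interpreted here (as an assumption) as the chance of observing k or more
    positive histories among n carriers if carriers shared the background rate p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical records with 269 malignant-neoplasm items each.
records = {
    "A": [1] + [0] * 268,       # 1 item checked: kept
    "B": [1] * 60 + [0] * 209,  # 60 items checked: implausible, removed
}
filtered = {pid: items for pid, items in records.items() if keep_record(items)}

# Hypothetical example: 3 of 10 carriers positive against a 10% background rate.
p_value = binomial_upper_tail(k=3, n=10, p=0.1)
```

Because each questionnaire item is binary, the sum of a record is simply the number of checked items, which is why the filter reduces to a single comparison.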
To access data from the TMM prospective cohort project, users should obtain approval from the sample and data access committee of the TMM Biobank [17]. This committee consists of experts both inside and outside the TMM. Upon the receipt of an application to the committee, the Group of Materials and Information Management in the TMM at Tohoku University supports the procedures for data utilization. The Group of Materials and Information Management can be contacted at dist@megabank.tohoku.ac.jp.

Statistics
To analyze the correlations among the three computational estimates of the impacts of variants, we employed R 3.6.1 for calculating the Pearson correlation coefficient. We applied Fisher's exact test and chi-squared test with Yates' correction for calculating the p-values of the differences in numbers of cancer-bearing family members.
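For readers who want to see what these three statistics compute, the following standard-library Python sketch reproduces them for small inputs. It is not the R code used in the study; the function names are illustrative, the two-sided Fisher test uses the common "sum of tables no more probable than the observed one" convention, and non-zero table margins are assumed.

```python
from math import comb, erfc, sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def yates_chi2_p(a, b, c, d):
    """p-value of the chi-squared test with Yates' correction for [[a, b], [c, d]]."""
    n = a + b + c + d
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * corrected**2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
```

For a table of carriers versus non-carriers against sisters with versus without cancer, `yates_chi2_p` would be called with the four cell counts; Fisher's exact test is usually preferred when any expected cell count is small.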

Summary of BRCA variants in 3.5KJPNv2
More than 3,600 variants were found in the BRCA genes, 6.15% of which are in coding regions. The total proportion of coding exonic regions of the two genes is 9.58% in hg19 and 23.1% of the total variants in the two genes in 3.5KJPNv2 are indels. Indel calling using the short-read sequence data is less reliable than the findings for single-nucleotide variants, so the indels found in 3.5KJPNv2 may require further verification using long-read sequencing data.
How many known pathogenic mutations of the BRCA genes are identified in 3.5KJPNv2? We estimated this in a previous study on 2KJPN [42], relative to which there should be more pathogenic variants here. S1 Table indicates the annotation results of the variants in the BRCA1 and BRCA2 regions using the InterVar package. Ten variants in the BRCA genes are annotated as "pathogenic" (P) or "likely pathogenic" (LP) by referring to the ClinVar database. The accumulated frequency of pathogenic variations of BRCA genes in 3.5KJPNv2 is 0.0018, which might be lower than the clinical estimation of HBOC carriers in Japan.
To obtain deeper insight into the 3.5KJPNv2 BRCA variants, we compared the results with the gnomAD database, which contains more than 130,000, multi-ethnic, human exome variants (https://gnomad.broadinstitute.org/) (S1 Table). The total number of pathogenic variants in the BRCA genes is much smaller in 3.5KJPNv2 than in gnomAD (S1 Table). However, considering that the numbers of collected samples are quite different, the numbers of P and LP variants in 3.5KJPNv2 per population were very similar to those for gnomAD. Specifically, the rates of ClinVar P or LP variants were 2.81 × 10⁻³/person and 2.50 × 10⁻³/person in 3.5KJPNv2 and gnomAD, respectively. To investigate the population specificity, we extracted ClinVar and InterVar P or LP variants found in East Asian populations in the gnomAD database (gnomAD-EAS; Table 1). Intriguingly, there were only four overlaps between 3.5KJPNv2 and gnomAD-EAS for ClinVar and InterVar P or LP variants (Table 1). For example, one of the most prominent BRCA1 pathogenic variants, L63X [9,10], does not appear in gnomAD-EAS (Table 1). In contrast, the two most prevalent P or LP variants, BRCA2 p.G2508S and p.A2786T, are present in 3.5KJPNv2. These two variants may be commonly distributed among East Asian populations. These results support the notion that pathogenic variants of a gene are highly specific to each ethnic group and thus that population-specific collection of whole-genome sequencing data is critical for nationwide public health care planning [19].

Estimation of pathogenic variants in the two BRCA genes in the Japanese population
In the case of ClinVar, the data are based on previous reports of the identification of pathogenic variants in disease-predisposed families, so there might be new, unreported pathogenic variants to be found in the general population. To address this issue, we applied an annotation approach with InterVar. As stated above, InterVar is designed to estimate the clinical importance of human genetic variants that have not been reported previously, in accordance with the American College of Medical Genetics (ACMG) Guidelines of secondary findings in clinical sequencing [29]. Interestingly, the package annotates another 13 variants as P or LP in the BRCA genes, as well as all of the 10 ClinVar P and/or LP variants. Among the 13 newly annotated P or LP variants, 4 are frameshift indels and 9 are nonsynonymous variants. None of these four frameshift indels is annotated in dbSNP, so they should not be considered discordant with ClinVar. Four nonsynonymous variants detected by InterVar are annotated as "conflicting interpretation of pathogenicity" in the ClinVar database. One of the LP variants from InterVar, BRCA1 p.L52F, shows quite a high minor allele frequency (MAF) in 3.5KJPNv2 (0.0037) compared with other definite ClinVar P or LP variants. This variant was estimated to be a VUS in the Japanese HBOC consortium study [10] and "likely benign" by Lee et al. in a Korean prospective study on breast cancer patients [11].
There is a large publicly available dataset of Japanese whole-genome sequencing data from RIKEN [38]. It consists of deep sequencing data from 2,234 whole genomes (average depth of 25×), 1,939 of which are from BioBank Japan (BBJ), a large biobank of patients suffering from more than 50 diseases [43]. The detailed composition of the samples from BBJ is not available, but 1,276 patients with six diseases including breast cancer are included. Hence, it can be expected that pathogenic variants found in 3.5KJPNv2 might be enriched in the RIKEN dataset, although the selection criteria of the samples for the RIKEN project are unknown. As expected, two InterVar P or LP variants, BRCA2 c.5573_5577C and BRCA1 p.L63X, are enriched (9.75- and 16.2-fold, respectively) in the RIKEN dataset (S2 Table). In addition, a pathogenic variant not found in TMM 3.5KJPNv2 was identified (BRCA2 p.E2877X). In contrast, the prevalent InterVar LP variant, BRCA1 p.L52F, is not enriched in the RIKEN Japanese whole-genome dataset (0.5-fold, S2 Table). Similarly, we checked the germline variants of the BRCA genes in the TCGA dataset [39] and found three BRCA2 pathogenic variants that overlapped with 3.5KJPNv2 (p.T219fs, p.T1858fs, and p.N2134fs); all three of these are highly enriched in TCGA (382-fold, 318-fold, and 95.5-fold, respectively).
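The fold-enrichment values quoted above can be read as ratios of allele frequencies between two datasets. The text does not spell out the exact calculation, so the following Python sketch assumes that simple ratio; the allele counts and cohort sizes are hypothetical and chosen only to show the arithmetic.

```python
def allele_frequency(allele_count, n_individuals):
    """Allele frequency at an autosomal site (two alleles per individual)."""
    return allele_count / (2 * n_individuals)


def fold_enrichment(af_test, af_reference):
    """Ratio of the allele frequency in a test dataset to that in a reference panel."""
    return af_test / af_reference


# Hypothetical counts: 4 alleles among 2,000 patients versus 1 allele among
# 3,500 individuals from the general population.
af_patients = allele_frequency(4, 2000)
af_general = allele_frequency(1, 3500)
enrichment = fold_enrichment(af_patients, af_general)
```

A ratio well above 1 indicates that the variant is over-represented in the disease-enriched dataset, whereas a ratio near or below 1, as reported for BRCA1 p.L52F, argues against pathogenicity.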
These results suggest that the annotation by InterVar may include false positives as well as false negatives, although no VUSs identified by the Japanese HBOC consortium are included in our estimation [10]. Precise data on the MAFs obtained by the unbiased selection of panel constituents from the general population are critical for estimating the pathogenicity of VUSs and should be included in the criteria of pathogenicity for adult-onset hereditary disorders such as HBOC based on the InterVar annotation.

Estimate of computational scoring tools' performance in predicting pathogenicity of novel 3.5KJPNv2 BRCA variants
InterVar annotates the variants' functional impact based on the ACMG guidelines and it largely depends on previous reports to define the parameters for scoring. For example, criterion PS1 of InterVar states that "the variant involves the same amino acid change as a previously established pathogenic variant regardless of nucleotide change." This means that one needs previous knowledge about pathogenic variants in order to annotate a variant as "pathogenic" by InterVar. In contrast, only one supportive item, PP3, is used from computational estimations in InterVar: "Multiple lines of computational evidence support a pathogenic effect on the gene or gene product (e.g., conservation, evolutionary, splicing impact)." Hence, InterVar may underestimate the clinical impact of potentially pathogenic variants about which previous information is not available. The tendency might be worse in the noncoding regions of protein-coding genes such as the BRCA1/2 genes because of the lack of functional studies for such regions. Whole-genome sequencing data are now accumulating, and comparisons between phenotypes and the noncoding variants found by WGS will provide critical data for the interpretation of noncoding variants. Therefore, we tested whether unbiased computational estimations of the pathogenicity of variants can be used to find potentially pathogenic variants without previous knowledge. The Pearson correlation coefficients of CADD_phred with DANN_rankscore and Eigen_raw were determined to be 0.815 and 0.860, respectively, showing that both DANN and Eigen correlate well with CADD. Interestingly, however, the distributions of ClinVar and/or InterVar P or LP variants were quite different. The P or LP variants were more widely distributed in the scatter plot of CADD_phred versus DANN_rankscore than in that of CADD_phred versus Eigen_raw. Among these variants, the Pearson correlation coefficients of CADD_phred with DANN_rankscore and Eigen_raw were 0.127 and 0.541, respectively.
Interestingly, in both of the scatter plots, BRCA1 p.L52F, a benign variant annotated as LP by InterVar, showed similar scores to the other P or LP variants in the three parameters. S3 Table shows the details of the computational scoring for the ClinVar/InterVar P or LP variants. The ClinVar P or LP variants clearly showed higher average and minimum CADD_phred and Eigen_raw scores than the InterVar P or LP variants, but this was not the case for DANN_rankscore. Based on this observation, we decided to use CADD_phred and Eigen_raw for further filtration of potentially pathogenic mutations.
Minor allele frequencies are also critical parameters for interpreting the clinical impact of germline variants. As expected, both CADD_phred and Eigen_raw show weak positive correlations with the reverse logarithmic minor allele frequencies (Pearson correlation coefficients = 0.172 and 0.161, respectively). The CADD_phred and Eigen_raw scores of the InterVar P or LP variants are similar to those of the ClinVar P or LP variants (Table 2), with the exception of the BRCA1 L52F variant. Based on these comparisons, we defined computational thresholds for possible pathogenic BRCA single-nucleotide variants as follows: CADD_phred ≥ 25.9, Eigen_raw ≥ 0.501, and MAF ≤ 0.0003 (Fig 1A).
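The three-criteria filter can be expressed as a short Python sketch. The threshold values come from the text; the inequality directions (scores at or above the cutoff, MAF at or below it) are assumed from the surrounding description of ClinVar P or LP variants, and the `Variant` class and example variants are hypothetical.

```python
from dataclasses import dataclass

# Thresholds stated in the text; comparison directions are an assumption.
CADD_PHRED_MIN = 25.9
EIGEN_RAW_MIN = 0.501
MAF_MAX = 0.0003


@dataclass(frozen=True)
class Variant:
    name: str
    cadd_phred: float
    eigen_raw: float
    maf: float


def passes_filter(v):
    """True when a variant meets all three candidate-pathogenicity criteria."""
    return (
        v.cadd_phred >= CADD_PHRED_MIN
        and v.eigen_raw >= EIGEN_RAW_MIN
        and v.maf <= MAF_MAX
    )


# Hypothetical variants with made-up scores, for illustration only.
variants = [
    Variant("rare_high_score", cadd_phred=31.0, eigen_raw=0.80, maf=0.0001),
    Variant("common_high_score", cadd_phred=27.5, eigen_raw=0.60, maf=0.0037),
    Variant("rare_low_score", cadd_phred=12.0, eigen_raw=0.10, maf=0.0001),
]
candidates = [v.name for v in variants if passes_filter(v)]
```

Requiring all three conditions together is what excludes a variant with high computational scores but a comparatively high population frequency, the pattern described for BRCA1 p.L52F.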
Eight BRCA variants that fulfill these three criteria defined by the ClinVar P or LP variants are present in 3.5KJPNv2 (Table 2). One of these, BRCA2 p.G1529R, is annotated as "benign" or "likely benign" by ClinVar and InterVar, respectively. This variant is quite rare but found in two different ethnic groups, namely, African-Americans and non-Finnish Europeans (minor allele frequencies of 0.0003 and 0.0007, respectively; Table 2). Because ClinVar annotated the variant as "likely benign," we excluded it from further analysis. A summary of the filtering criteria is shown in Fig 1A. We also tested these criteria on the 134 potentially pathogenic BRCA variants that were shown to be enriched in women with breast cancer in a previous study [14]. Among them, only 13 variants are found in the latest version of the Japanese whole-genome reference panel (4.7KJPN: jmorp database: https://jmorp.megabank.tohoku.ac.jp/202001/) and all of the available MAFs are ≤ 0.0003. Eighty-seven variants are annotated as P or LP by both ClinVar and InterVar and 130 variants are annotated as P or LP by either ClinVar or InterVar. Four variants were annotated as "pathogenic" by Momozawa et al. [14], but not annotated as P or LP by InterVar (S4 Table). Among them, three variants showed high CADD_phred (24-35) and Eigen_raw scores (0.571-0.871) (S4 Table). The one exception is BRCA1 p.K1095E, which is annotated as "likely benign" by InterVar and for which neither the CADD_phred nor the Eigen_raw score meets our criteria for pathogenicity. Therefore, our criteria correspond well to the previous studies.
A summary of the variants identified in this study is shown in Table 2 and the distribution of the candidate pathogenic variants in the BRCA proteins is shown in Fig 1B. The nonsynonymous variants tend to localize at the C-terminal regions of the proteins, while the frameshift indels and stopgains are localized between the N-terminus and the middle of the protein sequence. BRCA2 I2675V is known as a "splicing error-causing variant" [44] and it is the most C-terminal variant causing large structural changes in the BRCA2 mRNA in our collection. So far, there are no other splicing error-causing variants of the BRCA1/2 genes in 3.5KJPNv2. As shown in Table 1, there are two splicing-affecting variants (BRCA2:g.chr13: 32890558AGdel and BRCA2:g.chr13: 32937315G>A) in gnomAD-EAS, indicating that we did not miss a large number of splicing error-causing variants in the BRCA1/2 genes in 3.5KJPNv2. We obtained additional annotations at cBioPortal to draw a schematic diagram; three candidate pathogenic variants identified based on the three criteria are annotated as likely oncogenic, as well as four InterVar P or LP variants (Table 2). This indicates that our approach can effectively identify the pathogenic variants in the BRCA genes. Sugano et al. described BRCA1 Y1853C as a VUS, although both ClinVar and InterVar annotated it as LP [9]. Later, Kawatsu et al. showed the pathogenic potential of this variant by experimental and genetic analyses [45]. Similarly, another variant, BRCA2 p.G2508S, is annotated as "likely neutral" by the OncoKB database. However, this variant was recently described as "moderately oncogenic" by Shimelis et al., based on a genome-wide association study of more than 12,000 cases and controls [46]. Therefore, we decided to include this variant for further study.
The basic quality control steps used when generating 3.5KJPNv2 were described by Tadaka et al. [20]. Their variant-calling procedure includes no step for ruling out false positives caused by clonal hematopoiesis [47], so some of the BRCA variants analyzed in this study might have originated from somatic mutations in the blood leukocytes of the cohort participants. However, we believe that clonal hematopoiesis should not have contributed substantially to our dataset for the following reasons. First, Tadaka et al. used GATK HaplotypeCaller for variant calling for 3.5KJPNv2, which is suitable for detecting variants in near-diploid genomes. Thus, most of the variants caused by clonal hematopoiesis would not reach sufficient variant read depth in a WGS sample. In addition, the average age of the individuals from whom the samples in 3.5KJPNv2 were derived was around 56 years [20]. Clonal hematopoiesis occurs mainly in the elderly, becoming prominent in those aged over 65 [47]. Hence, clonal hematopoiesis may not have strongly affected our results.

Potentially pathogenic BRCA variant carriers tend to have cancer-prone family histories
Members of the TMM CommCohort reported their individual and family histories of various disorders including cancers by completing a paper-based questionnaire. It is possible that the BRCA pathogenic variant carriers and their family members would suffer from cancers more often than other cohort members and their family members. Fig 2 indicates the numbers of cases of cancer among the participants themselves, their family members, and their spouses. Although the number of non-carrier participants was more than 1,500 times greater than the number of InterVar P or LP and computational + MAF-selected BRCA variant carriers, the overall profiles of cancer onset were similar. For example, fathers of the participants suffered more from cancers than mothers, regardless of the participants' status in terms of BRCA variants. A prominent difference between the carriers of potentially pathogenic BRCA variants and the rest of the cohort was in the rate of cancer-bearing sisters: the InterVar P or LP carriers were shown to have a much higher rate of cancer-bearing sisters than the rest of the cohort (Fig 2 and S5 Table; p = 3.08 × 10⁻⁵, chi-squared test with Yates' correction). In addition, the rate of cancer-bearing offspring was higher in the InterVar P or LP carriers than in the others, with marginal significance (p = 0.041). Interestingly, the rate of cancer onset of the participants themselves did not differ markedly between the InterVar P or LP carriers and the rest of the cohort members. This may be reasonable as nearly half of the InterVar P or LP carriers are male and thus are less likely to suffer from BRCA-related breast cancers (S5 Table).
Fig 2. Numbers of positive cases of self-reported individual and family histories of cancer among the TMM CommCohort participants. Vertical axes indicate the number of cases with positivity for each item below the horizontal axis. The right and left axes indicate the BRCA candidate pathogenic variant-positive and -negative cases, respectively. The scales of the vertical axes are adjusted by showing the "Family" bars at the same height. "Total cases" indicates the number of cases analyzed, while "Self" indicates individual past history of any malignancy. "Father," "Mother," "Brother," "Sister," "Offspring," and "Spouse" indicate the cancer-related histories of the participants' family members. "Family" indicates a case of any cancer among any of the blood relatives, except the participants themselves. Cases in which the "Spouse" was positive are not included in "Family." Solid and gray bars represent numbers of cases positive for the BRCA candidate pathogenic variants and the rest of the TMM CommCohort cases, respectively. Asterisks indicate statistically significant differences (single: p < 0.05, double: p < 10⁻⁴) upon comparison with the total analyzed TMM CommCohort cases.
Recent progress in bioinformatics may open up a completely different path for filtering the VUSs in hereditary disorders, namely, artificial intelligence-mediated approaches. One example of this is CADD, which was reported in 2014 [31]. CADD scores are based on calculations for all of the approximately 8.6 billion possible single-nucleotide changes in the human genome. Such calculation is based on machine learning using the evolutionarily conserved "proxy-neutral" variants found in both apes and humans and the recently emerged "proxy-pathogenic" rare variants in the human genome alone [48]. In 2016, a further dataset, Eigen, was released, for which calculation was performed without training data but with a principal component that gives the largest diversity among the variants prepared from all possible single-nucleotide changes in the human genome [33]. These annotation tools have achieved some clinically significant findings in genome-wide association studies (for example, see [49,50]). Recently, the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium successfully applied CADD to estimate the biological impacts of cancer mutations [51]. Therefore, computational scoring appears to be very powerful for predicting the clinical impact of single-nucleotide variants in cancer-predisposing genes. The present study shows the potential for applying this approach to find pathogenic variations in cancer-predisposing syndromes by using genome reference panels with precise MAF estimation. This study shows that the MAF estimate for the general population is much more useful for the annotation of pathogenic variants than the biased collection of population samples.
Recently, Findlay et al. reported that a saturation genomics-based approach could functionally characterize more than 4,000 BRCA1 variants that are in the functionally critical regions [52]. Thirteen BRCA1 variants in 3.5KJPNv2 correspond to entries in the list of Findlay et al., and three of them were annotated as loss-of-function variants by Findlay et al. (L63X, R71S, and Y1853C). Two variants were discordant between the work of Findlay et al. and the present study: BRCA1 p.R71S was not picked up by our survey but was annotated as "loss of function" in the dataset of Findlay et al., while BRCA1 V1696M was picked up by our survey but was annotated as "functional" in the same dataset. It is possible that the pathogenicity of BRCA variants is affected by other genetic modifiers and/or environmental factors. Most of the computational methods for estimating the impact of genetic variants depend on "known datasets" when they perform machine learning. There are probably many "unknown factors" that are essential for the correct estimation of the pathogenicity of variants. Further studies should be performed to provide new and critical information for the computational estimation of the pathogenicity of genetic variations. Follow-up of the carriers of these variants in prospective cohort studies may also provide clues to resolving any discordant results.
Among the family histories of TMM CommCohort members with potentially pathogenic BRCA variants, a significant preponderance of cancer was found only in the sisters. The carriers found in the TMM CommCohort were mainly male and the female carriers were relatively young, so the carriers themselves had not yet accumulated many cancer cases. A preponderance of a history of cancer in the mothers was not observed, but the mothers were presumably aged over 80, so their accumulation of sporadic cancers would have obscured the HBOC cases. Our study suggests that the self-reported data of the TMM CommCohort are useful to analyze the genotype-phenotype relationships, at least in cancer-predisposing syndromes.
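A comparison of this kind, cancer history among the sisters of carriers versus the sisters of the remaining participants, can be framed as a test on a 2×2 table. The following is a minimal sketch that assumes a one-sided Fisher's exact test; the test actually used in the present study may differ, and the function name and all counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns the probability, with all margins fixed, of observing a
    top-left count of at least `a` (hypergeometric upper tail).
    """
    row1 = a + b          # e.g. carriers
    col1 = a + c          # e.g. participants with a positive sister history
    n = a + b + c + d     # all participants analyzed
    total = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return tail / total


# Hypothetical counts for illustration only (not data from the TMM CommCohort):
# rows = carriers / non-carriers, columns = sister with / without cancer history.
p_value = fisher_exact_greater(6, 34, 300, 9700)
print(f"one-sided p = {p_value:.4g}")
```

An exact test is a reasonable choice here because the number of carriers is small, so the large-sample approximation behind a chi-squared test may not hold.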
It is not easy to estimate the clinical significance of VUSs that have clinically significant effects on the carriers' predisposition to cancer but relatively low penetrance. It is an important insight that VUSs may have moderate but significant effects on cancer onset, and that these effects can be reduced by personalized health care based on information on the genetic variant. Around 10 years ago, a review paper by Berger et al. proposed that haploinsufficiency is not so uncommon in the onset of cancer in HBOC patients with pathogenic variants in the BRCA genes [53]. Moderately deleterious variants are also critical for the successful establishment of precision medicine and/or personalized health care [54]. Carriers in moderately penetrant HBOC families may not require radical interventions such as prophylactic surgery, but they may be encouraged to continue undergoing close health checks to detect HBOC cancers as early as possible.

Conclusions
The present study indicates that a large dataset of Japanese whole-genome sequencing data (3.5KJPNv2) includes definitely and potentially pathogenic variants in representative genes responsible for HBOC: BRCA1 and BRCA2. ClinVar and the ACMG-guided annotation tool InterVar detected more than 20 variants in 3.5KJPNv2 as pathogenic or likely pathogenic, including one obviously benign variant. In addition, the combination of computational scoring and MAF picked up another eight candidates, including one variant defined as likely benign by ClinVar. Some of the variants show concordance with other databases in terms of the pathogenic annotations. The self-reported individual and family histories of the carriers of potentially pathogenic BRCA variants were analyzed, and the carriers' sisters showed a significant preponderance of cancer history. This study indicates that prospective genomic cohort studies are a powerful tool for identifying pathogenic variants. The present study should be useful for identifying such moderately deleterious variations in populations and should contribute to the development of personalized health care based on individual genomic information.
Supporting information S1