Abstract
Identifying genes associated with rare diseases remains challenging due to the scarcity of patients and the limited statistical power of traditional association methods. Here, we introduce PERADIGM (Phenotype Embedding similarity-based RAre DIsease Gene Mapping), a novel framework that leverages natural language processing techniques to integrate comprehensive phenotype information from electronic health records for rare disease gene discovery. PERADIGM employs an embedding model to capture relationships between ICD-10 codes, providing a nuanced representation of individual phenotypes. By utilizing patient similarity scores, it enhances the identification of candidate genes associated with disease-specific phenotypes, surpassing conventional methods that rely on binary disease status. We applied PERADIGM to the UK Biobank dataset for three rare diseases: autosomal dominant polycystic kidney disease (ADPKD), Marfan syndrome, and neurofibromatosis type 1 (NF1). PERADIGM identified additional candidate genes associated with ADPKD-related and Marfan syndrome-related phenotypes, some of which are supported by existing literature, and demonstrated enhanced signal detection for NF1-specific phenotypes beyond traditional methods. Our findings demonstrate the potential of PERADIGM to identify genes associated with rare diseases and related phenotypes by incorporating phenotype embeddings and patient similarity, providing a powerful tool for precision medicine and a deeper understanding of rare disease genetics and clinical manifestations.
Author summary
Rare diseases are difficult to study because they affect few people, even in large population studies. This makes it challenging to identify genetic variants that may contribute to these conditions. In this work, we developed a new approach that uses patterns in patients’ medical histories to help discover genes that may be involved in rare diseases. Instead of relying only on whether a person has a recorded diagnosis, we represent each person’s full clinical profile using information from their medical records. We then compare individuals based on how similar their overall health patterns are. We found that people who carry harmful genetic variants often share characteristic symptom patterns, even if they have not received a formal diagnosis. By using this similarity, our method increases the ability to detect genes that may influence disease features. We applied our approach to three disorders and identified both well-known disease genes and additional genes that may modify disease expression in different individuals. Our study highlights the value of using comprehensive medical records to improve the discovery of genes associated with rare diseases.
Citation: Zheng W, Xie Y, Gu J, Li H, Somlo S, Besse W, et al. (2025) PERADIGM: Phenotype embedding similarity-based rare disease gene mapping. PLoS Genet 21(12): e1011976. https://doi.org/10.1371/journal.pgen.1011976
Editor: Jingjing Yang, Emory University School of Medicine, UNITED STATES OF AMERICA
Received: August 13, 2025; Accepted: December 2, 2025; Published: December 18, 2025
Copyright: © 2025 Zheng et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: This study did not generate any new data. All analyses were conducted using existing genotype and phenotype data from the UK Biobank, accessible at https://www.ukbiobank.ac.uk/. The code for the PERADIGM framework is publicly available at https://github.com/JJJJJasonZheng/PERADIGM.
Funding: The study was supported by the National Institute of Child Health and Human Development (NICHD); National Institute of General Medical Sciences (NIGMS) Funder websites: https://www.nichd.nih.gov; https://www.nigms.nih.gov. Grant numbers: R03 HD100883, R01 GM134005 to HZ. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The advent of Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) [1] has facilitated the discovery of rare variants and their associations with both common and rare diseases. Numerous rare variant association methods have been developed to analyze WES and WGS data [2], characterizing a wide range of genotype-phenotype relationships. Despite some successes, these methods lack statistical power when the target phenotype is rare, which is often the case for rare diseases [3], even in large biobanks. Moreover, because rare diseases are often driven by rare variants, the likelihood of detecting significant associations is further reduced. For example, in the UK Biobank [4], which includes over 500,000 participants, the number of individuals with specific rare diseases is often small, making it difficult to identify associations using traditional rare variant association methods such as the Burden test [5], SKAT [6], and SKAT-O [7]. Historically, research on Mendelian diseases has focused on pathogenic variants. For instance, Autosomal Dominant Polycystic Kidney Disease (ADPKD) is primarily attributed to pathogenic variants in the PKD1 and PKD2 genes [8], while Marfan syndrome [9] and Neurofibromatosis type 1 (NF1) [10] are mainly associated with the FBN1 and NF1 genes, respectively. However, known pathogenic variants explain only a fraction of disease cases, suggesting the presence of additional candidate genes associated with disease-specific phenotypes. Identifying these genes remains challenging for traditional rare variant association methods due to the limited number of rare disease patients, even within large biobank cohorts.
Rare disease patients typically present unique phenotypic profiles [11], largely due to the strong effects of rare pathogenic variants [12]. As a result, these patients often exhibit distinctive phenotypes beyond their primary disease status. However, previous methods primarily rely on binary disease classifications, overlooking the rich phenotype data available. Electronic Health Records (EHR) [13] provide a valuable source of comprehensive phenotype information across large cohorts. In the UK Biobank inpatient dataset [4], longitudinal phenotype data for each individual are recorded using the 10th revision of the International Classification of Diseases (ICD-10) codes [14]. Most studies incorporate ICD-10 phenotype information as binary variables in association analyses, including genome-wide association studies (GWAS) [15], phenotype-wide association studies (PheWAS) [16], and rare variant association tests [6,7,17]. However, treating ICD-10 codes as binary variables may lead to information loss, particularly in capturing relationships among phenotypes. Relevant secondary manifestations may be present even when the primary diagnosis is not explicitly coded, either due to incomplete clinical documentation or enrollment prior to a formal diagnosis. While this may be less common in the UK Biobank cohort, which primarily includes older individuals, it remains a possibility. Some recent studies have demonstrated the value of expanding phenotype definitions in electronic health records (EHR) to better capture disease heterogeneity and support genetic discovery. For example, the Phenotype Risk Score (PheRS) framework [18] and subsequent extensions [19,20] leverage quantitative EHR-derived features to move beyond binary case definitions, improving rare variant interpretation and cross-biobank generalizability. 
Additionally, for some rare diseases, precise ICD-10 codes may be missing in the UK Biobank because participants tend to represent a relatively healthy population, and such diagnoses may be underreported or undetected. Leveraging the full spectrum of phenotype information can therefore provide a more comprehensive view of disease expression. In ADPKD, while PKD1 and PKD2 mutations are known to cause both kidney and liver cysts, the presence of liver cysts without kidney involvement may indicate other genetic causes. More broadly, the reliance on discrete ICD-10 codes can obscure phenotypic nuances, potentially limiting the identification of additional candidate genes associated with rare disease phenotypes. Besides identifying candidate genes for rare diseases, it is also valuable to uncover genes associated with disease-related phenotypes using EHR data. These phenotype-associated genes may provide further insights into genetic modifiers of disease phenotypes, secondary complications, and potential therapeutic targets. Addressing the limitations of ICD-10-based approaches by incorporating richer phenotype representations could improve the power of rare disease gene discovery and enhance our understanding of genotype-phenotype relationships.
To enhance phenotype information for candidate gene discovery, researchers increasingly apply natural language processing (NLP) models to phenotype data [21]. These models transform EHR phenotype terms into high-dimensional vector representations, generating phenotype embeddings that synthesize the entirety of available clinical data. Unlike traditional approaches that rely solely on ICD-10 codes, NLP-based embeddings can capture richer phenotypic representations by incorporating contextual information from structured and unstructured medical records. Established NLP embedding models, such as Word2Vec [22], generate static embeddings for medical events based on their co-occurrence patterns and relationships within the data. This approach enables phenotype embeddings to encode both the presence of specific phenotypes and their relationships with other clinical features documented in EHR, providing a more nuanced representation of a patient’s condition. By integrating textual and coded clinical data, NLP-driven embeddings offer a promising alternative to binary disease status representations, which may overlook critical phenotypic complexity. While some studies [23,24] have incorporated embedding models in gene discovery, their applications to rare diseases remain limited due to the scarcity of disease samples. As a result, few methods have leveraged embedding models for identifying candidate genes associated with rare disease phenotypes. Expanding NLP-based approaches in rare disease research could provide a more comprehensive understanding of genotype-phenotype relationships and improve the identification of relevant genetic contributors.
To address this research gap, we have developed a principled framework called Phenotype Embedding similarity-based RAre DIsease Gene Mapping (PERADIGM) that integrates genetic data with phenotype information derived from ICD-10 codes. Using an embedding model to represent each individual’s phenotype profile, we found that rare disease patients and loss-of-function variant carriers exhibit significantly higher phenotypic similarity within their respective groups compared to random controls. Leveraging this similarity, we expanded the effective pool of rare disease patients, thereby increasing statistical power for identifying candidate genes associated with disease-specific phenotypes. We evaluated PERADIGM on three rare diseases: ADPKD, Marfan syndrome, and NF1 using data from the UK Biobank. By systematically scanning all genes, PERADIGM identified significant candidate disease-modifying genes in which rare variants may affect the disease-related phenotypic spectrum, uncovering additional associations beyond those detected by traditional methods, some of which are supported by prior studies. These findings demonstrate that PERADIGM can effectively extract and utilize phenotype information through embedding models, leveraging patient similarity to improve the identification of candidate disease-modifying genes for rare diseases.
Results
Overview of PERADIGM
PERADIGM is a phenotype embedding similarity-based statistical framework designed to identify candidate genes associated with disease-specific phenotypes for rare diseases using ICD-10 codes and sequencing data. It aims to improve statistical power in gene discovery, particularly in biobank datasets where some cases may be missed or misclassified during diagnosis. In this context, we define “risk genes” as candidate genes whose rare variants are associated with phenotypic manifestations related to the target disease. Importantly, investigating genes associated with disease-specific phenotypes, rather than focusing solely on the disease diagnosis itself, may provide additional insights into the underlying mechanisms of disease progression, variability in clinical presentation, and potential genotype-phenotype relationships. The PERADIGM framework employs a two-stage procedure, as shown in Fig 1.
The core concept of PERADIGM is that carriers of pathogenic variants exhibit ICD-10 phenotype profiles more similar to patients with the target disease. A: Individual-level ICD-10 codes are embedded using a Word2Vec model, and patient embeddings are generated using weighted averages of code embeddings. B: Pairwise cosine similarity is computed between rare LoF carriers and disease patients. C: A gene-specific risk score is derived and significance is assessed through random sampling.
In the first stage, we utilize individual-level ICD-10 codes from EHR data to create embedding representations for each phenotype using a continuous bag-of-words (CBOW) Word2Vec model. Patient embeddings are then generated by taking a weighted average of the phenotype embeddings, where the weights are determined based on two considerations. First, we consider the statistical evidence of association between each phenotype and the target disease. Second, we account for the prevalence of each phenotype in the biobank. Further details are provided in the Methods section. This weighting strategy captures both the phenotype’s relevance to the target disease and its information content, resulting in an embedded phenotype profile for each individual.
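The weighted-average step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact weight formula is given in the Methods and is not reproduced here, so `code_weight` below is a hypothetical stand-in that multiplies an association-evidence term by an inverse-prevalence term, and the two-dimensional code vectors are toy values rather than trained Word2Vec embeddings.

```python
import math

def code_weight(assoc_p, n_with_code, n_total):
    """Hypothetical weight (an assumption, not the paper's formula):
    association evidence (-log10 p) times an IDF-style rarity term."""
    evidence = -math.log10(assoc_p)
    rarity = math.log(n_total / n_with_code)
    return evidence * rarity

def patient_embedding(codes, code_vectors, weights):
    """Weighted average of the embeddings of one patient's ICD-10 codes."""
    dim = len(next(iter(code_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for code in codes:
        if code not in code_vectors:
            continue  # skip codes that have no embedding
        w = weights.get(code, 0.0)
        for i, x in enumerate(code_vectors[code]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # zero vector when no weighted code is present
    return [x / weight_sum for x in total]

# Toy example: made-up 2-d vectors and weights, for illustration only.
code_vectors = {"Q61.2": [1.0, 0.0], "N18": [0.0, 1.0], "I10": [0.5, 0.5]}
weights = {"Q61.2": 3.0, "N18": 1.0, "I10": 0.0}
profile = patient_embedding(["Q61.2", "N18", "I10"], code_vectors, weights)
```

In this toy case the disease-relevant code dominates the profile, while the zero-weight code contributes nothing, which is the intended effect of combining relevance and information content.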
In the second stage, we conduct a genome-wide scan to calculate a risk score for each candidate gene with respect to the target rare disease, based on the embedded individual profiles. We hypothesize that genes associated with the target disease-associated phenotypes will show greater phenotypic similarity between individuals carrying rare loss-of-function (LoF) variants and those diagnosed with the disease. In PERADIGM, the similarity between two individuals is measured using the cosine similarity of their embedded phenotype profile vectors. We then derive a gene-specific risk score by comparing the phenotype embedding similarity between individuals carrying rare LoF variants and individuals diagnosed with the target disease. A higher similarity, and thus a higher risk score, between LoF variant carriers and disease cases suggests a higher likelihood that the candidate gene is associated with the disease. We assess the statistical significance of the observed score by comparing it to an empirical null distribution generated from repeated random sampling of individuals without regard to carrier status.
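The second stage can be illustrated with a small sketch. It assumes, for illustration only, that the gene-specific score is the mean carrier-to-case cosine similarity and that the null is built by drawing random groups of the same size as the carrier group; the precise score definition is in the Methods. The function names, the number of draws, and the toy profiles are all hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def risk_score(group, cases, profiles):
    """Mean similarity over all (group member, case) pairs, skipping self-pairs."""
    sims = [cosine(profiles[g], profiles[c])
            for g in group for c in cases if g != c]
    return sum(sims) / len(sims) if sims else 0.0

def empirical_p(carriers, cases, profiles, n_draws=1000, seed=0):
    """Compare the observed score with scores of randomly drawn groups."""
    rng = random.Random(seed)
    observed = risk_score(carriers, cases, profiles)
    population = list(profiles)  # sampled without regard to carrier status
    hits = 0
    for _ in range(n_draws):
        draw = rng.sample(population, len(carriers))
        if risk_score(draw, cases, profiles) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_draws + 1)

# Toy cohort: two cases, two carriers that resemble them, 20 unrelated profiles.
profiles = {"c1": [1.0, 0.0], "c2": [0.9, 0.1],
            "k1": [1.0, 0.05], "k2": [0.95, 0.0]}
profiles.update({f"b{i}": [0.0, 1.0] for i in range(20)})
observed, p_value = empirical_p(["k1", "k2"], ["c1", "c2"], profiles)
```

Adding one to both the numerator and the denominator keeps the empirical p-value strictly positive, a common convention for sampling-based tests; whether PERADIGM uses the same convention is not stated in this section.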
We note that PERADIGM uses widely available ICD-10 codes to identify candidate genes associated with rare disease-associated phenotypes, whereas some existing methods [25–27] integrate phenotype similarity using Human Phenotype Ontology (HPO) terms. By leveraging ICD-10 codes, PERADIGM can be more easily applied across diverse datasets and large-scale biobanks. In the following sections, we demonstrate the utility of PERADIGM through its application to several rare diseases using the first 200K WES release from the UK Biobank.
ADPKD
We first applied PERADIGM to identify candidate genes associated with ADPKD-related phenotypes [28]. ADPKD is one of the most common inherited kidney disorders, affecting approximately 1 in 400 to 1 in 1,000 individuals. It is characterized by the development of numerous fluid-filled cysts in the kidneys, leading to progressive renal dysfunction and, eventually, kidney failure. Approximately 78% of ADPKD cases are caused by pathogenic variants in the PKD1 gene, while another 15% result from variants in the PKD2 gene [29]. These genes encode polycystin-1 and polycystin-2, respectively, which are essential for maintaining the structural integrity of renal tubular cells and regulating calcium signaling. Disruptions to these proteins impair normal cellular function, leading to cyst formation and kidney enlargement.
While PKD1 and PKD2 account for the majority of ADPKD cases, variants in other genes have also been implicated in a small subset of patients. Although these genes are less frequent causes, they contribute to the clinical variability of ADPKD and broaden the spectrum of associated phenotypes [30]. However, because of the low prevalence of ADPKD in the UK Biobank dataset, where only 142 patients were coded as either Q61.2 or Q61.3, traditional rare variant association tests identify only PKD1 and PKD2 as significant after p-value adjustment. This limitation underscores the challenge of rare disease gene discovery and the potential value of leveraging phenotype-based similarity methods like PERADIGM to identify additional candidate genes associated with ADPKD-related phenotypes.
Before applying the PERADIGM approach, we first examined the intra-group similarity among individuals diagnosed with ADPKD and carriers of rare loss-of-function (LoF) variants in the PKD1 and PKD2 genes. Intra-group similarity is defined as the average pairwise phenotype similarity within a selected group, calculated based on each individual’s complete phenotype profile. We aimed to determine whether individuals within these two groups exhibited greater phenotypic similarity compared to randomly selected individuals. For the ADPKD group, the control group consisted of individuals without an ADPKD diagnosis in the inpatient EHR. For the variant carrier analysis, the control group included individuals without LoF or deleterious missense variants in the target gene.
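Intra-group similarity as defined above, the average pairwise similarity within a group, can be written in a few lines. This is a sketch on toy vectors; the function name `intra_group_similarity` is illustrative and does not come from the PERADIGM code base.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def intra_group_similarity(vectors):
    """Average cosine similarity over all unordered pairs in a group."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two individuals")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy profiles: a group sharing a dominant direction versus a scattered group.
group = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
controls = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
group_mean = intra_group_similarity(group)
control_mean = intra_group_similarity(controls)
```

Comparing `group_mean` with `control_mean` mirrors the group-versus-control contrast described in this paragraph.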
We first considered ADPKD patients. As shown in Fig 2A, there is an apparent separation of pairwise similarity scores between the ADPKD disease group and the ADPKD non-disease group. The ADPKD patients generally had higher pairwise similarity scores than non-patients, supported by the mean difference. This indicates that, on the phenotype level, ADPKD patients were more similar to each other compared to non-ADPKD individuals. These results suggest that the ADPKD patients shared some common and unique phenotypes which are significantly different from the non-patient group. We next considered the rare LoF variant carriers of PKD1 and PKD2. As shown in Fig 2B and 2C, PKD1 and PKD2 rare LoF variant carriers groups also exhibited more similar phenotypes. PKD1 rare LoF variant carriers had a wider range of pairwise similarity scores, while the non-carrier group was more concentrated around 0.1. PKD2 rare LoF variant carriers’ pairwise similarity score distribution showed a similar trend to ADPKD patients, compared to the controls, although the difference was more mitigated than in the ADPKD patients group. The PKD1 and PKD2 results show that while the pairwise similarity trends differed between the two genes’ rare LoF carriers, both had apparent differences compared to the control group. Moreover, the overall difference was less pronounced than the ADPKD patients group, suggesting that PKD1 and PKD2 only account for part of the phenotype patterns for ADPKD, and some other phenotype patterns cannot be explained by these two genes alone.
A: Pairwise similarity within the ADPKD patient group and the control group. B: Pairwise similarity within the PKD1 rare LoF carrier group and the control group. C: Pairwise similarity within the PKD2 rare LoF carrier group and the control group. D: Top 10 significantly associated ICD-10 codes among ADPKD patients. E: Top 10 significantly associated ICD-10 codes among PKD1 carriers. F: Top 10 significantly associated ICD-10 codes among PKD2 carriers.
After analyzing the pairwise phenotype similarity among ADPKD patients and among PKD1 and PKD2 rare LoF carriers, we next investigated which phenotypes contributed to the difference in phenotype patterns. In this analysis, we used logistic regression, adjusted for age and sex, to scan all the phenotypes available in the UK Biobank 200K dataset to identify phenotypes that are significantly associated with ADPKD. We considered 10,487 unique ICD-10 codes available in the UK Biobank 200K dataset.
For ADPKD, we identified 255 significantly associated phenotypes after p-value adjustment, excluding those present in only one patient. As illustrated in Fig 2D, the top 10 phenotypes demonstrate strong associations with kidney diseases, cystic conditions, and common ADPKD complications [29]. These results show that, as expected, ADPKD patients not only share the primary diagnosis but also exhibit a constellation of related phenotypes. Among the significant associations, conditions such as hyperkalemia and urinary tract infections, both commonly observed in chronic kidney disease and ADPKD, were also identified (see Supplementary Materials). The ability of our approach to capture these well-known phenotypic associations suggests that such phenotype data could be leveraged to enhance individual-level phenotype embeddings.
We further investigated the phenotypes associated with PKD1 and PKD2 rare LoF carrier status. As shown in Fig 2E and 2F, Q61.2 and Q61.3 were the two most significant phenotypes, as expected. Additionally, most of the top 10 significant phenotypes for PKD1 and PKD2 overlapped with those observed for ADPKD patients [8], further validating known genotype-phenotype relationships. Furthermore, we identified 68 significant phenotypes for PKD1 and 15 for PKD2 after p-value adjustment, which is substantially fewer than the 255 significant phenotypes identified for ADPKD patients. This discrepancy likely reflects the broader phenotypic spectrum captured in diagnosed ADPKD cases rather than solely the contributions of PKD1 and PKD2. Additionally, differences in coding practices among healthcare providers may contribute to variations in phenotype capture. Some providers may consistently document chronic kidney disease (CKD) as a general diagnosis during each visit while omitting more specific codes such as Q61.2 for ADPKD and other related findings. This limitation of ICD coding suggests that phenotype-based methods may help mitigate the effects of inconsistent coding practices. These results emphasize the potential of phenotype similarity-based approaches to improve the representation of individual disease profiles and enhance rare disease gene discovery.
After examining these phenotype patterns, we applied PERADIGM to all available genes in the UK Biobank 200K dataset to identify significant risk genes for ADPKD. After excluding genes without rare LoF variant carriers and those potentially affected by somatic mutations, and further restricting to genes with at least five rare LoF carriers, a total of 15,889 genes were included in the final analysis.
As shown in Fig 3A and Table 1, genome-wide scanning with PERADIGM identified three significant genes associated with ADPKD-related phenotypes after p-value adjustment. In comparison, SKAT-O identified only PKD1 and PKD2 after p-value adjustment (Fig 3B), whereas PERADIGM identified an additional candidate gene, IFT140. IFT140 is a well-established ciliopathy gene that has recently been implicated as a monogenic cause of renal cyst formation [31]. Mutations in IFT140 can disrupt cilia structure and function. However, SKAT-O failed to detect IFT140 as significant (raw p-value: 0.00146, adjusted p-value: 0.58, rank: 46), likely due to the small number of IFT140 rare LoF variant carriers among individuals diagnosed with ADPKD (only six individuals). This limited overlap reduces statistical power in binary status-based association tests like SKAT-O. In contrast, PERADIGM integrates phenotype information across all ADPKD-related phenotypes for each individual with rare LoF variants, providing a more comprehensive assessment beyond binary disease status. To further compare PERADIGM with SKAT-O, we examined overlapping genes among the top 100, 200, and 500 genes ranked by raw p-values. The results showed 6, 7, and 27 overlapping genes, respectively. The relatively small number of shared genes suggests that PERADIGM prioritizes different aspects of gene significance compared to SKAT-O, potentially capturing distinct biological signals related to ADPKD-specific phenotypes. These results demonstrate that incorporating phenotype similarity into gene discovery can enhance the identification of candidate genes associated with ADPKD-related phenotypes. By leveraging phenotype embeddings, PERADIGM not only validates known genetic contributors but also uncovers additional genes that may influence specific clinical manifestations, providing deeper insights into the phenotypic complexity of ADPKD.
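The top-k overlap comparison between two methods can be sketched as below. The gene labels and p-values are placeholders invented for the example and are not results from this study; `top_k_overlap` is a hypothetical helper name.

```python
def top_k_overlap(pvals_a, pvals_b, k):
    """Genes appearing in the top-k lists of both methods, ranked by raw p-value."""
    top_a = set(sorted(pvals_a, key=pvals_a.get)[:k])
    top_b = set(sorted(pvals_b, key=pvals_b.get)[:k])
    return top_a & top_b

# Placeholder p-values (not real results), keyed by placeholder gene labels.
method_a = {"GENE_A": 0.001, "GENE_B": 0.002, "GENE_C": 0.5, "GENE_D": 0.03}
method_b = {"GENE_A": 0.0005, "GENE_B": 0.4, "GENE_C": 0.01, "GENE_D": 0.2}
shared = top_k_overlap(method_a, method_b, k=2)
```

With real output, `method_a` and `method_b` would hold the per-gene raw p-values from PERADIGM and SKAT-O, and `k` would be set to 100, 200, or 500.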
A: QQ plot of ADPKD risk genes identified by PERADIGM. B: QQ plot of ADPKD risk genes identified by SKAT-O. C: QQ plot from PERADIGM using pre-selected ADPKD-related phenotypes. D: QQ plot from PERADIGM after excluding CKD patients. Blue points indicate previously validated genes; red points indicate newly identified genes.
Beyond the significant genes identified by PERADIGM (Table 1), additional evidence supports its enhanced power. Several genes previously linked to ADPKD, including ALG8 [32], ALG9 [33], and COL4A1 [34], showed marginally significant associations when analyzed with PERADIGM (Table 1). These genes had minimal overlap between rare LoF variant carriers and ADPKD diagnosis (ALG8: 1, ALG9: 0, COL4A1: 0), which likely contributed to their lack of significance in SKAT-O. This result highlights how limited overlap between rare variant carriers and ADPKD diagnoses can reduce SKAT-O’s power. Such limited overlap may reflect incomplete ICD coding, undiagnosed cases, or reduced penetrance rather than a true absence of association. In contrast, PERADIGM remains effective by leveraging broader phenotype patterns.
Moreover, we conducted an additional analysis to refine the ADPKD-related phenotype representation for diagnosed patients. Instead of using the entire phenotype profile, we preselected ADPKD-related phenotypes based on prior expertise when generating patient group embeddings. This stricter phenotype selection enhances specificity, further validates PERADIGM’s results, and provides more explainable associations between significant genes and ADPKD-specific phenotypes. For ADPKD, we selected ADPKD, aneurysms, and liver cysts as key ADPKD-related phenotypes.
As shown in Fig 3C and Table 1, the significant genes identified remained nearly the same as those found using PERADIGM without phenotype preselection, with one additional gene identified, ADTRP. This consistency indicates that both approaches yield robust gene discovery results and highlights PERADIGM’s ability to automatically assign higher weights to phenotypes most relevant to ADPKD. However, the results using the preselected phenotype group showed a stronger association of PKD1 and PKD2 with ADPKD-related phenotypes (both adjusted p-values = 0.00). This shift suggests that preselecting ADPKD-related phenotypes can slightly enhance statistical power, but the overall results remain consistent with those obtained without phenotype preselection, demonstrating the robustness of PERADIGM.
In addition, ADPKD is a specific subtype of chronic kidney disease (CKD), and the two conditions share many overlapping phenotypes. While CKD is often present in ADPKD patients, many of the complications and phenotypes observed in CKD can arise from a variety of non-genetic causes. To distinguish genes associated specifically with ADPKD-related phenotypes from those potentially confounded by common CKD, we conducted an additional analysis. Specifically, we excluded individuals diagnosed with CKD (ICD-10 code: N18) who did not also have a diagnosis of ADPKD. This filtering step aimed to remove individuals with non-ADPKD forms of CKD that could obscure the genetic signals specific to ADPKD. We then applied PERADIGM using this refined comparison group to identify genes associated with ADPKD-specific phenotypes. As shown in Fig 3D, all previously identified genes remained significant, consistent with our original findings. This result demonstrates the robustness of PERADIGM and its ability to detect ADPKD-related genetic associations even after removing potentially confounding CKD cases.
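The exclusion rule used in this sensitivity analysis can be expressed as a simple filter. The sketch assumes each individual is represented by a set of dotted ICD-10 code strings and that CKD subcodes share the `N18` prefix; the function name, the cohort, and the code formatting are illustrative and may differ from how codes are stored in the UK Biobank.

```python
ADPKD_CODES = {"Q61.2", "Q61.3"}   # ADPKD codes named in the text
CKD_PREFIX = "N18"                 # CKD block; subcodes assumed to share this prefix

def keep_for_ckd_sensitivity(codes):
    """Keep an individual unless they have CKD without an ADPKD diagnosis."""
    has_adpkd = any(code in ADPKD_CODES for code in codes)
    has_ckd = any(code.startswith(CKD_PREFIX) for code in codes)
    return has_adpkd or not has_ckd

# Toy cohort: ADPKD with CKD (kept), CKD only (dropped), neither (kept).
cohort = {
    "p1": {"Q61.2", "N18.5"},
    "p2": {"N18.3", "I10"},
    "p3": {"I10"},
}
kept = sorted(pid for pid, codes in cohort.items()
              if keep_for_ckd_sensitivity(codes))
```

Only the CKD-only individual is removed, so ADPKD patients with CKD remain in the case group while non-ADPKD CKD no longer contributes to the comparison group.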
Marfan syndrome
We next applied PERADIGM to Marfan syndrome [35], an autosomal dominant connective tissue disorder primarily caused by mutations in FBN1. This gene encodes fibrillin-1, a crucial protein for elastic fiber formation. Defective fibrillin-1 disrupts connective tissue integrity, leading to characteristic features such as tall stature, long limbs, flexible joints, lens dislocation, and serious cardiovascular complications, including aortic aneurysms and dissections. A patient cohort is clinically defined by the revised Ghent nosology [36], which requires the presence of cardinal manifestations, principally aortic root dilatation (aneurysm) or ectopia lentis (lens dislocation). In the absence of one of these, the diagnosis for a cohort subject requires a pathogenic FBN1 mutation or a systemic score (≥7) based on characteristic features. These features include the skeletal manifestations of tall stature, long limbs (dolichostenomelia), and flexible joints, as well as pectus deformities (carinatum or excavatum), scoliosis, and mitral valve prolapse [37]. While FBN1 mutations are the main cause, other genes have been implicated in Marfan syndrome and Marfan-like conditions, including TGFBR1 and TGFBR2 [38]. Mutations in these genes can also impair connective tissue function, contributing to the syndrome’s clinical manifestations. This suggests that beyond FBN1, multiple genes may influence connective tissue integrity and related pathways. In the UK Biobank 200K dataset, rare variant association tests such as SKAT-O, SKAT, and Burden tests identified only FBN1 as a significant gene after p-value adjustment. We applied PERADIGM to Marfan syndrome to explore additional candidate genes associated with Marfan-related phenotypes.
We first examined the intra-group similarity among Marfan syndrome patients and rare LoF variant carriers of FBN1. In the UK Biobank dataset, 39 individuals were diagnosed with Marfan syndrome (ICD-10 code: Q87.4). As shown in Fig 4A, the intra-group similarity scores of Marfan syndrome patients differed substantially from those of non-patients. The boxplot revealed an average pairwise similarity score difference of 0.55, which is notably higher than that for ADPKD. This strong intra-group similarity indicates that Marfan syndrome patients exhibit distinct and characteristic phenotype patterns. We then analyzed the intra-group similarity scores of 9 individuals carrying rare LoF variants in FBN1, five of whom were diagnosed with Marfan syndrome. Although the number of carriers was small, Fig 4B shows a clear difference between carriers and controls. The similarity scores were sparse due to the limited number of carriers, but the overall trend suggests slightly higher intra-group similarity among FBN1 carriers. Notably, the phenotype pattern of FBN1 rare LoF carriers was weaker than that of diagnosed Marfan syndrome patients, as reflected by lower intra-group similarity scores. This suggests that FBN1 exhibits incomplete penetrance for Marfan syndrome and that additional phenotype patterns may exist beyond those explained by FBN1 alone. These findings indicate the potential involvement of other genetic contributors in Marfan syndrome.
Fig 4. A: Pairwise similarity within the Marfan syndrome patient group and the control group. B: Pairwise similarity within the FBN1 variant carrier group and the control group. C: Top 10 significantly associated ICD-10 codes among Marfan syndrome patients. D: Top significantly associated ICD-10 codes among FBN1 variant carriers.
Next, we examined the significantly associated phenotypes of Marfan syndrome and FBN1 rare LoF variant carriers to better understand the phenotypic contributors to Marfan syndrome. After p-value adjustment, 51 significantly associated phenotypes remained, excluding those present in only one patient. Fig 4C illustrates the top 10 significant phenotypes associated with Marfan syndrome, which are predominantly related to the aorta, congenital heart disease, and blood abnormalities—all common complications of the disorder [39]. Additionally, some significant phenotypes without a direct known relationship to Marfan syndrome were identified, such as retinal detachment with retinal break and ingrowing nails (see Supplementary Materials). This pattern of significant phenotypes, comprising both common complications and less typical associations, is consistent with our findings in ADPKD. We then investigated whether FBN1 rare LoF carriers exhibited similar significant phenotypes to Marfan syndrome patients. Due to the limited number of carriers, only six phenotypes were significantly associated with FBN1 rare LoF variant status. Despite the small sample size, all six phenotypes were known complications of Marfan syndrome (Fig 4D) and were also observed in diagnosed patients. However, many additional significant phenotypes observed in Marfan syndrome patients were absent in FBN1 rare LoF carriers. This indicates that FBN1 exhibits incomplete penetrance for Marfan syndrome, suggesting that other factors may influence the broader spectrum of significant phenotypes associated with the disease.
After exploring the phenotype patterns of Marfan syndrome and FBN1 rare LoF carriers, we applied PERADIGM to scan all genes in the UK Biobank 200K dataset to identify significant genes associated with Marfan syndrome-specific phenotypes [40]. As shown in Fig 5A and Table 2, the whole-genome scan identified two significant genes, FBN1 and COL5A1, after p-value adjustment, whereas SKAT-O identified only FBN1 (Fig 5B). While FBN1 is a well-established gene for Marfan syndrome, COL5A1, primarily associated with Ehlers-Danlos syndrome (EDS), has been reported in cases where EDS and Marfan syndrome are difficult to distinguish due to overlapping connective tissue abnormalities [41]. COL5A1 mutations result in abnormal type V collagen production, leading to symptoms that resemble certain aspects of Marfan syndrome. In our dataset, none of the 26 COL5A1 rare LoF carriers overlapped with diagnosed Marfan syndrome patients, which explains why traditional rare variant association methods, which depend on such overlap, cannot detect this association. The ICD-10 coding system sometimes struggles to distinguish rare diseases like Marfan syndrome and EDS, leading to potential misdiagnoses.
Fig 5. A: QQ plot of Marfan syndrome risk genes identified by PERADIGM. B: QQ plot of Marfan syndrome risk genes identified by SKAT-O. C: QQ plot from PERADIGM using pre-selected Marfan-related phenotypes. Blue points indicate validated genes; red points indicate newly identified genes.
We also conducted an additional analysis to refine the representation of the Marfan syndrome phenotype in diagnosed patients. Instead of using the entire phenotype profile, we embedded the diagnosed patients using only their longitudinal ICD-10 codes for Marfan syndrome, ensuring that the embeddings were derived solely from repeated Marfan syndrome diagnoses over time. This stricter embedding process focuses more specifically on Marfan syndrome itself, providing more explainable results. As shown in Fig 5C and Table 2, six genes were significantly associated with Marfan syndrome-specific phenotypes. These include the two previously identified genes, indicating that refining phenotype representation can help uncover additional relevant genes. PERADIGM also identified TGFBR2, a gene associated with Loeys-Dietz syndrome [42], which has overlapping cardiovascular and skeletal manifestations with Marfan syndrome. Mutations in TGFBR2 disrupt the TGF-β signaling pathway, which is essential for connective tissue maintenance and repair, leading to clinical features that can resemble Marfan syndrome. In the UK Biobank dataset, there was no overlap between Marfan syndrome patients and TGFBR2 rare LoF carriers, preventing traditional binary-based methods from detecting this association. By leveraging full phenotype embeddings, our approach can help mitigate these issues by capturing broader phenotype patterns. These results further demonstrate that PERADIGM can effectively identify genetic contributors to Marfan syndrome-specific phenotypes by integrating comprehensive phenotype information. By utilizing phenotype embeddings, PERADIGM enables the discovery of genes that contribute to disease manifestations through shared pathways or overlapping syndromes, even when there is no direct overlap between gene carriers and diagnosed patients.
NF1
Neurofibromatosis type 1 (NF1) [43] is a complex genetic disorder characterized by the formation of multiple benign tumors, primarily neurofibromas, along nerves in the skin, brain, and other tissues. The disorder is primarily caused by mutations in the NF1 gene, which encodes neurofibromin, a tumor suppressor protein that regulates cell growth and division. While NF1 mutations are the main genetic driver, recent studies suggest that additional genes may influence disease severity and phenotypic variability among NF1 patients. Identifying such modifier genes could improve our understanding of NF1 pathogenesis and facilitate more personalized diagnostic and therapeutic strategies. In the UK Biobank 200K European ancestry dataset, there are 62 NF1 cases, limiting the statistical power of traditional rare variant association tests to detect genetic factors beyond NF1 itself. To address this limitation, we applied PERADIGM to identify genetic contributors that may influence NF1-specific phenotypes or modify the clinical presentation, expanding beyond the established role of NF1.
We first examined the intra-group similarity among NF1 patients and rare LoF carriers of the NF1 gene. The UK Biobank 200K European ancestry dataset contained 62 individuals diagnosed with NF1 (ICD-10 code: Q85.0). As shown in Fig 6A, NF1 patients exhibited significantly higher intra-group similarity compared to the control group, indicating distinct phenotypic characteristics. We then analyzed intra-group similarity for 105 individuals carrying rare LoF variants in NF1, among whom only 19 had a documented NF1 diagnosis. This suggests that while NF1 mutations are the primary cause of NF1, many NF1 rare LoF variant carriers exhibit reduced or incomplete penetrance; conversely, some diagnosed patients do not carry detectable NF1 rare LoF variants. As depicted in Fig 6B, the difference in intra-group similarity between NF1 rare LoF carriers and controls was less pronounced than that observed for diagnosed NF1 patients. These findings are consistent with results from ADPKD and Marfan syndrome, suggesting that additional genetic or environmental factors likely influence NF1 disease manifestation beyond NF1 mutations alone.
Fig 6. A: Pairwise similarity within the NF1 patient group and the control group. B: Pairwise similarity within the NF1 variant carrier group and the control group. C: Top 10 significantly associated ICD-10 codes among NF1 patients. D: Top significantly associated ICD-10 codes among NF1 variant carriers.
We examined phenotypes significantly associated with NF1 disease and NF1 rare LoF variant carrier status, identifying 38 phenotypes associated with NF1 disease (Fig 6C). The top 10 significant phenotypes closely aligned with known clinical manifestations, reinforcing the multisystem involvement characteristic of NF1 [44]. These phenotypes primarily affected the nervous system, skin, and skeleton, with some involvement of other organ systems. This comprehensive phenotypic profile enhances our understanding of NF1’s diverse clinical presentation. In contrast, despite a larger number of NF1 rare LoF variant carriers compared to diagnosed NF1 patients, only four phenotypes remained significant after p-value adjustment (Fig 6D). As expected, NF1 disease (Q85.0) itself was the most significant phenotype, followed by D36.1, which is directly related to NF1. Interestingly, the remaining two significant phenotypes (I83.1 and N20.2) are not commonly associated with NF1, suggesting potential novel genotype-phenotype correlations or incidental findings. NF1 patients exhibit a distinct and consistent phenotypic profile, whereas NF1 rare LoF variant carriers present a more variable and diluted phenotype pattern. Apart from the strong association with NF1 disease itself, other phenotype relationships in carriers were less pronounced. This finding highlights the complexity of genotype-phenotype relationships in NF1 and suggests that additional genetic modifiers or environmental factors may influence disease manifestation.
Following our exploration of phenotype patterns in NF1 disease patients and NF1 rare LoF variant carriers, we applied PERADIGM to scan all available genes in our dataset for associations with NF1-specific phenotypes. We then compared our results with those obtained using SKAT-O. As shown in Fig 7A, after p-value adjustment, PERADIGM identified only NF1 as a statistically significant gene associated with NF1-specific phenotypes, consistent with SKAT-O results (Fig 7B). Both methods converged on NF1 as the sole significant genetic factor in our dataset. As with Marfan syndrome, we conducted an additional analysis to refine the representation of the NF1 phenotype among diagnosed patients. Instead of using the entire phenotype profile, we embedded patients using only their NF1-related ICD-10 codes. As shown in Fig 7C, the NF1 gene remained the only one identified, though with a smaller p-value compared to the SKAT-O result. The difficulty in identifying additional significant genes may be due to the relatively diluted phenotypic patterns observed among NF1 rare LoF variant carriers, as noted in our previous analyses. This suggests that while NF1 is the primary genetic driver, the complexity of NF1 disease manifestation may involve additional genetic modifiers or environmental influences that remain undetected with current sample sizes and methods.
Fig 7. A: QQ plot of NF1 risk genes identified by PERADIGM. B: QQ plot of NF1 risk genes identified by SKAT-O. C: QQ plot from PERADIGM using pre-selected NF1-related phenotypes.
From the QQ plot, we observed a departure from the central line at the tail using PERADIGM, suggesting potential additional signals. The top 10 genes are listed in Table 3. Among them, P2RX4 encodes a purinergic receptor involved in cell signaling, particularly within the nervous system [45]. Impaired P2X receptor signaling has been reported in microglia with NF1 mutations, highlighting the importance of purinergic receptors in NF1-related neurological abnormalities. Given its role, P2RX4 may contribute to disrupted signaling pathways affecting microglial motility and phagocytosis in NF1 [46]. Although these genes do not reach statistical significance after p-value adjustment, they may still provide valuable insights into genetic factors influencing NF1-specific phenotypes. Further investigation is needed to determine their potential contributions to NF1 pathophysiology.
Furthermore, we examined SPRED1 [47] and LZTR1 [48], two genes previously implicated as potential contributors to NF1-specific phenotypes. Although neither gene reached statistical significance in PERADIGM or SKAT-O after p-value correction, our analysis revealed that SPRED1 had 13 rare LoF variant carriers, none of whom were diagnosed with NF1 in our dataset. Similarly, LZTR1 had 304 rare LoF variant carriers. As shown in Table 3, both SPRED1 and LZTR1 exhibited marginal significance in PERADIGM, whereas SKAT-O identified only LZTR1 as marginally significant. This discrepancy arises because SKAT-O relies heavily on direct overlap between LoF variant carriers and diagnosed patients, whereas PERADIGM can detect associations by leveraging shared phenotypic patterns. For example, despite the lack of overlap between SPRED1 carriers and NF1 patients, PERADIGM still identified a marginal association by capturing phenotypic similarities. These findings further demonstrate the potential of PERADIGM to identify additional genes associated with NF1-specific phenotypes beyond traditional rare variant association methods.
Discussion
Whole exome sequencing (WES) and whole genome sequencing (WGS) have become invaluable tools for elucidating the genetic basis of human diseases, particularly rare disorders. Rare diseases, especially those with Mendelian inheritance patterns, are typically caused by pathogenic rare LoF variants in one or a few genes. Unlike common variants, rare variants often exert larger effect sizes on phenotypes, but their carriers are relatively few, even in large population cohorts such as the UK Biobank. To address the challenges of identifying these rare pathogenic variants, several rare variant association methods have been developed, including the Burden test, SKAT, and SKAT-O. These approaches improve detection power by aggregating multiple rare variants within genes or regions, making them more effective than traditional genome-wide association studies (GWAS). However, the application of these methods to rare diseases remains limited by the scarcity of affected individuals and the low frequency of causal variants. Additionally, in large biobank datasets, some individuals may have undiagnosed conditions, further reducing statistical power.
In this paper, we show that rare disease patients often exhibit distinct phenotype patterns that extend beyond their primary diagnosis. However, existing methods have not fully leveraged this comprehensive phenotypic information, focusing primarily on binary disease status. By incorporating broader phenotypic profiles, rather than restricting analyses to disease presence or absence, we may enhance our ability to detect and characterize genetic contributors to rare diseases, even in the face of limited sample sizes and missing diagnostic information.
To better utilize phenotype data, we developed PERADIGM, a framework that integrates natural language processing (NLP) techniques to extract and leverage detailed phenotype information from each individual. Unlike traditional methods that rely solely on binary disease status, PERADIGM incorporates phenotype similarity into gene identification, improving sensitivity to genetic contributors associated with disease-specific phenotypes. Through multiple case studies, we demonstrate that this approach enhances statistical power and identifies additional candidate genes that conventional methods may overlook.
We applied PERADIGM to three rare diseases: autosomal dominant polycystic kidney disease (ADPKD), Marfan syndrome, and neurofibromatosis type 1 (NF1). Each of these disorders has well-established major causative genes: PKD1 and PKD2 for ADPKD, FBN1 for Marfan syndrome, and NF1 for NF1 disease. While traditional rare variant association tests, such as the Burden test, SKAT, and SKAT-O, successfully identified these primary genes in UK Biobank data, they failed to detect additional genes with smaller effects. In contrast, PERADIGM identified a broader set of genes associated with disease-specific phenotypes, some of which have supporting evidence from prior studies. To further enhance interpretability, we conducted an additional analysis to refine phenotype embeddings for diagnosed patients. Instead of embedding the full phenotype profile, we restricted embeddings to disease-related phenotypes based on expert knowledge. This targeted approach provided more explainable results by ensuring that identified genes were associated specifically with disease-relevant phenotypes rather than broader, unrelated clinical features. Our findings demonstrate that even with this stricter phenotype, PERADIGM maintained robust performance and uncovered additional associated genes. This further supports the utility of phenotype-driven gene discovery for rare diseases. For ADPKD, PERADIGM identified seven such genes, expanding beyond PKD1 and PKD2. For Marfan syndrome, PERADIGM identified eight genes, significantly broadening the genetic landscape beyond FBN1. In the case of NF1, PERADIGM identified NF1 as the primary gene, but marginally significant findings suggest the potential to detect additional contributors. These results illustrate PERADIGM’s ability to incorporate extensive phenotype information, leading to the identification of a more comprehensive set of genes associated with disease-specific phenotypes.
The primary innovation of PERADIGM lies in its shift from traditional binary disease-gene associations to a similarity-based framework that integrates comprehensive phenotypic information. This approach uses NLP-based embedding models to cluster individuals based on shared phenotypic patterns rather than relying on simple case-control definitions. To address potential concerns of “double-dipping”, we emphasize that ICD-10 codes in PERADIGM serve distinct roles in disease definition and phenotype similarity scoring. The target disease codes are used solely for cohort identification, while the similarity analysis leverages comprehensive EHR embeddings that capture each individual’s overall phenotypic profile. This design ensures that similarity reflects broad health patterns rather than redundant use of disease-defining information, minimizing potential information leakage. Beyond identifying genes associated with disease-relevant phenotypes, this similarity-based method can also refine disease cohort definitions and characterize individuals with distinct phenotypic profiles. In cases where patient groups are too small for traditional analyses, our method enables the identification of novel phenotype clusters, allowing for more robust investigations of disease mechanisms. By focusing on phenotypic similarity rather than individual variant effects, PERADIGM provides a powerful tool for uncovering genetic associations in rare diseases. Additionally, PERADIGM’s ability to identify genes relevant to related but distinct diseases with phenotypic overlap underscores its potential beyond rare diseases. This framework could be applied to complex traits where multiple conditions share genetic and clinical features, facilitating the discovery of novel disease subtypes and genetic modifiers.
Furthermore, PERADIGM can help infer relationships between genes and phenotypes by analyzing carriers’ phenotypic profiles, complementing traditional pathway-based analyses. Future improvements to the framework could involve transitioning from static embedding models like Word2Vec to context-aware models such as BERT-based architectures or large language models (LLMs). These advanced models could capture more nuanced longitudinal relationships from sequential ICD-10 code records, providing a richer representation of individual phenotypic trajectories. The success of PERADIGM highlights how integrating phenotype similarity-based embeddings into genetic research can advance our understanding of rare diseases, improve diagnostic precision, inform targeted therapies, and ultimately enhance patient outcomes.
We note some limitations of PERADIGM that can be addressed in future research. First, PERADIGM has so far been applied only to individuals of European ancestry from the UK Biobank to minimize population heterogeneity and confounding. Extending the framework to multi-ancestry datasets is an important next step toward improving generalizability and equity in rare-variant discovery. A practical extension would involve ancestry-stratified analyses followed by meta-analysis across groups, allowing ancestry-specific calibration of phenotype embeddings and carrier frequencies. Although current non-European sample sizes in the UK Biobank limit such analyses, future applications in larger and more diverse biobanks could incorporate ancestry-specific weighting or frequency-based scaling to better capture shared and ancestry-unique genetic architectures. Second, we have focused on rare LoF variants because they have clear biological interpretation and consistent definitions across annotation tools such as ANNOVAR, VEP, and FAVOR [49]. LoF variants typically exert the strongest effect sizes by directly disrupting protein function, making them ideal for initial validation of the PERADIGM framework. In contrast, deleterious missense variants are more common, have smaller and less predictable effects, and their annotation varies considerably across prediction models like CADD, REVEL, or SIFT, which may dilute true pathogenic signals. Nonetheless, expanding PERADIGM to incorporate additional functional annotations represents an important future direction. By integrating variant-type–specific weighting and comprehensive annotation resources such as FAVOR, PERADIGM could capture broader functional variation and enhance power while maintaining biological interpretability. Third, our study has relied on ICD-10 data from the UK Biobank. Coding practices, healthcare systems, and EHR structures differ across institutions and countries, so embeddings trained on the UK Biobank data may not directly generalize to other populations. Future extensions of PERADIGM will focus on improving cross-cohort transferability by incorporating multi-biobank data and applying transfer-learning or fine-tuning strategies to recalibrate embeddings for new healthcare systems. Importantly, because PERADIGM operates on structured ICD-10 codes, it remains inherently portable once appropriate recalibration is performed.
Materials and methods
UK Biobank 200K dataset and genotype data quality control
We utilized the UK Biobank 200K dataset [4,50] in our analysis. To mitigate potential confounding effects due to population stratification, we restricted our analysis to individuals of European ancestry. We applied several filtering criteria to ensure data quality and completeness. First, we excluded individuals who lacked hospital inpatient data or whole-exome sequencing data. Second, to eliminate potential biases due to sample relatedness, we removed related individuals from the UK Biobank dataset: we used the kinship coefficients provided by UK Biobank to identify all pairs of related participants and excluded individuals with first- or second-degree relationships. Within each related family, one individual was randomly retained to ensure that the final analysis cohort consisted entirely of unrelated samples. After applying these filters, our final study cohort comprised 121,330 individuals of European ancestry. This refined dataset ensures a more homogeneous population with comprehensive phenotypic and genetic information, thereby enhancing the reliability and interpretability of our subsequent analyses.
To ensure the quality and reliability of the rare variant analysis, we implemented stringent quality control (QC) measures on the whole-exome sequencing (WES) data using PLINK [51]. Our QC protocol comprised several steps. We first removed variants with a minor allele frequency (MAF) exceeding 0.01, thereby retaining only rare variants for subsequent analysis. We then excluded variants that significantly deviated from the Hardy-Weinberg equilibrium (HWE), with a threshold of p < 1e-6. A significant departure from HWE can indicate genotyping errors or population stratification. Furthermore, we removed samples with missing sex information. Following these QC steps, we identified all rare predicted loss of function (pLoF) variants. These included stop-gain, stop-loss, frameshift insertion, frameshift deletion, and essential splice variants. This allowed us to focus on potentially impactful genetic alterations [52]. To reduce the impact of somatic mutations, we added a filtering step. Although UK Biobank WES data are derived from blood samples, prior studies [53] have shown that some rare variants may reflect somatic rather than germline mutations. Since somatic variants are more frequent in older individuals, we compared the age distributions of rare LoF variant carriers and non-carriers for each gene using the Wilcoxon rank-sum test, followed by Bonferroni correction. Genes whose carriers were significantly older were flagged as potential CHIP-related signals. Three genes, ASXL1, TET2, and DNMT3A, met this criterion and were removed from our genome-wide analyses. This filtering step enhances the robustness of our results by minimizing confounding from age-related somatic mutations and comorbidity-driven phenotypic similarity.
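The variant-level filters described above can be illustrated with a minimal sketch. It is not the PLINK workflow itself: the record fields (`maf`, `hwe_p`, `consequence`) and the consequence labels are hypothetical names chosen for illustration, and only the thresholds (MAF of 0.01, HWE p < 1e-6) come from the text.

```python
# Hypothetical variant records; field names and labels are illustrative only.
PLOF_CONSEQUENCES = {
    "stop_gain", "stop_loss", "frameshift_insertion",
    "frameshift_deletion", "essential_splice",
}

def passes_qc(variant, max_maf=0.01, hwe_threshold=1e-6):
    """Keep rare variants (MAF <= max_maf) that do not deviate from HWE."""
    return variant["maf"] <= max_maf and variant["hwe_p"] >= hwe_threshold

def select_rare_plof(variants):
    """Apply the QC filters, then keep only predicted loss-of-function variants."""
    return [
        v for v in variants
        if passes_qc(v) and v["consequence"] in PLOF_CONSEQUENCES
    ]

variants = [
    {"id": "v1", "maf": 0.0004, "hwe_p": 0.31, "consequence": "stop_gain"},
    {"id": "v2", "maf": 0.0500, "hwe_p": 0.48, "consequence": "frameshift_deletion"},  # too common
    {"id": "v3", "maf": 0.0010, "hwe_p": 1e-9, "consequence": "essential_splice"},     # fails HWE
    {"id": "v4", "maf": 0.0020, "hwe_p": 0.77, "consequence": "missense"},             # not pLoF
]
kept_ids = [v["id"] for v in select_rare_plof(variants)]
```

In this toy input only `v1` survives all three conditions. The sample-level steps (missing sex, the age-based somatic filter) are not covered by this sketch.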
Embedding model
Embedding learning and semantic matching have a rich history in natural language processing. Mikolov et al. introduced the continuous Bag-of-Words (CBOW) and Skip-gram models, collectively known as Word2Vec, to represent words in a vector space [22]. These neural network-based models achieve state-of-the-art performance in measuring syntactic and semantic word similarity. Many studies have leveraged Word2Vec to embed medical concepts for patients using Electronic Health Record (EHR) corpora, directly utilizing textual medical records as input.
In our embedding analysis, we applied the CBOW Word2Vec model to embed the ICD-10 phenotype data from the UK Biobank. The CBOW model is an architecture used in word embedding where the surrounding context words are used to predict a target word. In CBOW, a window of context words around a target word is input to the model, and the goal is to predict the central (target) word from these context words. The model optimizes weights to maximize the probability of predicting the correct target word based on the given context. This approach is particularly well-suited for embedding ICD-10 codes based on the surrounding diagnosis records of individual patients. It effectively captures the contextual relationships between phenotypes, enabling the representation of medical concepts in a dense, continuous vector space. By leveraging these contextual relationships, the model can potentially uncover latent patterns and associations within the medical data that might not be immediately apparent through traditional analysis methods.
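To make the CBOW setup concrete, the sketch below builds the (context, target) training pairs that such a model would see for one sentence. It shows only the windowing step, not the neural network or its optimization; the function name `cbow_pairs` and the window size are assumptions for illustration rather than the settings used in this study.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        context = left + right
        if context:                              # skip one-word sentences
            pairs.append((context, target))
    return pairs

sentence = ["polycystic", "kidney", "adult", "type"]
pairs = cbow_pairs(sentence, window=1)
```

With a window of 1, the target "kidney" is predicted from the context ["polycystic", "adult"]; a trained model adjusts the word vectors so that such predictions become more probable.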
Embedding for ICD-10 codes
We aim to map phenotypes onto a static high-dimensional space, enabling numerical representation of their characteristics and facilitating similarity measurements between different phenotypes. We employed the Word2Vec embedding model to obtain static embedding vectors for the phenotypes. In the UK Biobank database, hospital inpatient data are recorded in the form of ICD-10 codes, each representing a specific disease. For instance, Q61.2 denotes “Polycystic kidney, adult type”. Consequently, each patient’s record yields a sequence of ICD-10 codes, comprehensively describing their longitudinal inpatient condition. Based on this information, we developed the following approach to embedding the ICD-10 codes. For each ICD-10 code, we extracted keywords from its description, excluding punctuation and English stopwords. This process generated a vector of words for each patient based on their ICD-10 descriptions, with each word serving as a token in the training dataset. Each patient’s ICD-10 description was treated as a sentence. The output comprised embedding vectors for each word appearing in the inpatient dataset. We then averaged the word embeddings of each code’s description to derive the embedding vector for that ICD-10 code. This approach not only captured sequential information between different description tokens for each ICD-10 code but also leveraged detailed word tokens when two ICD-10 codes contain similar words.
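The code-level averaging step can be sketched as follows. The word vectors here are toy two-dimensional values rather than trained Word2Vec output, the stopword list is a small illustrative subset, and the helper names (`tokenize`, `embed_code`) are hypothetical.

```python
import string

STOPWORDS = {"of", "and", "the", "with", "in", "to", "or"}  # illustrative subset

def tokenize(description):
    """Lowercase, strip punctuation, and drop stopwords."""
    table = str.maketrans("", "", string.punctuation)
    return [w for w in description.lower().translate(table).split()
            if w not in STOPWORDS]

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def embed_code(description, word_vectors):
    """Average the vectors of the description words that have an embedding."""
    tokens = [t for t in tokenize(description) if t in word_vectors]
    return mean_vector([word_vectors[t] for t in tokens])

word_vectors = {  # toy 2-d vectors standing in for trained embeddings
    "polycystic": [1.0, 0.0],
    "kidney": [0.0, 1.0],
    "adult": [1.0, 1.0],
    "type": [0.0, 0.0],
}
q612 = embed_code("Polycystic kidney, adult type", word_vectors)
```

Two codes whose descriptions share words (for example, two kidney-related codes) end up with nearby vectors because the shared word vectors enter both averages.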
Embedding for individual using ICD-10 codes
After obtaining the embedding vector for each ICD-10 code in the UK Biobank data, we proceeded to embed individuals using a weighted average embedding approach. This method refines simple averaging of ICD-10 code embedding vectors by focusing only on less common ICD-10 codes (those whose frequency falls below a cutoff) and assigning differential weights to these codes. We assigned weights by considering two pieces of information for an ICD-10 code: (1) the significance level of each code with respect to the target disease, a process detailed below, and (2) the information content of each ICD-10 code, derived from its frequency.
For risk gene mapping, we focus on a specific disease of interest. Different phenotypes have varying degrees of associations with the target disease. Consequently, simply averaging all available ICD-10 embeddings to represent an individual’s overall phenotype embedding may be less effective to capture its relevance to the target disease. A more informative approach to characterizing an individual’s phenotypic profile in relation to a target disease involves assigning differential weights to distinct ICD-10 codes, with higher weights indicating a stronger relationship with the target disease. Target disease relevance is captured through the logistic regression model:
where cij denotes the i-th ICD-10 code status of individual j, dj represents the target disease status of individual j, and zj is the vector of covariates, specifically age and sex. We conducted a comprehensive scan of all ICD-10 codes in the UK Biobank inpatient dataset for the disease of interest, calculating a p-value for each ICD-10 code. Subsequently, we utilized these p-values to derive the weight for each ICD-10 code. By employing this weighting scheme, ICD-10 codes exhibiting a more significant association with the disease of interest are assigned larger weights, thereby playing a more prominent role in the individual embedding process. We also considered each ICD-10 code’s prevalence, denoted as rk, in the embedding for each individual.
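The two expressions referred to in this paragraph can be sketched as below. Both are assumptions reconstructed from the surrounding definitions: the regression is written with the code status as the response and the disease status as the predictor of interest (the roles could be swapped without changing the idea), the p-value for code i is taken to come from the test of β_1i = 0, and the information content is taken to be the negative log of the prevalence, which is the usual form but is not confirmed by the text. The coefficient symbols β and γ are introduced here for illustration.

```latex
\begin{align*}
\operatorname{logit} P(c_{ij} = 1) &= \beta_{0i} + \beta_{1i}\, d_j + \boldsymbol{\gamma}_i^{\top} z_j, \\
\mathrm{IC}_k &= -\log r_k .
\end{align*}
```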
This formulation of information content assigns higher values to rarer phenotypes, reflecting our assumption that rarer phenotypes carry more information about rare diseases.
The weight wk for the k-th ICD-10 code is then calculated as:
where the p-value represents the significance level of the k-th ICD-10 code in relation to the disease of interest. We use a sigmoid function to map the p-value onto a [0.5, 1] scale, with greater weights indicating phenotypes more relevant to the disease of interest.
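A minimal sketch of how such a weight could be computed is given below. The exact formula is an assumption: the sigmoid is applied to -log10(p), which yields 0.5 at p = 1 and approaches 1 as p shrinks (matching the [0.5, 1] range stated above), and the result is multiplied by an information content of -log(r_k). The function names, the log base, and the product form are illustrative choices, not the confirmed definition.

```python
import math

def information_content(prevalence):
    """Assumed form: -log(r_k), so rarer codes receive larger values."""
    return -math.log(prevalence)

def relevance(p_value):
    """Sigmoid of -log10(p): 0.5 at p = 1, approaching 1 as p -> 0."""
    p_value = max(p_value, 1e-300)   # guard against log10(0)
    x = -math.log10(p_value)
    return 1.0 / (1.0 + math.exp(-x))

def code_weight(p_value, prevalence):
    """Assumed combination: disease relevance times information content."""
    return relevance(p_value) * information_content(prevalence)

# A strongly associated, rare code versus an unassociated, common one.
w_rare_relevant = code_weight(p_value=1e-8, prevalence=0.001)
w_common_null = code_weight(p_value=0.9, prevalence=0.05)
```

Under these assumptions the first code receives the larger weight on both counts: its p-value pushes the sigmoid toward 1, and its low prevalence raises the information content.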
The rationale behind this weighted average embedding is twofold. First, it aims to capture the differential importance of various phenotypes in relation to specific diseases, acknowledging that different phenotypes contribute distinctly to each disease. Second, it assigns more weight to rare ICD-10 codes to prevent common ICD-10 codes from overwhelming the information provided by less frequent, but potentially more informative, codes. This approach is based on the assumption that for rare diseases, the rare phenotypes each individual exhibits will carry more information about the disease. Consequently, this method ensures a more comprehensive and disease-specific representation of each individual’s phenotypic profile, potentially enhancing the detection of subtle disease associations. With such defined weights, the embedding vector for an individual i is calculated as:
where Di is the embedding vector for individual i, ni is the total number of less common ICD-10 codes (those whose frequency falls below the cutoff) recorded for individual i, ek is the embedding vector of the k-th ICD-10 code, and wk is the weight of the k-th ICD-10 code. This equation shows how we represent an individual using weighted ICD-10 code embeddings.
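The weighted average can be sketched as follows, assuming the form D_i = (1/n_i) Σ_k w_k e_k implied by the symbol definitions. The code vectors and weights are toy values, the function name `embed_individual` is hypothetical, and codes absent from the dictionaries stand in for the common codes that are excluded.

```python
def embed_individual(codes, code_vectors, code_weights):
    """Weighted mean of code embeddings: D_i = (1/n_i) * sum_k w_k * e_k."""
    used = [c for c in codes if c in code_vectors and c in code_weights]
    if not used:
        return None                       # no eligible codes for this person
    dim = len(code_vectors[used[0]])
    total = [0.0] * dim
    for code in used:
        w = code_weights[code]
        for d in range(dim):
            total[d] += w * code_vectors[code][d]
    return [t / len(used) for t in total]

code_vectors = {"Q61.2": [1.0, 0.0], "N18.5": [0.0, 1.0]}   # toy embeddings
code_weights = {"Q61.2": 4.0, "N18.5": 2.0}                 # toy weights
# "I10" has no entry above and is treated here as an excluded common code.
d_i = embed_individual(["Q61.2", "N18.5", "I10"], code_vectors, code_weights)
```

Because cosine similarity is invariant to positive rescaling of a vector, dividing by n_i rather than by the sum of the weights would not change the similarity scores computed from these embeddings.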
Risk gene mapping
Based on the phenotype embedding of each individual, we assign each gene a risk score for a disease of interest by examining the similarity between the phenotype embeddings of rare loss-of-function (LoF) variant carriers and those of individuals with the target disease. A larger risk score indicates a higher likelihood that the candidate gene is associated with the disease of interest. Unlike gene prioritization tools based on pathways or other biological criteria, we calculated the risk score solely from phenotype similarities. This is accomplished in four steps. First, we extracted all rare LoF variant carriers for each candidate gene, along with their genotype and phenotype information. Second, we extracted all individuals diagnosed with the target disease and their phenotype information. Third, we used the method described in the previous section to obtain embedding representations of the LoF carriers and the disease patients. Fourth, we calculated the disease risk score by comparing, at the individual level, the phenotype similarity between each gene’s LoF carriers and the diseased individuals.
Our underlying assumption is that greater similarity between disease patients and a gene’s LoF carriers at the phenotype embedding level indicates higher pathogenicity of the gene for the disease of interest. This rests on the hypothesis that rare LoF variants have larger effects on the phenotype than common variants. Notably, we considered not only binary disease status but all phenotype information related to the disease, which gives a more complete picture of each individual. Finally, we ranked genes by their risk scores, which are computed from the average similarity score.
$$S_i = \frac{1}{m} \sum_{j=1}^{m} \cos(D_i, D_j)$$

where $S_i$ is the average similarity score for individual $i$ compared with the disease patient group, $m$ is the number of patients in the comparison group for the disease of interest, $D$ denotes the embedding vector of an individual, and $\cos(\cdot,\cdot)$ is cosine similarity, which is commonly used to measure the similarity between two embedding vectors. Based on the similarity score of each LoF variant carrier, we can calculate the overall score for each candidate gene as

$$R_i = \frac{1}{n_i} \sum_{j=1}^{n_i} S_j$$

where $R_i$ is the risk score of gene $i$ for the disease of interest, the sum runs over the LoF carriers of gene $i$, and $n_i$ is the number of LoF carriers for gene $i$.
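A minimal Python sketch of the two averaging steps may help make the scores concrete. It assumes embeddings are plain lists of floats; the toy vectors, the function names, and the convention of returning 0 for an all-zero vector are illustrative choices rather than part of the described method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)


def individual_score(embedding, patient_embeddings):
    """Average cosine similarity between one individual and the m disease patients."""
    sims = [cosine_similarity(embedding, p) for p in patient_embeddings]
    return sum(sims) / len(sims)


def gene_risk_score(carrier_embeddings, patient_embeddings):
    """Average of the individual scores over a gene's LoF carriers."""
    scores = [individual_score(e, patient_embeddings) for e in carrier_embeddings]
    return sum(scores) / len(scores)


# Hypothetical two-dimensional embeddings.
patients = [[1.0, 0.0], [0.8, 0.6]]
carriers_gene_a = [[1.0, 0.1], [0.9, 0.3]]   # phenotypically close to the patients
carriers_gene_b = [[0.0, 1.0], [-1.0, 0.2]]  # phenotypically distant

score_a = gene_risk_score(carriers_gene_a, patients)
score_b = gene_risk_score(carriers_gene_b, patients)
```

Here the carriers of the first hypothetical gene point in nearly the same direction as the patients, so that gene receives the higher risk score.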
For each observed risk score, we used random sampling to determine its significance. Under the null hypothesis ($H_0$), we derived an empirical distribution of the risk score for the target gene by randomly selecting $n_i$ individuals from the UK Biobank 200K inpatient dataset and computing the risk score for that group. This process was repeated 10,000 times to construct the empirical risk score distribution for gene $i$ under the null hypothesis for the target disease.
The estimated p-value for the observed risk score is derived by calculating the Z-score using the formula:

$$Z = \frac{R_{\text{obs}} - \bar{R}}{\sigma}$$

where $R_{\text{obs}}$ is the observed risk score, $\bar{R}$ is the average of the risk scores from randomly sampled sets of individuals, and $\sigma$ is the standard deviation of the simulated risk scores. The p-value is then computed from the standard normal distribution. We applied a one-tailed test at a significance level of 0.05, with Bonferroni correction for multiple testing, to assess whether the gene’s risk score is significantly higher than expected by chance.
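The random-sampling test can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: each individual is represented by a precomputed similarity score, the simulated population and the number of tested genes (18,000) are hypothetical, and the example uses fewer draws than the 10,000 described above to keep it quick.

```python
import random
from statistics import NormalDist, mean, stdev


def empirical_z_test(observed, all_scores, n_carriers, n_draws=10_000, seed=0):
    """Z-score and one-tailed p-value of an observed gene risk score.

    The null distribution is built by repeatedly drawing n_carriers individuals
    at random and averaging their similarity scores.
    """
    rng = random.Random(seed)
    null_scores = [mean(rng.sample(all_scores, n_carriers)) for _ in range(n_draws)]
    mu, sigma = mean(null_scores), stdev(null_scores)
    z = (observed - mu) / sigma
    # Upper-tail probability under the standard normal distribution.
    p_value = NormalDist().cdf(-z)
    return z, p_value


def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """True if the p-value passes the Bonferroni-corrected threshold."""
    return p_value < alpha / n_tests


# Hypothetical population of individual similarity scores.
rng = random.Random(1)
population = [rng.gauss(0.3, 0.1) for _ in range(5000)]

# A gene whose carriers look like everyone else, and one whose carriers do not.
z_mid, p_mid = empirical_z_test(mean(population), population, n_carriers=20, n_draws=2000)
z_hit, p_hit = empirical_z_test(mean(population) + 0.2, population, n_carriers=20, n_draws=2000)
```

Using the lower tail at `-z` instead of one minus the upper tail avoids losing precision for very large Z-scores, which matters once a Bonferroni threshold over many genes is applied.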
Intra-group similarity calculation
To elucidate the overall phenotype patterns within the disease group and among carriers of rare LoF variants of a candidate gene, we calculated the intra-group phenotype similarity in the embedding space. This analysis aimed to explore whether individuals within these groups share greater phenotypic similarity compared to randomly selected individuals from the UK Biobank.
First, we calculated similarity scores for all pairs of individuals within the target group. We then randomly selected an equal number of individuals from the UK Biobank WES 200K dataset, excluding the target group, and calculated their pairwise similarity scores.
A difference in the similarity score box plot of the two groups would suggest that the given group exhibits a distinct intra-group phenotype pattern compared to the overall dataset. For example, this analysis could reveal whether carriers of rare LoF variants in the PKD1 gene demonstrate greater phenotypic similarity to each other than a randomly selected control group from the UK Biobank.
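The comparison described above can be sketched in a few lines of Python. The embeddings below are hypothetical two-dimensional vectors, and the background list is assumed to already exclude the target group; the sketch only shows how the two sets of pairwise scores would be produced before plotting.

```python
import math
import random
from itertools import combinations
from statistics import median


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pairwise_similarities(embeddings):
    """Similarity score for every unordered pair within a group."""
    return [cosine_similarity(u, v) for u, v in combinations(embeddings, 2)]


def intra_group_comparison(target, background, seed=0):
    """Pairwise scores for the target group and for an equally sized random group."""
    rng = random.Random(seed)
    control = rng.sample(background, len(target))
    return pairwise_similarities(target), pairwise_similarities(control)


# Hypothetical data: a tight target cluster and a background spread over all directions.
target = [[1.0, 0.05 * k] for k in range(6)]
background = [
    [math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40)] for k in range(40)
]

target_sims, control_sims = intra_group_comparison(target, background)
```

A group of size n yields n(n-1)/2 pairwise scores, so the two lists have equal length and can be compared directly, for example by their medians or in side-by-side box plots.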
Disease cohort definition
For each target disease, we defined the case cohort based on the corresponding International Classification of Diseases, 10th Revision (ICD-10) codes recorded in the UK Biobank electronic health records. Individuals with at least one record of the disease-specific ICD-10 code were included as cases, while all other participants were considered part of the background control population. This definition ensures that cohort membership is determined directly from standardized diagnostic codes without requiring additional phenotype modeling or manual curation. Specifically, autosomal dominant polycystic kidney disease (ADPKD) cases were identified using ICD-10 codes Q61.2 and Q61.3; Marfan syndrome cases were identified using Q87.4; and neurofibromatosis type 1 (NF1) cases were identified using Q85.0. For each disease, individuals carrying any of these codes were assigned to the corresponding disease cohort. Control individuals were drawn from the remaining participants who did not possess the disease-specific ICD-10 codes. All disease and control cohorts were restricted to unrelated individuals of European ancestry, as defined by UK Biobank-provided kinship coefficients and ancestry principal components, to minimize population structure and relatedness confounding.
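The cohort rule can be written down compactly. The following Python sketch assumes diagnoses are available as a mapping from participant identifier to a set of ICD-10 codes written with dots, as in the text; real record formats may differ, and the participant identifiers and the eligibility set (unrelated individuals of European ancestry) are hypothetical inputs.

```python
# Disease-specific ICD-10 codes as listed in the text.
DISEASE_CODES = {
    "ADPKD": {"Q61.2", "Q61.3"},
    "Marfan syndrome": {"Q87.4"},
    "NF1": {"Q85.0"},
}


def define_cohorts(diagnoses, eligible_ids, disease):
    """Split eligible participants into cases and background controls.

    diagnoses: dict mapping participant id -> set of recorded ICD-10 codes.
    eligible_ids: ids that passed the relatedness and ancestry filters.
    A participant is a case if at least one disease-specific code is recorded.
    """
    codes = DISEASE_CODES[disease]
    cases = {pid for pid in eligible_ids if diagnoses.get(pid, set()) & codes}
    controls = set(eligible_ids) - cases
    return cases, controls


# Hypothetical records; participant 5 is excluded by the eligibility filter.
diagnoses = {
    1: {"Q61.2", "I10"},
    2: {"I10"},
    3: {"Q87.4"},
    4: {"Q61.3"},
    5: {"Q61.2"},
}
eligible = {1, 2, 3, 4}

adpkd_cases, adpkd_controls = define_cohorts(diagnoses, eligible, "ADPKD")
marfan_cases, marfan_controls = define_cohorts(diagnoses, eligible, "Marfan syndrome")
```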
Type I error rate control simulation
Assessing the type I error rate of PERADIGM is essential for establishing the reliability of our framework. We conducted simulations to evaluate calibration under the null hypothesis. For each disease (ADPKD, Marfan syndrome, and NF1), we used the observed distribution of target-disease similarity scores to generate simulated datasets. In each replicate, individuals were randomly assigned to synthetic “gene carrier groups” that matched the observed carrier-count distribution of real genes (denoted $N_g$). A pseudo gene-level risk score was then calculated as the average similarity score across individuals within each simulated group, and the same random-sampling association test used in the real analysis was applied to obtain null p-values. We assessed type I error control through four complementary analyses: (1) QQ plots were drawn to visually inspect inflation or deflation in the null p-value distribution; (2) empirical type I error rates were computed as the proportion of null p-values below nominal thresholds (e.g., 0.05 and 0.01); (3) Kolmogorov–Smirnov (KS) tests were performed to evaluate deviations from the expected uniform distribution; and (4) the genomic inflation factor (λ) was calculated to measure systematic inflation or deflation, with values near 1 indicating proper calibration. All simulation results are provided in S1 Text, S2 Table, and S1 Fig.
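Three of these calibration diagnostics are simple enough to sketch with the Python standard library. The sketch below is illustrative only: it computes the KS distance to the uniform distribution rather than a KS p-value, defines λ in the usual way as the ratio of the median observed to the median expected one-degree-of-freedom chi-square statistic, and uses an evenly spaced grid of p-values as a stand-in for well-calibrated null results. Whether the paper's λ uses exactly this definition is an assumption.

```python
from statistics import NormalDist, median


def empirical_type1_error(p_values, alpha):
    """Proportion of null p-values below the nominal threshold alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def ks_statistic_uniform(p_values):
    """Kolmogorov-Smirnov distance between the empirical CDF and Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def genomic_inflation(p_values):
    """Median observed chi-square(1) statistic over its expected median (about 0.455).

    Each p-value is converted to a chi-square(1) statistic via the normal
    quantile; p-values must be strictly greater than 0.
    """
    nd = NormalDist()
    chi2 = [nd.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    expected_median = nd.inv_cdf(0.75) ** 2
    return median(chi2) / expected_median


# Idealized null p-values on an even grid, and an inflated set for contrast.
n = 1000
null_p = [(i + 0.5) / n for i in range(n)]
inflated_p = [p * p for p in null_p]

error_rate = empirical_type1_error(null_p, 0.05)
ks_null = ks_statistic_uniform(null_p)
lambda_null = genomic_inflation(null_p)
lambda_inflated = genomic_inflation(inflated_p)
```

For the idealized grid, the error rate matches the nominal threshold and λ is close to 1, whereas squaring the p-values shifts them toward zero and pushes λ well above 1.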
Alternative weighting schemes analysis
To evaluate the impact of weighting schemes on individual embeddings, we systematically assessed how different weighting strategies influence the PERADIGM framework. We compared our original weighting scheme with four alternative strategies, including one based on information content alone. Together with the original approach, these five schemes incorporate varying combinations of disease-association strength (via p-values or effect sizes) and phenotype information content (IC). The additional schemes were designed to disentangle the relative contribution of association strength and information content to the weighting mechanism in PERADIGM. For each scheme, we generated corresponding ICD-10 code weights and evaluated pairwise consistency among them by computing Pearson correlation coefficients across all weighting matrices. This analysis allowed us to assess whether the different formulations capture similar disease-relevant information or emphasize distinct aspects of the phenotype space. In addition, we re-ran the PERADIGM pipeline using each weighting scheme to examine how weighting choice influences downstream gene-level results and overall calibration. The consistency of significant findings across these weighting variants provides an indication of PERADIGM’s robustness to the specific choice of weighting formulation. All results of the alternative weighting scheme analysis are provided in S1 Text, S2 Fig, and S3 Fig.
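The pairwise consistency check amounts to a correlation matrix over per-code weights. The Python sketch below shows the computation for three hypothetical schemes; the scheme names and weight values are invented for illustration and do not correspond to the actual formulas compared in the paper.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def scheme_correlations(weights_by_scheme):
    """Pearson correlation for every pair of weighting schemes.

    weights_by_scheme: dict mapping scheme name -> list of weights, with the
    ICD-10 codes in the same order for every scheme.
    """
    names = sorted(weights_by_scheme)
    return {
        (a, b): pearson(weights_by_scheme[a], weights_by_scheme[b])
        for a, b in combinations(names, 2)
    }


# Hypothetical weights for five ICD-10 codes under three made-up schemes.
weights_by_scheme = {
    "scheme_a": [7.6, 6.2, 0.9, 3.1, 2.4],
    "scheme_b": [7.1, 6.0, 1.2, 2.8, 2.5],
    "ic_only": [2.0, 6.5, 5.9, 1.1, 4.0],
}

correlations = scheme_correlations(weights_by_scheme)
```

The resulting dictionary can be arranged into a symmetric matrix and drawn as a heatmap to compare schemes at a glance.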
Supporting information
S1 Text. Supplementary materials.
Additional methodological details and extended statistical analyses.
https://doi.org/10.1371/journal.pgen.1011976.s001
(PDF)
S1 Table. Summary of variants before and after quality control and LoF annotation.
Each row corresponds to a chromosome and reports the number of raw genotype variants, the number of variants retained after quality control, and the number of annotated loss-of-function (LoF) variants. Totals across chromosomes are shown in the final row.
https://doi.org/10.1371/journal.pgen.1011976.s002
(PDF)
S2 Table. Summary of Type I error rate control simulation results.
Each row reports the genomic inflation factors (including λ), Kolmogorov–Smirnov (KS) test p-values, and empirical type I error rates at two nominal significance thresholds, with 95% binomial confidence intervals in parentheses.
https://doi.org/10.1371/journal.pgen.1011976.s003
(PDF)
S3 Table. Summary gene-level results from PERADIGM across three diseases.
This file contains the full gene-level association results for ADPKD, Marfan syndrome, and NF1.
https://doi.org/10.1371/journal.pgen.1011976.s004
(XLSX)
S1 Fig. QQ plot of type I error rate simulation.
QQ plots of simulated null p-values for ADPKD, NF1, and Marfan syndrome show well-calibrated type I error control, with observed values closely following the expected null distribution.
https://doi.org/10.1371/journal.pgen.1011976.s005
(TIF)
S2 Fig. Correlation of alternative weighting schemes.
A heatmap showing pairwise Pearson correlation coefficients among the five weighting schemes used in PERADIGM. All schemes except the IC-only method exhibit high correlations, indicating consistent information patterns across ICD-10 codes.
https://doi.org/10.1371/journal.pgen.1011976.s006
(TIF)
S3 Fig. Robustness of PERADIGM under alternative weighting schemes.
QQ plots of gene-level p-values across different weighting strategies demonstrate that, except for the IC-only weighting, all methods yield nearly identical results, confirming the robustness of PERADIGM to the choice of weighting scheme.
https://doi.org/10.1371/journal.pgen.1011976.s007
(TIF)
S4 Fig. Comparison of disease and non-disease group risk scores.
A. Histogram of gene risk scores for individuals diagnosed with ADPKD and controls. B. Histogram of gene risk scores for individuals diagnosed with Marfan syndrome and controls. C. Histogram of gene risk scores for individuals diagnosed with NF1 and controls.
https://doi.org/10.1371/journal.pgen.1011976.s008
(TIF)
S5 Fig. Histogram of intra-group similarity score between diseases and gene rare LoF carriers.
A. Histogram of pairwise similarity score analysis for ADPKD. B. Histogram of pairwise similarity score analysis for Marfan syndrome. C. Histogram of pairwise similarity score analysis for NF1.
https://doi.org/10.1371/journal.pgen.1011976.s009
(TIF)
S6 Fig. Violin plot of risk scores by groups.
A. Distribution of risk scores for individuals carrying rare LoF variants in genes significantly associated with ADPKD. The “All” group represents the risk score distribution of all individuals in the dataset with respect to ADPKD-specific phenotypes. B. Distribution of risk scores for individuals carrying rare LoF variants in genes significantly associated with Marfan syndrome. The “All” group reflects the risk score distribution for the entire cohort based on Marfan syndrome-specific phenotypes. C. Distribution of risk scores for individuals carrying rare LoF variants in genes significantly associated with NF1 disease. The “All” group shows the risk score distribution across all individuals with respect to NF1-specific phenotypes.
https://doi.org/10.1371/journal.pgen.1011976.s010
(TIF)
Acknowledgments
This research was conducted using data from the UK Biobank Resource (Application ID: 29900). We gratefully acknowledge the UK Biobank participants, whose generous contributions made this study possible.
References
- 1. Halldorsson BV, Eggertsson HP, Moore KHS, Hauswedell H, Eiriksson O, Ulfarsson MO, et al. The sequences of 150,119 genomes in the UK Biobank. Nature. 2022;607(7920):732–40. pmid:35859178
- 2. Wang Q, Dhindsa RS, Carss K, Harper AR, Nag A, Tachmazidou I, et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature. 2021;597(7877):527–32. pmid:34375979
- 3. Patrick MT, Bardhi R, Zhou W, Elder JT, Gudjonsson JE, Tsoi LC. Enhanced rare disease mapping for phenome-wide genetic association in the UK Biobank. Genome Med. 2022;14(1):85. pmid:35945607
- 4. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–9. pmid:30305743
- 5. Lee S, Abecasis GR, Boehnke M, Lin X. Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet. 2014;95(1):5–23. pmid:24995866
- 6. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93. pmid:21737059
- 7. Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13(4):762–75. pmid:22699862
- 8. Hateboer N, v Dijk MA, Bogdanova N, Coto E, Saggar-Malik AK, San Millan JL, et al. Comparison of phenotypes of polycystic kidney disease types 1 and 2. European PKD1-PKD2 Study Group. Lancet. 1999;353(9147):103–7. pmid:10023895
- 9. Dean JCS. Marfan syndrome: clinical diagnosis and management. Eur J Hum Genet. 2007;15(7):724–33. pmid:17487218
- 10. Rasmussen SA, Friedman JM. NF1 gene and neurofibromatosis 1. Am J Epidemiol. 2000;151(1):33–40. pmid:10625171
- 11. Groza T, Köhler S, Moldenhauer D, Vasilevsky N, Baynam G, Zemojtel T, et al. The human phenotype ontology: semantic unification of common and rare disease. Am J Hum Genet. 2015;97(1):111–24. pmid:26119816
- 12. Gorlov IP, Gorlova OY, Frazier ML, Spitz MR, Amos CI. Evolutionary evidence of the effect of rare variants on disease etiology. Clin Genet. 2011;79(3):199–206. pmid:20831747
- 13. Häyrinen K, Saranto K, Nykänen P. Definition, structure, content, use and impacts of electronic health records: a review of the research literature. Int J Med Inform. 2008;77(5):291–304. pmid:17951106
- 14. Anker SD, Morley JE, von Haehling S. Welcome to the ICD-10 code for sarcopenia. 2016.
- 15. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012;90(1):7–24. pmid:22243964
- 16. Denny JC, Ritchie MD, Basford MA, Pulley JM, Bastarache L, Brown-Gentry K, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics. 2010;26(9):1205–10. pmid:20335276
- 17. Guo MH, Plummer L, Chan Y-M, Hirschhorn JN, Lippincott MF. Burden testing of rare variants identified through exome sequencing via publicly available control data. Am J Hum Genet. 2018;103(4):522–34. pmid:30269813
- 18. Bastarache L, Hughey JJ, Hebbring S, Marlo J, Zhao W, Ho WT, et al. Phenotype risk scores identify patients with unrecognized Mendelian disease patterns. Science. 2018;359(6381):1233–9. pmid:29590070
- 19. Xu D, Wang C, Khan A, Shang N, He Z, Gordon A, et al. Quantitative disease risk scores from EHR with applications to clinical risk stratification and genetic studies. NPJ Digit Med. 2021;4(1):116. pmid:34302027
- 20. Detrois KE, Hartonen T, Teder-Laving M, Jermy B, Läll K, Yang Z, et al. Cross-biobank generalizability and accuracy of electronic health record-based predictors compared to polygenic scores. Nat Genet. 2025;57(9):2136–45. pmid:40866628
- 21. Zeng Z, Deng Y, Li X, Naumann T, Luo Y. Natural language processing for EHR-based computational phenotyping. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(1):139–53. pmid:29994486
- 22. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint 2013. https://arxiv.org/abs/1301.3781
- 23. Mukherjee S, McCaw ZR, Pei J, Merkoulovitch A, Soare T, Tandon R, et al. EmbedGEM: a framework to evaluate the utility of embeddings for genetic discovery. Bioinform Adv. 2024;4(1):vbae135. pmid:39664859
- 24. Yun T, Cosentino J, Behsaz B, McCaw ZR, Hill D, Luben R, et al. Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction. Nat Genet. 2024;56(8):1604–13. pmid:38977853
- 25. Zhai W, Huang X, Shen N, Zhu S. Phen2Disease: a phenotype-driven model for disease and gene prioritization by bidirectional maximum matching semantic similarities. Brief Bioinform. 2023;24(4):bbad172. pmid:37248747
- 26. Zhao M, Havrilla JM, Fang L, Chen Y, Peng J, Liu C, et al. Phen2Gene: rapid phenotype-driven gene prioritization for rare diseases. NAR Genom Bioinform. 2020;2(2):lqaa032. pmid:32500119
- 27. Deng Y, Gao L, Wang B, Guo X. HPOSim: an R package for phenotypic similarity measure and enrichment analysis based on the human phenotype ontology. PLoS One. 2015;10(2):e0115692. pmid:25664462
- 28. Gabow PA. Autosomal dominant polycystic kidney disease. N Engl J Med. 1993;329(5):332–42. pmid:8321262
- 29. Torres VE, Harris PC, Pirson Y. Autosomal dominant polycystic kidney disease. Lancet. 2007;369(9569):1287–301. pmid:17434405
- 30. Harris PC, Torres VE. Genetic mechanisms and signaling pathways in autosomal dominant polycystic kidney disease. J Clin Invest. 2014;124(6):2315–24. pmid:24892705
- 31. Senum SR, Li YSM, Benson KA, Joli G, Olinger E, Lavu S, et al. Monoallelic IFT140 pathogenic variants are an important cause of the autosomal dominant polycystic kidney-spectrum phenotype. Am J Hum Genet. 2022;109(1):136–56. pmid:34890546
- 32. Apple B, Sartori G, Moore B, Chintam K, Singh G, Anand PM, et al. Individuals heterozygous for ALG8 protein-truncating variants are at increased risk of a mild cystic kidney disease. Kidney Int. 2023;103(3):607–15. pmid:36574950
- 33. Besse W, Chang AR, Luo JZ, Triffo WJ, Moore BS, Gulati A, et al. ALG9 mutation carriers develop kidney and liver cysts. J Am Soc Nephrol. 2019;30(11):2091–102. pmid:31395617
- 34. Cornec-Le Gall E, Chebib FT, Madsen CD, Senum SR, Heyer CM, Lanpher BC, et al. The value of genetic testing in polycystic kidney diseases illustrated by a family with PKD2 and COL4A1 mutations. Am J Kidney Dis. 2018;72(2):302–8. pmid:29395486
- 35. Pyeritz RE. Marfan syndrome. Emery and Rimoin’s Principles and Practice of Medical Genetics and Genomics. Elsevier; 2025. p. 3–49. https://doi.org/10.1016/b978-0-12-812531-1.00004-x
- 36. Loeys BL, Dietz HC, Braverman AC, Callewaert BL, De Backer J, Devereux RB, et al. The revised Ghent nosology for the Marfan syndrome. J Med Genet. 2010;47(7):476–85. pmid:20591885
- 37. Marelli S, Micaglio E, Taurino J, Salvi P, Rurali E, Perrucci GL, et al. Marfan syndrome: enhanced diagnostic tools and follow-up management strategies. Diagnostics (Basel). 2023;13(13):2284. pmid:37443678
- 38. Singh KK, Rommel K, Mishra A, Karck M, Haverich A, Schmidtke J, et al. TGFBR1 and TGFBR2 mutations in patients with features of Marfan syndrome and Loeys-Dietz syndrome. Hum Mutat. 2006;27(8):770–7. pmid:16799921
- 39. Robinson PN, Booms P, Katzke S, Ladewig M, Neumann L, Palz M, et al. Mutations of FBN1 and genotype-phenotype correlations in Marfan syndrome and related fibrillinopathies. Hum Mutat. 2002;20(3):153–61. pmid:12203987
- 40. Sakai LY, Keene DR, Renard M, De Backer J. FBN1: The disease-causing gene for Marfan syndrome and other genetic disorders. Gene. 2016;591(1):279–91. pmid:27437668
- 41. Callewaert B, Malfait F, Loeys B, De Paepe A. Ehlers-Danlos syndromes and Marfan syndrome. Best Pract Res Clin Rheumatol. 2008;22(1):165–89. pmid:18328988
- 42. Mizuguchi T, Collod-Beroud G, Akiyama T, Abifadel M, Harada N, Morisaki T, et al. Heterozygous TGFBR2 mutations in Marfan syndrome. Nat Genet. 2004;36(8):855–60. pmid:15235604
- 43. Gutmann DH, Ferner RE, Listernick RH, Korf BR, Wolters PL, Johnson KJ. Neurofibromatosis type 1. Nat Rev Dis Primers. 2017;3:17004. pmid:28230061
- 44. Pasmant E, Vidaud M, Vidaud D, Wolkenstein P. Neurofibromatosis type 1: from genotype to phenotype. J Med Genet. 2012;49(8):483–9. pmid:22889851
- 45. Montilla A, Mata GP, Matute C, Domercq M. Contribution of P2X4 receptors to CNS function and pathophysiology. Int J Mol Sci. 2020;21(15):5562. pmid:32756482
- 46. Kuhrt LD, Motta E, Elmadany N, Weidling H, Fritsche-Guenther R, Efe IE, et al. Neurofibromin 1 mutations impair the function of human induced pluripotent stem cell-derived microglia. Dis Model Mech. 2023;16(12):dmm049861. pmid:37990867
- 47. Brems H, Chmara M, Sahbatou M, Denayer E, Taniguchi K, Kato R, et al. Germline loss-of-function mutations in SPRED1 cause a neurofibromatosis 1-like phenotype. Nat Genet. 2007;39(9):1120–6. pmid:17704776
- 48. Farncombe KM, Thain E, Barnett-Tapia C, Sadeghian H, Kim RH. LZTR1 molecular genetic overlap with clinical implications for Noonan syndrome and schwannomatosis. BMC Med Genomics. 2022;15(1):160. pmid:35840934
- 49. Zhou H, Arapoglou T, Li X, Li Z, Zheng X, Moore J, et al. FAVOR: functional annotation of variants online resource and annotator for variation across the human genome. Nucleic Acids Res. 2023;51(D1):D1300–11. pmid:36350676
- 50. Szustakowski JD, Balasubramanian S, Kvikstad E, Khalid S, Bronson PG, Sasson A, et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat Genet. 2021;53(7):942–8. pmid:34183854
- 51. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75. pmid:17701901
- 52. Xie Y, Acosta JN, Ye Y, Demarais ZS, Conlon CJ, Chen M, et al. Whole-exome sequencing analyses support a role of Vitamin D metabolism in ischemic stroke. Stroke. 2023;54(3):800–9. pmid:36762557
- 53. Vlasschaert C, Mack T, Heimlich JB, Niroula A, Uddin MM, Weinstock J, et al. A practical approach to curate clonal hematopoiesis of indeterminate potential in human genetic data sets. Blood. 2023;141(18):2214–23. pmid:36652671