Unique insights from ClinicalTrials.gov by mining protein mutations and RSids in addition to applying the Human Phenotype Ontology

  • Shray Alag


Researchers and clinicians face a significant challenge in keeping up to date with the rapid rate of new associations between genetic mutations and diseases. To address this problem, this research mined the ClinicalTrials.gov corpus to extract relevant biological insights, produce unique reports that summarize findings, and make the meta-data available via APIs. An automated text-analysis pipeline performed the following functions: parsing the files, extracting and analyzing mutations from the corpus, mapping clinical trials to the Human Phenotype Ontology (HPO), and finding associations between clinical trials and HPO nodes. Unique reports were created for each mutation (SNPs and protein mutations) mentioned in the corpus, as well as for each clinical trial that references a mutation. These reports, which have been run over multiple time points, along with APIs to access meta-data, are freely available online. Additionally, HPO was used to normalize disease terms and associate clinical trials with relevant genes. The creation of the pipeline and reports, the association of clinical trials with HPO terms, and the insights, public repository, and APIs produced are all novel in this work. The freely available resources present relevant biological information and novel insights between biomedical entities in a robust and accessible manner, mitigating the challenge of staying informed about new associations between mutations, genes, and diseases.


The rapid decrease in the cost of Next-Gen Sequencing (NGS) over the past decade has led to a multitude of new NGS-based studies. Frequently, these studies associate genomic mutations, such as protein mutations and Single Nucleotide Polymorphisms [1] (SNPs), with genes, drugs, diseases, and other phenotypes [2]. Knowledge about new associations is crucial for researchers and clinicians, since understanding an individual’s genetic mutations can help identify disease risk, improve prognosis, and tailor personalized treatments [3][4]. However, it is currently cumbersome to keep up with the rapid rate of discoveries, since manual efforts to curate the literature are highly time-consuming. ClinicalTrials.gov, run by the United States National Library of Medicine, contains more than 330,000 text documents detailing both past and present clinical trials globally [5]. A proportion of these trials includes information on SNPs, protein mutations, and genes.

Many previous researchers have effectively mined the clinical trials corpus to gain new insights: Zhang et al. 2019 [6] map Laboratory Observation Identifier Names and Codes (LOINC [7]) to Human Phenotype Ontology (HPO [8]) terms; Gandy et al. 2017 [9] develop CTMine, which uses regular expressions for gene names to search clinical trials; Xu et al. 2016 [10] curate genetic alterations in cancer clinical trials; Su and Sanger, 2017 [11] mine ClinicalTrials.gov to develop a novel method of drug repositioning; Pradhan et al. 2018 [12] conduct a meta-analysis by automatically extracting data from ClinicalTrials.gov; and Sfakianaki et al. 2015 [13] use a Natural Language Processing (NLP) framework to mine ClinicalTrials.gov.

However, despite these important advances, mapping clinical trials to HPO terms, extracting protein mutations and SNPs [14] across the corpus, and creating mutation-specific and clinical-trials-specific reports remain feats not yet accomplished.

This study analyzes ClinicalTrials.gov with six specific goals:

  1. Develop a Natural Language Processing based pipeline that extracts SNP and protein mutation instances from free text, maps their clinical trial annotations to standardized biological terms using the HPO and MeSH [15] ontologies, and analyzes the complete corpus to extract new insights between mutations and diseases in the clinical trials literature.
  2. Generate unique reports, made freely available online, for each of the extracted mutations. These reports should contain the context in which the mutation is mentioned across all clinical trials, along with the associated HPO disease terms. Further, HPO annotations [16] should be used to reference other genes associated with that disease. Reports should additionally be hyper-linked to key resources for easy access to relevant content. These reports enable the presentation of new biological information in a robust and accessible manner.
  3. Generate reports for each clinical trial that mentions a mutation. Statistics on the frequency and clinical trial categories in which mutations occur should also be provided.
  4. Create a freely-available public repository with data associating mutations, clinical trials, disease, HPO terms, and MeSH terms. Develop APIs to access the data programmatically.
  5. Repeat the analysis over multiple time frames, enabling future meta-analyses that may provide additional insights into mutation-disease associations over a period of time.
  6. Demonstrate, via an example, how the meta-data extracted from this work can be used for machine learning.

It is hypothesized that creating a public repository of associations between clinical trials, disease terms, SNPs, and protein mutations—and making such a repository freely-available via HTML reports, processed data, and APIs—will enable researchers and clinicians to stay up-to-date.

Materials and methods

Two publicly available datasets were used in this study: ClinicalTrials.gov and HPO. The methods described here are also publicly available online.

Datasets

ClinicalTrials.gov [5].

The complete repository of clinical trials displayed at ClinicalTrials.gov is available in XML format with a well-defined schema. However, analyzing clinical trial text to derive valuable insights is still a challenge, as it involves parsing free text [17].

HPO [8].

HPO is a standardized vocabulary of phenotype abnormalities that are seen in humans [8]. HPO is a product of the Monarch Initiative and one of the thirteen driver projects in the Global Alliance for Genomics and Health (GA4GH [18]) strategic roadmap. The HPO ontology files are available in the OBO [19] flat-file format and are easy to read and parse. HPO annotations provide a correlation between HPO terms and genes. There are three annotation files that contain associations between genes and phenotypes. The HPO files used in this project consisted of 14,961 HPO nodes, with 18,547 parent-child relationships between the nodes. Furthermore, 820,297 gene-phenotype annotations mapped across 4,312 unique genes and 8,947 individual HPO terms.

For each node, when applicable, the HPO ontology files contain a reference to MeSH, UMLS, and SnomedCT ontologies. For example, the HPO node “id: HP:0000003” with name “Multicystic kidney dysplasia” maps to the following four cross-ontology terms.

  1. “xref: MSH:D021782”, which implies MeSH id D021782 and name “Multicystic Dysplastic Kidney.”
  2. “xref: SNOMEDCT_US:204962002”, which implies SNOMEDCT id 204962002 and name: “Multicystic kidney”
  3. “xref: SNOMEDCT_US:82525005”, which implies SNOMEDCT id 82525005 and name: “Multiple congenital cysts of kidney”
  4. “xref: UMLS:C3714581”, which implies UMLS id C3714581 and name: “Multicystic dysplastic kidney”
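
The cross-references above can be read out of an OBO stanza with a few lines of text processing. The following is a minimal sketch in Python (the project itself used Java), assuming a stanza that contains only the tags listed in the example; the function name `parse_term` is hypothetical, and real HPO files may carry extra tags or trailing annotations on `xref` lines that this sketch does not handle.

```python
# Minimal sketch: collect cross-ontology references from one OBO [Term] stanza.
# The stanza below reproduces the example in the text; parse_term is a
# hypothetical helper, not part of the published pipeline.
SAMPLE_STANZA = """\
[Term]
id: HP:0000003
name: Multicystic kidney dysplasia
xref: MSH:D021782
xref: SNOMEDCT_US:204962002
xref: SNOMEDCT_US:82525005
xref: UMLS:C3714581
"""


def parse_term(stanza):
    """Return the id, name, and xrefs (grouped by ontology prefix) of a stanza."""
    term = {"id": None, "name": None, "xrefs": {}}
    for line in stanza.splitlines():
        if ": " not in line:
            continue  # skips the "[Term]" header and blank lines
        tag, value = line.split(": ", 1)
        if tag in ("id", "name"):
            term[tag] = value.strip()
        elif tag == "xref":
            # "MSH:D021782" -> prefix "MSH", accession "D021782"
            prefix, _, accession = value.strip().partition(":")
            term["xrefs"].setdefault(prefix, []).append(accession)
    return term


term = parse_term(SAMPLE_STANZA)
print(term["id"], term["name"])
print(term["xrefs"])
```

Grouping the accessions by prefix makes it straightforward to look up, for a given HPO node, the MeSH id that links it to a clinical trial.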

MeSH [15].

Although the ClinicalTrials.gov XML does not contain MeSH ids, information about MeSH terms is present. The MeSH online tool [20] was used to retrieve MeSH ids from MeSH terms. MeSH ids are directly linked to HPO ids, which enables the association of MeSH terms with HPO nodes, as discussed later in the Methods section.

Approaches for finding mutations

Mutation format.

The Human Genome Variation Society (HGVS) defines a format [21][22] for referencing variants. As per the specifications, all variants should be described at the DNA level, noting relations to an accepted reference sequence. Descriptions can be at the DNA level (e.g., 123456A>T), RNA level (e.g., 76a>u), and protein level (e.g., Lys76Asn). Ogino et al. 2009 [23] provide a good overview of the mutation nomenclature used for molecular diagnostics.

RSids and SNPs.

The Single Nucleotide Polymorphism database (dbSNP) repository [24] assigns a unique id to variations including SNPs, short nucleotide insertions and deletions, and short tandem repeats. These ids are called RSids and appear in the format rs##, i.e., the letters rs followed by a number. For example, the RSid rs35652124 maps to the following mutations in HGVS format: NC_000002.12:g.177265345T>C and NC_000002.11:g.178130073T>C [25]. It is a mutation on chromosome 2 at location 177265345, with associated gene NFE2L2. Public repositories such as ClinVar [26] archive human genetic variants and interpretations of the mutations’ significance to diseases. Such repositories use RSids as unique identifiers. ClinVar [24], for instance, has more than 400 thousand RefSNPs.

SNP extraction.

SNPs can be extracted with simple text-processing methods because every RSid begins with the letters rs, followed by one or more digits. For example, an SNP may appear under the id rs9939609 or rs6971.
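
The pattern-matching idea can be sketched as follows. This is an illustrative Python version, not the project's Java code; the pattern and the function name `extract_rsids` are assumptions made for this sketch, and it matches lowercase `rs` only.

```python
import re
from typing import List

# "rs" followed by one or more digits. The word boundaries keep the pattern
# from matching inside longer tokens such as "hours123".
RSID_PATTERN = re.compile(r"\brs\d+\b")


def extract_rsids(text: str) -> List[str]:
    """Return every RSid-like token in the text, in order of appearance."""
    return RSID_PATTERN.findall(text)


sentence = "Carriers of rs9939609 and rs6971 were enrolled; hours123 is ignored."
print(extract_rsids(sentence))
```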

Protein mutation extraction.

Several tools are available to mine mutations from the text. Some examples of such tools are:

  1. MutationFinder [27] is a simple-to-use package that uses a rule-based approach with more than 1500 regular expressions to extract protein mutations from the text.
  2. Open Mutation Miner [28] is a tool that detects and annotates protein mutations by combining rules with the MutationFinder. It also maps the impact of the mutation by integrating Gene Ontology (GO) [29].
  3. SNP Extraction Tool for Human Variations (SETH) [30] is an entity recognition tool that extends MutationFinder. SETH can recognize the following subtypes of mutations: substitution, deletion, insertion, duplication, insertion-deletion (insdel), inversion, conversion, translocation, frameshift, short-sequence repeat, and literal dbSNP mention. SETH also normalizes the genetic variant to a standard RSid.
  4. tmVar [31] is a mutation extraction tool based on a conditional random field model and covers a wide range of sequence variants at both protein and gene levels in HGVS format.
  5. tmVar 2 [32] builds on tmVar to automatically extract and map variants to unique identifiers (dbSNP RSIDs). tmVar 2.0 achieved an F-measure of nearly 90% for normalizing mutation ids and also compared well with SETH.

Yepes and Verspoor, 2014 [33] provide an overview of relative performance between the different mutation extraction tools. For this study, the MutationFinder tool was chosen for its precision and recall. A text processing pipeline was developed to first extract RSids (SNP mutations) using pattern matching; the MutationFinder tool was then applied to extract protein mutations. No changes were made to the MutationFinder Java code.
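
To give a feel for what rule-based protein mutation extraction involves, the sketch below matches only the simplest notation: a wild-type residue, a position, and a mutant residue, written with one-letter or three-letter amino acid codes (e.g., L858R or Thr790Met). It is a deliberately simplified stand-in written in Python, not MutationFinder's rule set, and it will produce false positives on look-alike tokens such as "A1C".

```python
import re
from typing import List

# Standard amino acid codes. The pattern below is an illustrative
# simplification and is NOT taken from MutationFinder.
ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"
THREE_LETTER = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)

PROTEIN_MUTATION = re.compile(
    rf"\b(?:[{ONE_LETTER}]\d+[{ONE_LETTER}]"          # e.g. L858R
    rf"|(?:{THREE_LETTER})\d+(?:{THREE_LETTER}))\b"   # e.g. Thr790Met
)


def extract_protein_mutations(text: str) -> List[str]:
    """Return substitution-style protein mutation mentions, in order."""
    return PROTEIN_MUTATION.findall(text)


sentence = "Patients with EGFR L858R or T790M (Thr790Met) and BRAF V600E were enrolled."
print(extract_protein_mutations(sentence))
```

A production extractor needs many more patterns (free-text forms such as "substitution of lysine by asparagine at position 76", for instance), which is one reason an established tool was used rather than a hand-written expression.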

Programming packages

Tools used throughout the project are displayed in Table 1. Java was the primary programming language.

Analysis steps

The seven main analysis steps are illustrated in Fig 1 and described in detail below.

  1. Download: XML files from ClinicalTrials.gov and HPO data files.
  2. Parse: The Java SAX parser framework efficiently parsed the XML files. In this step, for a given clinical-trial XML file, a fully-instantiated JavaBean object was created to represent the Clinical Trial. Key XML fields used in this study include Title, Summary, Study Type, Description, Outcomes, Arm, Study Design, MeSH Terms, Conditions, Intervention, Phase, Observational Model, and Keywords. The MeSH terms referenced in the XML were mapped to their MeSH ids using the procedure explained below:
    1. Created a list of MeSH terms referenced across all clinical trials.
    2. Retrieved MeSH ids using the MeSH online tool [20] for each of the MeSH terms in the list.
    In the same manner, the HPO ontology file was parsed to create a parent-child hierarchy: HPO annotation files were parsed, and associations between HPO nodes and genes were noted.
  3. Text Processing: The Apache OpenNLP library was utilized to split the clinical trial text into sentences. Using OpenNLP, a series of classes was created to tokenize the sentences. Regular expressions were used to detect SNPs and protein mutations. The process of detecting key entities is detailed below:
    1. Parse XML using the SAX parser.
    2. Create a JavaBean instance with attributes.
    3. Tokenize text by splitting the paragraphs into sentences and then each sentence into tokens.
    4. Use regular expressions to determine whether a specific token is a protein mutation or an SNP. As detailed in “SNP extraction” and “Protein mutation extraction,” particular regular expressions denote the presence of a mutation.
  4. Text Analyzers: Several crawlers were created to traverse through the local XML files and extract relevant information. Functions of the text processors are the following: create an index of all clinical trials; associate conditions with the clinical trials; extract SNPs, protein mutations, and MeSH terms from the tokens; derive frequency information and reports for SNPs, protein mutations, HPO nodes, MeSH nodes, etc.; and map clinical trials to HPO terms (in essence, normalizing to HPO nodes). Normalization is discussed further below.
  5. Normalization: Clinical trials were mapped to HPO nodes through the following process:
    1. MeSH ids were associated with HPO ids using the HPO data file.
    2. HPO ids are linked to an HPO node. Thus, clinical trials were correlated to MeSH terms, MeSH ids, and finally HPO nodes.
    The steps normalized the HPO terms to standardize correlations between overlapping terms.
  6. Report Generators: Reports were generated to analyze the processed data, display detailed information for each of the mutations, and showcase elements of the clinical trials in which the mutations appear.
  7. Host Reports: The final reports are hosted on an AWS S3 bucket [37]. Note that these static, hyperlinked HTML reports support user interactions. Java client APIs, along with a Google Colab document (Jupyter Notebook using Python), were created to make the produced analytics and results accessible programmatically.
Fig 1. Seven steps of the pipeline.

Methodology to mine ClinicalTrials.gov to extract unique insights for understanding SNPs and mutations. Each of the steps is described in detail in the “Analysis Steps” section.
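
The Parse and Normalization steps can be sketched together. The example below uses Python's built-in SAX parser rather than the Java SAX framework used in the project. The element names (`nct_id`, `brief_title`, `condition`, `mesh_term`), the sample trial, and the two lookup tables are assumptions made for illustration; only the MeSH id D003922 and the HPO id HP:0100651 come from the article.

```python
import xml.sax

# Hypothetical trial record; element names are assumed for this sketch.
SAMPLE_XML = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_title>Hypothetical trial used only for this sketch</brief_title>
  <condition>Type 1 Diabetes</condition>
  <condition_browse><mesh_term>Diabetes Mellitus, Type 1</mesh_term></condition_browse>
</clinical_study>"""


class TrialHandler(xml.sax.ContentHandler):
    """Collects the text of a few fields, playing the role of the JavaBean."""

    FIELDS = {"nct_id", "brief_title", "condition", "mesh_term"}

    def __init__(self):
        super().__init__()
        self.trial = {name: [] for name in self.FIELDS}
        self._current = None
        self._buffer = []

    def startElement(self, name, attrs):
        if name in self.FIELDS:
            self._current = name
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so buffer them.
        if self._current is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._current:
            self.trial[name].append("".join(self._buffer).strip())
            self._current = None


def parse_trial(xml_text):
    handler = TrialHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.trial


# Illustrative lookup tables: MeSH term -> MeSH id -> HPO id.
MESH_TERM_TO_ID = {"Diabetes Mellitus, Type 1": "D003922"}
MESH_ID_TO_HPO = {"D003922": "HP:0100651"}


def normalize(trial):
    """Map a parsed trial to HPO ids via its MeSH terms."""
    hpo_ids = []
    for mesh_term in trial["mesh_term"]:
        mesh_id = MESH_TERM_TO_ID.get(mesh_term)
        hpo_id = MESH_ID_TO_HPO.get(mesh_id)
        if hpo_id and hpo_id not in hpo_ids:
            hpo_ids.append(hpo_id)
    return hpo_ids


trial = parse_trial(SAMPLE_XML)
print(trial["nct_id"], normalize(trial))
```

A streaming parser suits this corpus because each of the several hundred thousand files can be reduced to a small record without building a full document tree in memory.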

Machine learning example

A simple example illustrates how the insights produced from this work can be applied biologically via machine learning. In this instance, clusters of similar HPO terms are desired for research purposes. Similar HPO terms were identified by analyzing the correlations between SNPs and HPO terms. For example, if two HPO terms are linked to the same SNP, those two terms have a high probability of being related. The Java code for this example is available at the SNP Miner Trials homepage. To illustrate how machine learning can be applied to the results and analytics produced, the following procedure was applied:

  1. Use the tabular data (available from the homepage) to create an incidence matrix in which each row is an HPO node and each column is an SNP. With m HPO nodes and n SNPs, the matrix has m rows and n columns. The element in row i and column j is set to one if the i-th HPO node is correlated with the j-th SNP, and to zero otherwise.
  2. Normalize the data by creating a unit vector for each HPO term. Unit vectors are obtained by dividing each element of a row by the magnitude of that row.
  3. For each HPO term, compute the pair-wise dot product between its vector and all other vectors. Because the vectors have unit length, each dot product is the cosine similarity between the two terms, which serves as a metric of normalized correlation.
  4. Sort the results to create a prioritized list of related HPO terms.
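
The four steps can be sketched in a few lines of Python. The node and SNP labels below are placeholders invented for this sketch, not data from the study, and the function names are hypothetical.

```python
import math

# Placeholder associations (HPO node -> SNPs); not data from the study.
ASSOCIATIONS = {
    "HPO_A": {"snp_1", "snp_2"},
    "HPO_B": {"snp_1", "snp_2", "snp_3"},
    "HPO_C": {"snp_4"},
}


def unit_vectors(associations):
    """Steps 1 and 2: build 0/1 incidence rows, then scale each to unit length."""
    snps = sorted({snp for linked in associations.values() for snp in linked})
    vectors = {}
    for node, linked in associations.items():
        row = [1.0 if snp in linked else 0.0 for snp in snps]
        magnitude = math.sqrt(sum(x * x for x in row))
        vectors[node] = [x / magnitude for x in row] if magnitude else row
    return vectors


def related_terms(node, vectors):
    """Steps 3 and 4: dot product with every other node, sorted high to low."""
    scores = [
        (other, sum(a * b for a, b in zip(vectors[node], vector)))
        for other, vector in vectors.items()
        if other != node
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


vectors = unit_vectors(ASSOCIATIONS)
ranked = related_terms("HPO_A", vectors)
print(ranked)
```

In this toy input, HPO_A and HPO_B share two SNPs, so their similarity is 2/√6 (about 0.82), while HPO_C shares none and scores zero.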

Hierarchical Clustering [41] or K-Means [42] could also be used to find clusters of related HPO terms. A similar process can be used with protein mutations—in place of SNPs—as well. Alternatively, HPO terms could be clustered based on both protein and SNP mutations. The rows and columns can be switched to cluster similar SNP/protein mutations by their associated HPO terms [43].


Results

The “Results” section comprises six sub-topics:

  1. Details on the created public repository to provide access to the data used, reports created, correlations mapped, and APIs produced.
  2. Insights about the corpus after normalizing the data using MeSH and HPO ontologies.
  3. Insights about the mined SNPs.
  4. Insights about the extracted protein mutations.
  5. Analysis of popular interventions.
  6. Findings related to the machine learning example.

Public repository

Web page to access longitudinal analysis data, reports, and APIs.

All analysis results are accessible via the SNP Miner Results home page. A view of the home page is seen in S1 Fig. The web page provides access to data and reports from multiple time frames. As of March 2020, there are two analysis time points: August 2019 and March 2020. Additionally, the home page has links to Java APIs and Google Colab pages, which facilitate easy local access to the insights and results of this research. The SNP Miner Results home page provides the latest analysis results; because of the constant influx of new clinical trials and enhancements to HPO and the HPO annotation files, the results are subject to change.

Java APIs, as well as a Google Colab Notebook (see S1 Fig) with Python, allow the results to be easily accessed programmatically.

The functionalities of the various APIs are to retrieve information about the following:

  1. The MeSH terms and MeSH ids used to tag the corpus
  2. HPO terms and their corresponding clinical trials
  3. RSids and their corresponding clinical trials
  4. *Relevant MeSH ids and their correlated clinical trials
  5. *Relevant HPO ids and their correlated clinical trials
  6. Protein mutations and their corresponding clinical trials

*Only the specific terms that have any correlation to a mutation are shown.

Additionally, there are results discussing the machine learning example mentioned earlier.

Term normalization

The clinical trial XML contains a field called “Condition”, which is a free-formed annotation associated with the clinical trial. S2 Fig shows frequently occurring conditions (referenced more than 1,000 times) across the clinical trial documents. Since these conditions are free-formed and not mapped to a standard ontology, multiple distinct terms refer to the same condition. For example, six terms that refer to “Type 1 Diabetes” appear throughout the clinical trials: “Diabetes Mellitus, Type 1,” “Type 1 Diabetes,” “Type 1 Diabetes Mellitus,” “Type1diabetes,” “Type1 Diabetes Mellitus,” and “Diabetes Mellitus Type 1.” Standard ontologies such as MeSH and HPO map these variant terms to a single ontology node: D003922 [39] for MeSH and HP:0100651 [40] for HPO. There were 87,656 unique conditions and 559,918 total condition mentions. Thus, normalization was pivotal in standardizing the results.
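
The effect of normalization on counts can be illustrated with a small Python sketch. The lookup table below is hypothetical and built only from the six variants quoted above; in the pipeline itself, the link to an ontology node comes from the MeSH tags attached to each trial rather than from string matching on the Condition field.

```python
from collections import Counter

# Hypothetical lookup: the six free-text variants quoted in the text,
# all pointing at the single MeSH node D003922.
VARIANT_TO_MESH_ID = {
    "Diabetes Mellitus, Type 1": "D003922",
    "Type 1 Diabetes": "D003922",
    "Type 1 Diabetes Mellitus": "D003922",
    "Type1diabetes": "D003922",
    "Type1 Diabetes Mellitus": "D003922",
    "Diabetes Mellitus Type 1": "D003922",
}


def count_conditions(conditions):
    """Return (raw counts, counts after mapping variants to a MeSH id)."""
    raw = Counter(conditions)
    normalized = Counter(VARIANT_TO_MESH_ID.get(c, c) for c in conditions)
    return raw, normalized


# One mention of each variant, plus a repeat of the second one.
mentions = list(VARIANT_TO_MESH_ID) + ["Type 1 Diabetes"]
raw, normalized = count_conditions(mentions)
print(len(raw), "distinct raw strings ->", dict(normalized))
```

Before normalization the seven mentions are spread over six distinct strings; afterwards they are counted against one node.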

In the XML data, each clinical trial contains a list of associated MeSH tags. As described in the “Methods” section, these MeSH tags were useful in linking MeSH terms to HPO terms and MeSH ids to HPO ids.

Using information about MeSH tags, multiple analytics were produced: 6,643 unique MeSH tags have been cited 568,784 times across the 332,418 clinical trials; approximately 81% of the clinical trials have a MeSH annotation, and around 62% of the trials have a MeSH annotation with an associated HPO term mapped to a gene. S2 Fig displays all of the MeSH terms with at least 2,000 total tags ranked by frequency.

Results from extracting RSids

There were 566 unique RSids across 368 clinical trials, with a total of 798 mentions. Table 2 contains the top three most frequently occurring RSids, while S2 Fig shows a tabular view of frequently occurring SNPs and HPO terms. rs12979860 co-occurs with “HP:0012115 Hepatitis” 33 times. rs12979860, which occurs near IL28B, is in fact used for selecting Hepatitis C treatment [44], validating the methodology and results. Other notable SNPs referenced multiple times across the corpus are rs6971, which is associated with brain diseases [46], and rs9939609, which is associated with fat mass and obesity [47]. These results help validate the pipeline, since all of these SNPs are already well known and well studied.

Validation case.

To further validate the pipeline, 37 SNPs associated with “HP:0003002 Breast carcinoma” were analyzed. These SNPs are rs1011970, rs10407022, rs1045485, rs10941679, rs10995190, rs11045585, rs11133360, rs11249433, rs12762549, rs13281615, rs13387042, rs16942, rs1800566, rs2002555, rs2046210, rs2237060, rs2241193, rs2297480, rs236114, rs2380205, rs271924, rs2981582, rs3803662, rs3817198, rs4073, rs4646, rs4973768, rs614367, rs6504950, rs704010, rs7333181, rs7349683, rs889312, rs909253, rs9344, rs9457827, and rs999737. Each of these was manually verified for associations with breast cancer. As expected, every one of them had a known association with breast cancer, further illustrating the accuracy and effectiveness of the methodology. The Java API toolkit includes an API that returns a list of SNPs for an associated HPO node.
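
A lookup of this kind amounts to an inverted index from HPO node to RSids. The Python sketch below shows the idea only; it is not the Java API, the record layout and the function name `snps_by_hpo` are assumptions, and the trial ids are placeholders. The RSid and HPO pairings are ones mentioned in the text.

```python
from collections import defaultdict

# (trial id, RSid, HPO id) records. Trial ids are placeholders.
RECORDS = [
    ("TRIAL_1", "rs12979860", "HP:0012115"),  # Hepatitis
    ("TRIAL_2", "rs2981582", "HP:0003002"),   # Breast carcinoma
    ("TRIAL_3", "rs3803662", "HP:0003002"),
    ("TRIAL_4", "rs2981582", "HP:0003002"),   # repeat mention in another trial
]


def snps_by_hpo(records):
    """Build an index from HPO id to the sorted, de-duplicated RSids seen with it."""
    index = defaultdict(set)
    for _trial_id, rsid, hpo_id in records:
        index[hpo_id].add(rsid)
    return {hpo_id: sorted(rsids) for hpo_id, rsids in index.items()}


index = snps_by_hpo(RECORDS)
print(index["HP:0003002"])
```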

MeSH terms, HPO terms, and reports.

S2 Fig illustrates the most prominent MeSH ids referenced across the 368 clinical trials with RSids. Interestingly, the first set of MeSH terms was related to Hepatitis, with more than 10% (37 out of 368) of clinical trials falling into this category, demonstrating the quantity of research involving mutations and Hepatitis.

The most cited HPO terms fall into the areas of Hepatitis, Diabetes, Cancer (Breast carcinoma, Leukemia), abnormality of the cardiovascular system, and Schizophrenia. S2 Fig shows the key HPO terms with associated SNPs across the clinical trial corpus. The 368 clinical trials mapped to 136 different HPO terms and were referenced 368 times. The frequency of HPO terms sheds light on the areas that researchers are prominently interested in.

Table 3 shows the HPO nodes with the largest numbers of associated RSids. Breast carcinoma had 38 unique RSids associated with it, suggesting that genetic mutations possibly influence Breast Cancer. Other diseases with large numbers of associated RSids include Impulsivity, Aggressive behavior, Diabetes mellitus, Hepatitis, and Asthma.

An HTML report was created for each of the 566 unique RSids, and reports over multiple time periods are freely available via the home page. As shown in S1 Fig, each report contains a list of the clinical trials in which the SNP appears, along with the sentences containing the SNP. Each clinical trial report also shows the mapped HPO as well as MeSH terms, both of which are hyperlinked to other reports and external resources. As shown in S1 Fig, the HPO terms and their associated genes are also displayed at the bottom of the report. All 566 SNPs are displayed on the left-hand side of the report to enable easy navigation across the RSids.

Similarly, an HTML report was generated for each of the 368 unique clinical trials that mentioned SNPs. Reports, over multiple time periods, are freely available. As shown in S1 Fig, all reports contain the details of the clinical trial, the list of SNPs mentioned, and the sentences in which each SNP appears. Every clinical trial report shows the mapped HPO and MeSH terms, which are also hyperlinked. S1 Fig highlights the unique RSid terms and their associated sentences, which are also displayed at the bottom of the report. All the 368 clinical trial ids are displayed on the left-hand side of the report to enable easy navigation across the clinical trials.

Results of extracting protein mutations from the clinical trial corpus using MutationFinder

There were 962 unique protein mutations across 1,939 clinical trials, with a total of 3,881 mentions.

Table 4 contains the four most frequently occurring protein mutations. The protein mutation L858R is cited in 293 clinical trials, of which 233 mapped to the HPO node “HP:0030358 Non-small cell lung carcinoma,” suggesting a correlation between L858R and Lung Cancer. The 293 clinical trials that mention L858R map to 21 HPO nodes, most of which are associated with Cancer, e.g., “HP:0100526 Neoplasm of the lung,” “HP:0030731 Carcinoma,” and “HP:0030692 Brain Neoplasm.” Similarly, T790M (synonym: Thr790Met) is cited across 289 clinical trials, which frequently map to cancer-related HPO nodes, indicating the vast amount of Cancer research performed. V600E and T315I, with 228 and 98 citations respectively, are the next two most commonly cited protein mutations. V600E is associated with Cutaneous melanoma, Neoplasm of the large intestine, and Thyroid adenoma, while T315I is associated with Leukemia, Chronic myelogenous Leukemia, and Myeloid leukemia.

The 1,939 unique clinical trials that referenced protein mutations were subsequently analyzed. MeSH terms that appear frequently across clinical trials that contain protein mutations are shown in Fig 2. Fig 3 illustrates MeSH terms that frequently appear for both the RSid and protein mutation cases. In Fig 3, multiple MeSH terms are related to Hepatitis and Cancer, further demonstrating the quantity of research in these fields.

Fig 2. Bubble graph showing the key MeSH nodes used to tag clinical trials with protein mutations.

Fig 3. Common MeSH terms for clinical trials with RSid and protein mutation frequencies.

Similarly, Table 5 shows the top HPO terms referenced across these 1,939 clinical trials with protein mutations. The HPO node HP:0030358 “Non-small cell lung carcinoma” is associated with 382 clinical trials, followed by HP:0100526 “Neoplasm of the lung” with 284 clinical trials. “Leukemia,” “Cutaneous melanoma,” “Myeloid Leukemia,” “Neoplasm,” “Chronic myelogenous leukemia,” “Myeloid leukemia,” “Carcinoma,” “Neoplasm of the large intestine,” and “Lymphoma” are the remaining HPO terms with the largest numbers of associated clinical trials. The quantity of Cancer nodes possibly suggests a correlation between mutations and Cancer.

Table 5. HPO terms with the most cited protein mutations found by MutationFinder in ClinicalTrials.gov.

Next, the number of protein mutations for each of the referenced HPO terms was analyzed, as shown in Table 6. HP:0002664 “Neoplasm” has 75 associated protein mutations, while HP:0003002 “Breast Carcinoma” is next with 73 mutations. “Carcinoma,” “Lymphoma,” “Neoplasm of the lung,” “Leukemia,” “Non-small cell lung carcinoma,” and “Non-Hodgkin lymphoma” are the next six HPO nodes with the largest numbers of associated protein mutations.

Fig 4 shows the distribution of HPO terms across (a) all clinical trials, (b) those with RSids, and (c) those with protein mutations. Interestingly, Diabetes Mellitus is the most commonly occurring HPO Term across all clinical trials.

Fig 4. Frequency of different HPO terms across clinical trials, across trials with RSids, and across trials with protein mutations.

HTML reports were created for each of the 962 unique protein mutations and are freely available from the SNP Miner home page. As shown in S1 Fig, each report contains a list of clinical trials where the protein mutation appears, along with the sentences containing the mutations. Each protein mutation report shows the mapped HPO as well as MeSH terms. All 962 protein mutations are displayed on the left-hand side of the report to enable easy navigation. Similarly, reports for each of the clinical trials that reference a protein mutation are also available.


Interventions (or treatments) are the focus of a clinical trial and are categorized into eleven different types, as shown in Table 7. There are 573,887 unique Intervention tags across the eleven different Intervention Types.

Each Intervention tag was categorized into one of two mutually exclusive categories: one that had a clinical trial with an HPO term (and consequently was associated with a gene), and the other that did not have an HPO term. The last column shows the percentage of Intervention Types that were mapped to clinical trials with associated genes; the Radiation Intervention Type had the highest percentage with 83.2%, indicating the dependence of Radiation research on genetic information. Fig 5 shows four subgraphs: the first illustrates the relative frequency distribution of clinical trial interventions across the eleven categories; the second is the percent distribution of clinical trials with HPO nodes associated with genes; the third depicts the percent of the clinical trials that have an RSid; and the fourth displays the percent of clinical trials that have a protein mutation. As expected, clinical trials with the “Genetic Intervention” type had the highest percent of clinical trials with SNPs and protein mutations, at 2.34% and 4.1%, respectively. Intervention types “Drug” and “Radiation” also had a high incidence of protein mutations, with 1.4% and 1.04%, respectively, of the clinical trials referencing mutations.

Fig 5. Percentage of clinical trials in each of the eleven categories with RSids and protein mutations.

(a) The first graph shows the relative frequency of clinical trials in each of the eleven Intervention types. (b) The second shows the percent of clinical trials in each of the categories that link to an HPO term and have an associated gene. (c) The third shows the relative frequency of clinical trials in each of the categories that had an associated RSid. (d) The fourth shows the percent of clinical trials in each of the categories that had an associated protein mutation.

Machine learning application: Results

Three representative HPO nodes were selected to demonstrate the results of the clustering by SNP. The HPO nodes most similar to each are shown in Table 8 and discussed below.

  1. HP:0001909 Leukemia: As expected, the most common HPO nodes related to “HP:0001909 Leukemia” are all associated with different kinds of Leukemia, validating the methodology. Yet, lower in the list, nodes like “HP:0004757 Paroxysmal atrial fibrillation” seem out of place. However, patients with Leukemia are treated with the drug, Ibrutinib, a Bruton’s tyrosine kinase inhibitor [48] that has two adverse effects: atrial fibrillation and bleeding. Therefore, “HP:0004757 Paroxysmal atrial fibrillation” is correctly linked to “HP:0001909 Leukemia,” illustrating that this machine learning example incorporates multiple features of HPO Nodes and their corresponding mutations to highlight interesting and possibly novel correlations. Similarly, Leukemia is related to Dysmenorrhea [49] and Depressivity [50] through this methodology, illustrating the effectiveness of such Machine Learning applications in possibly finding novel correlations between diseases/conditions.
  2. HP:0000819 Diabetes mellitus: As expected, “HP:0000819 Diabetes mellitus” is associated with different elements of diabetes, the kidneys, weight, insulin, the gastrointestinal tract, the liver, and the cardiovascular system, further validating the methodology and pipeline.
  3. HP:0001824 Weight loss: As the last example, the generic non-disease term “Weight Loss” was selected. The algorithm still produced sensible results for “Weight Loss,” as the most common correlations were related to the gastrointestinal tract, blood-forming tissues, diabetes, the kidneys, insulin, the liver, and the cardiovascular system.
Table 8. Related HPO terms using co-occurrences of RSids and HPO terms.

Readers are encouraged to use the APIs developed to try out the complete analysis using both SNPs and protein mutations.

Conclusion and future work

In this work, protein mutations and SNPs were successfully mined from ClinicalTrials.gov. Additionally, mutations and clinical trials were associated with HPO and MeSH ontologies. The benefits of using ontologies to help normalize free-formed text were demonstrated, and the mapping from MeSH to HPO also enabled the finding of genes associated with the HPO term. Unique reports for each mutation and clinical trial were created, helping researchers mine associations between mutations, genes, and diseases. These reports are freely available on the web, along with APIs (Java and Google Colab notebooks) for programmatic access. Further, the publicly available site contains analysis at multiple time points, providing researchers with longitudinal information about clinical trials and associated entities, as well as demonstrating the reproducibility of the methods. Programmatic access to the data connecting SNPs and protein mutations with MeSH and HPO terms can also be useful for machine learning, as demonstrated above.

Future work will extend the framework to other mutation types and generate further insights from the data. The pipeline can also be applied to other scientific corpora, such as PubMed [51] and PubMed Central [52]. Additional insights can be obtained by extracting further biomedical entities from the clinical trials corpus. For example, the U.S. Food and Drug Administration (FDA), through its Center for Biologics Evaluation and Research (CBER) and Center for Drug Evaluation and Research (CDER) [53], has a rich repository of drug information.

Supporting information

S1 Fig. Screenshots of SNPMiner homepage, various reports, and API toolkits.


S2 Fig. Graphs of different analysis reports.



Acknowledgments

The author would like to thank Ayush Alag, Princeton University, for his valuable feedback on the manuscript and guidance during the project. Further, the author would like to thank Dr. Eric Nelson, The Harker School, for his encouragement on the project and valuable feedback on the manuscript.

References


  1. What are single nucleotide polymorphisms (SNPs)? Accessed March 2020.
  2. Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Accessed August 2019.
  3. What are genome-wide association studies? NIH Genetics Home Reference.
  4. Yepes AJ, MacKinlay A, Gunn N, Schieber C, Faux N, Downton M, et al. A hybrid approach for automated mutation annotation of the extended human mutation landscape in the scientific literature. AMIA Annu Symp Proc. 2018;2018:616–623. Published 2018 Dec 5.
  5. ClinicalTrials.gov. Accessed August 2019.
  6. Zhang XA, Yates A, Vasilevsky N, Gourdine JP, Callahan TJ, Carmody LC, et al. Semantic integration of clinical laboratory tests from electronic health records for deep phenotyping and biomarker discovery. NPJ Digit Med. 2019;2:32.
  7. The international standard for identifying health measurements, observations, and documents.
  8. The Human Phenotype Ontology.
  9. Gandy LM, Gumm J, Blackford AL, Fertig EJ, Diaz LA Jr. A Software Application for Mining and Presenting Relevant Cancer Clinical Trials per Cancer Mutation. Cancer Inform. 2017;16:1176935117711940. Published 2017 Jun 22.
  10. Xu J, Lee HJ, Zeng J, Wu Y, Zhang Y, Huang LC, et al. Extracting genetic alteration information for personalized cancer therapy from ClinicalTrials.gov. J Am Med Inform Assoc. 2016;23(4):750–757.
  11. Su EW, Sanger TM. Systematic drug repositioning through mining adverse event data in ClinicalTrials.gov. PeerJ. 2017;5:e3154. Published 2017 Mar 23. pmid:28348935
  12. Pradhan R, Hoaglin DC, Cornell M, Liu W, Wang V, Yu H. Automatic extraction of quantitative data from ClinicalTrials.gov to conduct meta-analyses. Journal of Clinical Epidemiology. 105.
  13. Sfakianaki P, Koumakis L, Sfakianakis S, et al. Semantic biomedical resource discovery: a Natural Language Processing framework. BMC Med Inform Decis Mak. 2015;15:77. pmid:26423616
  14. What Are RS Numbers (Rsid)?
  15. NIH MeSH
  16. Provides a link between genes and HPO terms. All phenotype terms associated with any disease that is associated with variants in a gene are assigned to that gene in this file.
  17. Clinical trials XML schema
  18. Global Alliance for Genomic Health
  19. The OBO Flat File Format Specification, version 1.2
  20. NCBI MeSH
  21. den Dunnen JT, Dalgleish R, Maglott DR, Hart RK, Greenblatt MS, McGowan-Jordan J, Roux A, Smith T, Antonarakis SE, Taschner PE. HGVS Recommendations for the Description of Sequence Variants: 2016 Update. Hum Mutat. 2016;37:564–569.
  22. Sequence Variant Nomenclature.
  23. Ogino S, Gulley ML, den Dunnen JT, Wilson RB; Association for Molecular Pathology Training and Education Committee. Standard mutation nomenclature in molecular diagnostics: practical and educational challenges [published correction appears in J Mol Diagn. 2009 Sep 1;11(5):494]. J Mol Diagn. 2007;9(1):1–6.
  24. dbSNP.
  25. dbSNP rs35652124.
  26. Landrum MJ, Lee JM, Benson M, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46(D1):D1062–D1067. pmid:29165669
  27. Caporaso JG, Baumgartner WA Jr, Randolph DA, Cohen KB, Hunter L. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics. 2007;23(14):1862–1865.
  28. Naderi N, Witte R. Automated extraction and semantic analysis of mutation impacts from the biomedical literature. BMC Genomics. 2012;13 Suppl 4(Suppl 4):S10. Published 2012 Jun 18.
  29. Gene Ontology
  30. Thomas P, Rocktäschel T, Hakenberg J, Mayer L, Leser U. SETH detects and normalizes genetic variants in text. Bioinformatics. 2016.
  31. Wei CH, Harris BR, Kao HY, Lu Z. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics. 2013;29(11):1433–1439.
  32. Wei CH, Phan L, Feltz J, Maiti R, Hefferon T, Lu Z. tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine. Bioinformatics. 2018;34(1):80–87.
  33. Yepes JA, Verspoor K. Mutation extraction tools can be combined for robust recognition of genetic variants in the literature. F1000Res. 2014;3:18. Published 2014 Jan 21.
  34. Oracle: Parsing an XML File Using SAX
  35. Welcome to Apache OpenNLP
  36. Bootstrap
  37. Start Building on AWS Today
  38. What is Colaboratory?
  39. Diabetes Mellitus, Type 1.
  40. Diabetes Mellitus HP:0000819.
  41. What is Hierarchical Clustering?
  42. K Means
  43. Alag S. Collective Intelligence in Action. Manning Publications Co.; 2008. ISBN: 1933988312.
  44. rs12979860: SNPedia
  45. rs8099917: SNPedia
  46. rs6971: SNPedia
  47. rs9939609: SNPedia
  48. Khalid S, Yasar S, Khalid A, Spiro T, Haddad A, et al. Management of Atrial Fibrillation in Patients on Ibrutinib: A Cleveland Clinic Experience. Cureus. 2018 May;10(5):e2701. pmid:30062075
  49. Wu Q, Lian Y, Chen L, Yu Y, Lin T. Alleviation of Symptoms and Improvement of Endometrial Receptivity Following Laparoscopic Adenomyoma Excision and Secondary Therapy with the Levonorgestrel-releasing Intrauterine System. Reprod Sci. 2020 Jan 6.
  50. Papathanasiou IV, Kelepouris K, Valari C, Papagiannis D, Tzavella F, Kourkouta L, et al. Depression, anxiety and stress among patients with hematological malignancies and the association with quality of life: a cross-sectional study. Med Pharm Rep. 2020 Jan;93(1):62–68.
  51. PubMed
  52. PubMed Central
  53. U.S. Food & Drug Administration