Bioinformatics pipeline for the systematic mining genomic and proteomic variation linked to rare diseases: The example of monogenic diabetes

doi:10.1371/journal.pone.0300350

Fig 1.

General architecture of the pipeline.

Module 1: extract variants from Ensembl affecting genes linked to monogenic diabetes. Module 2: filter variants by their consequence on the protein. Module 3: extract variants from ClinVar affecting genes linked to monogenic diabetes. Modules 4 and 5: extract variants from the literature using the mining performed by Rafique et al. Module 6: consolidate the variants in a single table. Translation: produce the possible variant protein sequences. Abbreviations: MODY, maturity-onset diabetes of the young; MD, monogenic diabetes; ND, neonatal diabetes; API, application programming interface; PMID, identifiers of scientific publications in the PubMed database; dbSNP identifiers, identifiers of genomic variants in the dbSNP database, which start with “rs”.
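The filtering and consolidation steps (Modules 2 and 6) can be illustrated with a minimal sketch. It is not the authors' implementation: the record fields, the placeholder rs identifiers, and the set of retained consequence types are all assumptions made for illustration, and the consequence names are Sequence Ontology terms of the kind Ensembl reports.

```python
# Hypothetical records; field names and identifiers are placeholders, not real data.
ensembl_variants = [
    {"rsid": "rs0000001", "gene": "HNF1A", "consequence": "missense_variant"},
    {"rsid": "rs0000002", "gene": "HNF1A", "consequence": "intron_variant"},
    {"rsid": "rs0000003", "gene": "ABCC8", "consequence": "stop_gained"},
]
clinvar_variants = [
    {"rsid": "rs0000001", "gene": "HNF1A", "significance": "pathogenic"},
    {"rsid": "rs0000004", "gene": "ABCC8", "significance": "uncertain significance"},
]

# Assumed set of protein-affecting consequence types (the real list may differ).
PROTEIN_CONSEQUENCES = {"missense_variant", "stop_gained", "frameshift_variant"}


def filter_by_consequence(variants, keep=PROTEIN_CONSEQUENCES):
    """Module 2 sketch: keep only variants whose consequence is in `keep`."""
    return [v for v in variants if v["consequence"] in keep]


def consolidate(reference, *sources):
    """Module 6 sketch: one row per reference variant, annotated with the
    names of the sources that also report that dbSNP identifier."""
    table = {v["rsid"]: {**v, "sources": []} for v in reference}
    for name, records in sources:
        for rec in records:
            row = table.get(rec["rsid"])
            if row is not None:  # variants absent from the reference are skipped
                row["sources"].append(name)
    return table


reference = filter_by_consequence(ensembl_variants)
table = consolidate(reference, ("ClinVar", clinvar_variants))
for rsid, row in table.items():
    print(rsid, row["gene"], row["consequence"], row["sources"])
```

Keying the consolidated table on the dbSNP identifier is one simple way to join sources; it assumes every source record carries such an identifier.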


Fig 2.

Distribution of the consequence types within pathogenicity categories of the variants.

A: Variants from Ensembl. B: Variants from ClinVar. The heatmap shows the percentage of variants within pathogenicity categories for each consequence type. Arrows indicate the consequence types that were retained in the Ensembl reference table created in Module 2 after filtering. “Stop lost” is marked with a blue arrow because the two sources give conflicting evidence for it (see the text for details).
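The values behind such a heatmap can be computed as below. This is a sketch under one assumption about the normalisation: for each consequence type, the percentages across pathogenicity categories sum to 100. The function name and the toy input are hypothetical.

```python
from collections import Counter, defaultdict


def category_percentages(variants):
    """For each consequence type, return the percentage of its variants
    that fall into each pathogenicity category.

    `variants` is an iterable of (consequence_type, pathogenicity_category) pairs.
    """
    counts = defaultdict(Counter)
    for consequence, category in variants:
        counts[consequence][category] += 1

    percentages = {}
    for consequence, per_category in counts.items():
        total = sum(per_category.values())
        percentages[consequence] = {
            category: 100.0 * n / total for category, n in per_category.items()
        }
    return percentages


# Toy input, not taken from the paper.
toy = [
    ("missense_variant", "pathogenic"),
    ("missense_variant", "pathogenic"),
    ("missense_variant", "pathogenic"),
    ("missense_variant", "benign"),
    ("stop_gained", "pathogenic"),
]
heatmap = category_percentages(toy)
print(heatmap)
```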


Fig 3.

A: All isoforms of the protein product of HNF1A, with the amino acid variants from the database mapped onto the sequence. B: Examples of random fragments of the canonical isoform products of HNF1A and ABCC8, showing the difference in variant density between the level 1 and level 2 databases. The top row of blue dots marks the positions of variants from the level 2 database, whilst the bottom row of green dots marks the positions of variants from the level 1 database. As the most well-studied gene, HNF1A has an almost equal number of dots in both rows because most of its reported variants have been revised and confirmed as pathogenic. ABCC8 has many variants in the level 1 database that are not confirmed as pathogenic and are therefore absent from the level 2 database.
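Mapping an amino acid variant onto an isoform sequence amounts to a position-checked substitution. The sketch below shows that idea for a single substitution only; the function name and the toy sequence are hypothetical, and the sequence is not a real HNF1A or ABCC8 fragment.

```python
def apply_substitution(sequence, position, ref, alt):
    """Return `sequence` with the residue at 1-based `position` changed
    from `ref` to `alt`. Raise ValueError if the reference residue does
    not match, which usually signals a wrong isoform or coordinate."""
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1] != ref:
        raise ValueError(
            f"expected {ref} at position {position}, found {sequence[position - 1]}"
        )
    return sequence[: position - 1] + alt + sequence[position:]


toy_sequence = "MKTAYIAK"  # made-up peptide for illustration
variant_sequence = apply_substitution(toy_sequence, 3, "T", "A")
print(variant_sequence)
```

Checking the reference residue before substituting is a cheap guard when the same variant has to be placed on several isoforms whose numbering differs.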


Fig 4.

Venn diagram representing the number of variants in two levels of the database and the number of variants taken from different sources.

The level 1 database (left) consists of: i. ClinVar variants retrieved by querying three phenotypes (“MODY”, “monogenic diabetes”, and “neonatal diabetes”) and mapped to Ensembl; ii. all variants from Rafique et al. mapped to Ensembl. The level 2 database (right) consists of: i. ClinVar “pathogenic” and “likely pathogenic” variants; ii. variants from Rafique et al., excluding those in BLK, KLF11, and PAX4.
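The two levels can be expressed as set operations. The following sketch assumes that level 2 is derived by filtering the records already admitted to level 1; the record fields, the identifiers, and the lower-cased significance labels are hypothetical.

```python
EXCLUDED_GENES = {"BLK", "KLF11", "PAX4"}
LEVEL2_SIGNIFICANCE = {"pathogenic", "likely pathogenic"}


def build_levels(clinvar, literature, ensembl_ids):
    """Return (level1, level2) as sets of variant identifiers.

    Level 1: ClinVar and literature variants that map to Ensembl.
    Level 2: of those, ClinVar variants labelled pathogenic or likely
    pathogenic, plus literature variants outside the excluded genes.
    """
    clinvar_mapped = [v for v in clinvar if v["id"] in ensembl_ids]
    literature_mapped = [v for v in literature if v["id"] in ensembl_ids]

    level1 = {v["id"] for v in clinvar_mapped} | {v["id"] for v in literature_mapped}
    level2 = {
        v["id"] for v in clinvar_mapped
        if v["significance"].lower() in LEVEL2_SIGNIFICANCE
    } | {
        v["id"] for v in literature_mapped
        if v["gene"] not in EXCLUDED_GENES
    }
    return level1, level2


# Toy records, not taken from the paper.
clinvar = [
    {"id": "v1", "gene": "HNF1A", "significance": "Pathogenic"},
    {"id": "v2", "gene": "ABCC8", "significance": "Uncertain significance"},
    {"id": "v3", "gene": "HNF1A", "significance": "Likely pathogenic"},
]
literature = [
    {"id": "v1", "gene": "HNF1A"},
    {"id": "v4", "gene": "BLK"},
    {"id": "v5", "gene": "ABCC8"},
    {"id": "v6", "gene": "ABCC8"},  # not present in the Ensembl mapping
]
ensembl_ids = {"v1", "v2", "v3", "v4", "v5"}

level1, level2 = build_levels(clinvar, literature, ensembl_ids)
print(sorted(level1), sorted(level2))
```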


Fig 5.

Number of variants in the genes with the largest number of variants.

Only genes with more than 20 variants in the level 1 database are shown. The full bars represent the number of variants in the level 1 database, and the brown part of each bar represents the number of variants in the level 2 database.
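The gene selection behind this kind of bar chart can be sketched as a per-gene count followed by a threshold. The function name and the toy counts below are hypothetical; only the threshold of more than 20 level 1 variants comes from the caption.

```python
from collections import Counter


def genes_above_threshold(level1_genes, level2_genes, threshold=20):
    """Return {gene: (level1_count, level2_count)} for genes with more
    than `threshold` variants in the level 1 database.

    Each argument lists one gene symbol per variant in that database.
    """
    level1_counts = Counter(level1_genes)
    level2_counts = Counter(level2_genes)
    return {
        gene: (n, level2_counts[gene])  # Counter gives 0 for missing genes
        for gene, n in level1_counts.items()
        if n > threshold
    }


# Toy counts, not taken from the paper.
level1_genes = ["HNF1A"] * 25 + ["ABCC8"] * 30 + ["BLK"] * 5
level2_genes = ["HNF1A"] * 24 + ["ABCC8"] * 10

selected = genes_above_threshold(level1_genes, level2_genes)
print(selected)
```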
