Fig 1.
Biomedical Data Commons is a knowledge graph that integrates multiple data types.
A: Workflow for cleaning, formatting, ingesting, and accessing data in the Google Biomedical Data Commons knowledge graph. B: Current state of the Google Biomedical Data Commons graph. The size of each node indicates the number of unique entities of that type in the graph. Solid edges depict explicit relationships between two entity types, with edge width corresponding to the number of unique links between the entity types. Dashed edges denote implicit relationships in the graph. C: A depiction of the subgraph of Biomedical Data Commons displaying the Gene, GeneticVariant, and GeneGeneticVariantAssociation nodes and edges in total (top panel) and the subset reported as significantly associated in Whole Blood by GTEx (bottom panel). Node size and edge width correspond to the number of unique entities of that type and the number of relationships between two entity types, respectively. This illustrates how a user can use the Data Commons API to search and retrieve information contained in a subset of the graph in which they are interested.
Fig 2.
Schematic of the SNP Prioritization Pipeline.
Genomic coordinates of genetic variants, regulatory elements, the 3D connectome, and genes are input into step 1 of the SNP Prioritization Pipeline. The output from step 1 is a set of associated genetic variant–regulatory element–gene trios. Lists of the unique genetic variants, regulatory elements, and genes that participate in these trios are also output. The gene list from step 1 of the SNP Prioritization Pipeline can be used as input for Biomedical Data Commons queries. The trios generated in step 1 are used as input into step 2 of the SNP Prioritization Pipeline, along with optional cell type-specific gene expression data. The output of step 2 consists of visualizations of gene regulatory networks and a ranked list of genetic variants and genes. Input and output data at each step of the pipeline are color coded by data type: genetic variants (green), regulatory elements (blue), genes (gold), 3D connectome (black), gene expression data (magenta), gene regulatory networks (turquoise), and ranked list of genetic variants and genes (orange). Optional input data are denoted by grey text, a grey outline of the input box, and a grey input arrow. *Denotes input data that are cell type-specific and must belong to the same cell type of interest throughout the pipeline.
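The trio construction in step 1 can be illustrated with a minimal sketch. It assumes, purely for illustration, that a trio is formed when a genetic variant lies inside a regulatory element, that element overlaps one anchor of a 3D connectome contact, and a gene overlaps the other anchor. The Interval type, the build_trios function, the half-open coordinates, and the toy records are all hypothetical and are not taken from the pipeline itself.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A named genomic interval with half-open coordinates [start, end)."""
    chrom: str
    start: int
    end: int
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True when two intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def build_trios(variants, elements, contacts, genes):
    """Return sorted (variant, regulatory element, gene) name triples.

    Assumed rule (hypothetical): the variant overlaps the element, the
    element overlaps one contact anchor, and the gene overlaps the other.
    """
    trios = set()
    for left, right in contacts:
        # Try both orientations of each contact.
        for element_anchor, gene_anchor in ((left, right), (right, left)):
            linked_genes = [g for g in genes if overlaps(g, gene_anchor)]
            if not linked_genes:
                continue
            for element in elements:
                if not overlaps(element, element_anchor):
                    continue
                for variant in variants:
                    if overlaps(variant, element):
                        for gene in linked_genes:
                            trios.add((variant.name, element.name, gene.name))
    return sorted(trios)


# Toy inputs; every identifier and coordinate below is invented.
variants = [Interval("chr1", 105, 106, "rsA"), Interval("chr1", 900, 901, "rsB")]
elements = [Interval("chr1", 100, 200, "enh1")]
contacts = [(Interval("chr1", 0, 500, "a1"), Interval("chr1", 5000, 5500, "a2"))]
genes = [Interval("chr1", 5100, 5101, "GENE1")]

trios = build_trios(variants, elements, contacts, genes)
unique_genes = sorted({gene for _, _, gene in trios})
print(trios)
print(unique_genes)
```

The nested loops keep the logic readable for a handful of records; genome-scale inputs would normally call for an interval index or a dedicated interval-overlap tool instead.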
Fig 3.
Pipeline prioritizes common non-coding genetic variants that are cell type specific.
A: Overview of the pipeline that identifies genetic variants in regulatory elements of interest distally connected to genes via the three-dimensional chromatin conformation. B: Clinical significance of pipeline output genetic variants. C: Functional category of input and output genetic variants. D: The amount of time needed to run the clinical significance, functional category, and significant gene association analyses using a local data scientist approach involving data download (blue) or Data Commons (gold). **** p < 0.0001. E: Venn diagram of the overlap in genetic variants (left panel) and genes (right panel) identified by the pipeline using H3K27ac HiChIP and ATAC-seq datasets from GM12878 cells (green) and primary Naïve T (green), Th17 (gold), and Treg (burgundy) cells as input. F: KEGG pathways associated with the target genes of pipeline genetic variants. G: Histogram of the minor allele frequency of the pipeline genetic variants. The red line at 0.02 indicates a common cutoff for uncommon genetic variants. H: Histogram of the CADD scores of the pipeline genetic variants, with the red line at a score of 15 indicating a common cutoff for deleterious variants. I: Scatter plot of the top binding motifs of pipeline-identified regulatory elements. J: The chromosomal location of input and output genetic variants. Input SNPs refers to the original candidate list of 12,974 genetic variants that were reported as significantly associated with T1D in the GWAS catalog or are in linkage disequilibrium with such variants. Output SNPs refers to the 602 genetic variants that were identified as candidates by the pipeline.
Fig 4.
HLA-DRB1 and HLA-DQB1 are top pipeline-identified candidates for T1D-associated genes in the HLA locus.
A: Visualization of the HLA component of interconnected pipeline genetic variant–regulatory element–gene trios (chr6: 30,000,000–33,220,000). B: Bipartite graph of the HLA component with genes and genetic variants as nodes and chromatin connections as edges. Node color indicates closeness centrality score, with gold marking the most connected and purple the least connected nodes in the graph. Gene nodes are labeled, and genetic variant nodes are unlabeled. C: Bipartite graph of the HLA component with genes and regulatory elements as nodes and chromatin connections as edges. Gene nodes are labeled and white. Regulatory element nodes are colored by type and labeled by the number of unique genetic variants contained in the regulatory element. The width of each edge indicates connectivity strength, measured by the number of unique HiChIP reads. D: Circos plot of the chromatin connectivity at 5 kb resolution in the HLA locus. The nodes are sections of the genome and the edges are the chromatin connectivity, with the width indicating connectivity strength. An asterisk labels the starting (chr6: 30,000,000; green) and terminating (chr6: 33,220,000; gold) nodes of the plot. GM12878 (left panel) and Treg (right panel) pipeline trio contacts are visualized.
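The closeness centrality used to color nodes in panel B can be sketched as follows. This is a minimal, standard-library illustration that assumes an unweighted, undirected gene–variant graph and a common normalization (reachable nodes divided by summed shortest-path distance, scaled by the fraction of the graph that is reachable); the exact variant used for the figure is not stated in the caption. The function name and the toy edges are hypothetical.

```python
from collections import deque


def closeness_centrality(edges):
    """Closeness centrality for each node of an undirected, unweighted graph.

    score = (r / sum_of_distances) * (r / (n - 1)), where r is the number
    of nodes reachable from the source and n is the total node count.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    n = len(adjacency)
    scores = {}
    for source in adjacency:
        # Breadth-first search gives shortest-path lengths from the source.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        reachable = len(distance) - 1
        total = sum(distance.values())
        if total == 0 or n <= 1:
            scores[source] = 0.0
        else:
            scores[source] = (reachable / total) * (reachable / (n - 1))
    return scores


# Toy bipartite edges (gene, variant); all names are invented.
edges = [("GENE_A", "v1"), ("GENE_A", "v2"), ("GENE_B", "v2")]
scores = closeness_centrality(edges)
# Rank nodes from most to least central, breaking ties by name.
ranked = sorted(scores, key=lambda name: (-scores[name], name))
print(ranked)
```

In this toy graph the nodes in the middle of the chain (GENE_A and v2) score highest, which corresponds to the "most connected" end of the color scale described in the caption.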
Fig 5.
Novel genetic variant rs14004 is a candidate for gene expression regulation of HLA-DRB1 and HLA-DQB1.
A: Visualization of the portion of the genome that interacts with HLA-DRB1 and HLA-DQB1 (chr6: 32,100,000–33,100,000). ATAC-seq and CTCF, RAD21, STAT5, and TCF3 ChIP-seq raw read visualization (top panel). Cohesin (black) and H3K27ac (blue) HiChIP raw read virtual 4C plots centered on the NOTCH4 3’ UTR, rs14004, HLA-DRB1 TSS, HLA-DQB1 TSS, and rs9986640 (bottom panel). B: Schematic of the chromatin connectivity between the genetic variants and the genes as represented by the raw data for chr6: 32,100,000–33,100,000. C: Major (green) and minor (blue) allele frequencies from the 1000 Genomes Project for rs14004 and rs9986640. D: Primary whole blood RPKM values for NOTCH4, HLA-DRB1, and HLA-DQB1 in healthy individuals and Type 1 Diabetes patients.
Fig 6.
IL2RA is a top pipeline-identified candidate.
A: Visualization of the IL2RA component of interconnected pipeline genetic variant–regulatory element–gene trios (chr10: 5,765,000–6,355,000). B: Bipartite graph of the IL2RA component with genes and genetic variants as nodes and chromatin connections as edges. Node color indicates closeness centrality score, with gold marking the most connected and purple the least connected nodes in the graph. Gene nodes are labeled, and genetic variant nodes are unlabeled. C: Bipartite graph of the IL2RA component with genes and regulatory elements as nodes and chromatin connections as edges. Gene nodes are labeled and white. Regulatory element nodes are colored by type and labeled by the number of unique genetic variants contained in the regulatory element. The width of each edge indicates connectivity strength, measured by the number of unique HiChIP reads. D: Circos plot of the chromatin connectivity at 5 kb resolution in the IL2RA locus. The nodes are sections of the genome and the edges are the chromatin connectivity, with the width indicating connectivity strength. An asterisk labels the starting (chr10: 5,765,000; green) and terminating (chr10: 6,355,000; gold) nodes of the plot. GM12878 (left panel) and Treg (right panel) pipeline trio contacts are visualized.
Fig 7.
rs61839660 identified as a candidate for gene expression regulation of IL2RA and IL15RA.
A: Visualization of the portion of the IL2RA gene regulatory network (chr10: 5,765,000–6,355,000). ATAC-seq and BCL11A, IKZF1, and STAT5 ChIP-seq raw read visualization (top panel). Cohesin (black) and H3K27ac (blue) HiChIP raw read virtual 4C plots centered on the IL15RA TSS, IL2RA 3’ UTR, IL2RA TSS, rs61839660, and rs198390 (bottom panel). B: Schematic of the chromatin connectivity between the genetic variants and the genes as represented by the raw data for chr10: 5,765,000–6,355,000. C: Major (green) and minor (blue) allele frequencies from the 1000 Genomes Project for rs61839660 and rs1983900.