Abstract
Graphia is an open-source platform created for the graph-based analysis of the huge amounts of quantitative and qualitative data currently being generated from the study of genomes, genes, proteins, metabolites and cells. Core to Graphia’s functionality is support for the calculation of correlation matrices from any tabular matrix of continuous or discrete values, whereupon the software is designed to rapidly visualise, in 2D or 3D space, the often very large graphs that result. Following graph construction, an extensive range of measurement algorithms, routines for graph transformation, and options for the visualisation of node and edge attributes are available for graph exploration and analysis. Combined, these provide a powerful solution for the interpretation of high-dimensional data from many sources, or data already in the form of a network or equivalent adjacency matrix. Several use cases of Graphia are described, to showcase its wide range of applications in the analysis of biological data. Graphia runs on all major desktop operating systems, is extensible through the deployment of plugins and is freely available to download from https://graphia.app/.
Author summary
Graphia is a new visual analytics platform specifically created for the network-based analysis of large and complex data, such as that generated in huge amounts by modern biological analyses. It works in a data-agnostic, hypothesis-free manner to generate correlation networks from any table of numerical or discrete values, thereafter providing a means to rapidly visualise the often very large networks that result, in either 2D or 3D space. Following network construction, the tool offers an extensive range of analysis algorithms, routines for network transformation, and options for the visualisation of metadata. This provides a powerful analysis solution for the exploration and interpretation of high-dimensional data from any source, as well as any data already defined as a network. Several use cases of Graphia are described to showcase its wide range of applications in the analysis of biological data. Graphia is open source and free to all.
Citation: Freeman TC, Horsewell S, Patir A, Harling-Lee J, Regan T, Shih BB, et al. (2022) Graphia: A platform for the graph-based visualisation and analysis of high dimensional data. PLoS Comput Biol 18(7): e1010310. https://doi.org/10.1371/journal.pcbi.1010310
Editor: Eli Zunder, University of Virginia, UNITED STATES
Received: July 16, 2021; Accepted: June 16, 2022; Published: July 25, 2022
Copyright: © 2022 Freeman et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Graphia is an open-source tool and as such all code is freely available. See: https://github.com/graphia-app/graphia.
Funding: Graphia was originally designed and built by Kajeka Ltd., a University of Edinburgh spin-out company (2015-2020). We would like to acknowledge all those who supported this venture, in particular grant funding from Scottish Enterprise (SMART/14/034 / 14/9168). Although no specific academic grant funding was received for this project, during a period of Graphia’s development by Kajeka, TCF and JP were employed by the University and supported by the Roslin Institute’s Strategic Grant from the UK’s Biotechnology and Biological Sciences Research Council (BBSRC) [BBS/E/D/10002071]. More recently Janssen Immunology have funded further development of the tool (core funding). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The study of interactions between entities is a cornerstone of modern analytics. In biology, efforts to map the ‘interactome’ (all the interactions between the components of a biological system) have been underway for some time, using a number of complementary approaches [1,2]. Networks of biological data may also be used to chart diverse phenomena such as the spread of disease, the interactions between drugs and their targets, and the evolutionary relationships between species. Many data from other sectors are also inherently graph-based in structure. Examples include interactions on social media platforms, customer/client relationships, communication and transport systems, computer networks and many other real-world systems. Matrices of numerical data that do not inherently possess a network structure can also be analysed using graph-based approaches. Wherever it is possible to calculate the distance between entities, a graph can be constructed in which the entities are represented by nodes and high-confidence measures define the edges between them. In biology, such an approach is already widely used to analyse high-dimensional data, in particular to construct and analyse gene coexpression networks [3,4], but the approach is applicable to any numerical or categorical data from any source.
Given the explosion in the availability of data in recent years and the potential to visualise and analyse it using graph-based approaches, a variety of software tools to support these activities have been developed. In biology, Cytoscape [5] is perhaps the most widely used software for performing graph analytics. It has a large user base and supports many ‘apps’ (plugins) created by the community for the performance of specific graph-based analysis tasks. Other network visualisation and analysis tools include Gephi [6], Tulip [7], Bandage [8], Graphviz [9], Pajek [10], yEd (yFiles, Tübingen, Germany), BioLayout [4,11], Social Network Visualiser [12] and NodeXL [13]. We have compared a number of the most commonly used network analysis tools in Table 1 and, in the case of Gephi and Cytoscape, compared some of the key aspects of graph visualisation (S1 Fig). There is also a range of web-based software tools exclusively designed to visualise portions of data, often from a designated database, such as String [14], GeneMania [15] and Neo4J Bloom [16]. Some of these tools are focused on supporting a particular community, whilst others possess functionality tailored towards specific tasks or data types, and they include a mix of open-source projects and commercial tools. Others provide open-source code repositories for graph visualisation and analysis algorithms [17], or share repositories of graph data [18–20]. For a more comprehensive review of network analysis tools and resources, see [21].
Despite the availability of a wide range of downloadable applications, web-resources and code libraries to support graph-based analyses, there is a pressing need for easy-to-use software that supports the rapid visualisation and analysis of relatively massive networks. To address this need, we developed Graphia—a general purpose graph analysis tool that supports the integration, visualisation, analysis, and interpretation of a wide variety of data types. Here we provide an overview of Graphia’s core functionality for the analysis of graphs and describe a number of case studies in which it is applied to solve problems associated with the analysis of data derived from the biological sciences.
Methods
Design criteria
The following features were considered core to the design of Graphia:
- Data and operating system agnostic. Import data from any source saved in standard file formats. The software should run on all major desktop operating systems and modern hardware configurations.
- Fast and scalable. Support the rapid loading of data, fast computation of graph layout and analysis algorithms, and high-quality data visualisations. Deliver smooth and responsive graphical rendering of millions of data points (nodes/edges) on standard desktop hardware.
- Dynamic rendering. Visualise in real time changes to the graph structure associated with alterations in input parameters or additional data.
- 3D graph visualisation. Provide a navigable and immersive environment in which to explore and interpret large and complex graph topologies.
- Correlation graphs as an essential function. Rapidly convert any numerical or categorical data table into a correlation graph, supporting pattern finding and data mining.
- Attribute handling and visualisation. Visualise attributes (metadata) associated with nodes and edges using colour, size and text to distinguish between attribute values.
- Advanced analysis capabilities. Support a wide range of analytical algorithms and approaches that empower a user to explore, query and interpret data, such as the k-NN algorithm [22] for edge pruning, and the MCL [23] and Louvain [24] graph clustering algorithms.
- Extensible. Provide extensible architecture through use of a plugin system to allow the core to be extended or adapted for specific application areas or data types.
- User Interface. Provide a simple and intuitive user interface (UI) that is easy to navigate, featuring a graph display area supplemented with a table listing selected nodes and associated attributes and data values. The UI should provide easy access to menus providing functionality and display active transformations and visualisations (Fig 1).
(A) Graph display area, showing correlation graph with a cluster selected (unselected nodes faded). (A1) Display context menu options (right click). (B) Node (row) attribute display area. (B1) Table of selected nodes and their attributes, imported and calculated within the tool. (B2) Data plot area, in the case shown here a mean histogram of selected node values. (B3) Visualisation of column annotations. (B4) Data plot context menu options for changing plot (right click). (1a) Add Transform button, (1b) Active transforms, (2a) Add visualisation button, (2b) Active visualisations, (3) General toolbar, (4) Attribute parameter selection, (5) Display of graph metrics (number of nodes, edges, components), (6) Plot/table function toolbar.
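The correlation graph criterion listed above can be illustrated with a minimal sketch. It is not Graphia's implementation (which is written in C++ and parallelised); it only shows the general idea of turning a table into a graph by computing the Pearson correlation between every pair of rows and keeping pairs at or above a threshold. The function names, the toy table and the threshold value are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant row has no defined correlation; treat it as uncorrelated.
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0


def correlation_graph(rows, threshold):
    """Return {(name_a, name_b): r} for every pair of rows with r >= threshold."""
    edges = {}
    for (name_a, a), (name_b, b) in combinations(rows.items(), 2):
        r = pearson(a, b)
        if r >= threshold:
            edges[(name_a, name_b)] = r
    return edges


# Hypothetical toy table: rows are entities (e.g. genes), columns are samples.
table = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
edges = correlation_graph(table, threshold=0.85)
```

In this toy table only geneA and geneB rise together across the samples, so they are the only pair joined by an edge. The all-pairs loop is quadratic in the number of rows, which is why a production tool would parallelise this step.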
Code architecture
Graphia is written in C++17 and is built upon Qt version 5, the cross-platform widget toolkit. For graphics, the industry standard OpenGL is used. The minimum driver support required is version 3.3 core profile, but more modern extensions will be used if they are available. Various open source libraries are employed, mostly for loading external data formats. These libraries and their associated licenses are enumerated in the About dialog of the application, accessible from the Help menu.
Graphia is architected so that data loading and data-type-specific user interfaces are confined to plugins. These modules are independent of the core application and can be added or removed without affecting any base functionality.
At the highest level the code is organised hierarchically into four separate directories:
- app—the core application code
- plugins—the existing bundled plugins
- shared—code used by both the core and plugins (this includes interface headers)
- thirdparty—any library code not authored locally
These are further divided into subdirectories dealing with specific areas of functionality. Graphia has been developed using standard object-orientated best practices. Continuous integration is employed to prevent portability build regressions, using recent versions of the compilers GCC, clang and MSVC. In addition, static analysis tools such as clang-tidy and cppcheck are used to identify potential problems early. CMake is used as a build system, and is set up for Linux, Windows and macOS compilation.
Implementation
Data import.
Graphia has been designed to import data encoded in a variety of standard and non-standard file formats. These include standard graph-based file formats such as BioPAX OWL ontology (.owl), JSON graph (.json), GraphML (.graphml), Graph Modelling Language (.gml), Cytoscape Exchange Files (.cx/.cx2) and MATLAB data file (.mat) formats, but also unstandardised formats such as edge lists, adjacency matrices, and tabular data prepared for correlation analyses. Using these file formats, a wide variety of data may be imported into Graphia, not only to define the nodes and edges of a graph, but also to supply user-defined attributes or metadata.
Fast and scalable.
Existing graph visualisation tools either fail to render very large graphs effectively, or the ability to interact with a graph once rendered is limited and slow. Therefore, all aspects of Graphia’s functionality have been engineered to run quickly. Graphia can render graphs of millions of data points on relatively commonplace hardware, and interaction with them is fast and fluid. This has been achieved through the use of optimal coding practices and parallelisation of computationally intensive analysis routines, e.g. calculation of correlation matrices, graph layout and clustering.
Dynamic graph layout and rendering.
Graph layout is an iterative process. Many programs only display the results of a layout algorithm after it has run a defined number of iterations. With Graphia, the layout is shown live, such that graphs ‘unfold’ in real time. However, the true power of dynamic graphs is realised when a transformation operation is performed or following the addition of new data. These changes are immediately reflected in the appearance of the graph. As graphs change dynamically, there is a need to identify the graph components and map how they interact when construction parameters are adjusted. The ability to quickly identify and move between components is a unique feature of Graphia. Components are rendered in a concentric pattern, arranged large-to-small. Smaller components may be filtered away using a transform. Most existing network analysis tools render graphs in 2D. The layout algorithm implemented in Graphia is innovative in that it applies current force-directed layout techniques, but in a dynamic fashion. Graphia renders graphs in 3D or 2D, making use of modern graphics hardware to display extremely large graphs efficiently with options for node/edge shading, relative node sizing and spacing (Fig 2A–2F). When graphs are relatively small or there is a need to share images by a conventional medium, i.e. a document, 2D graph visualisations have advantages. However, 2D visualisations are limiting when there is a need to display and interact with large graphs with complex topologies.
(A) 3D perspective view, smooth shading (the default), with visualisation of node categorical attribute (MCL cluster). (B) 3D orthographic view, flat shading (no perception of distance; all nodes same size, unless sized by attribute value). (C) 3D perspective view, smooth shading. (D) 2D view, smooth shading. (E) 2D view, flat shading. (F) Compressed 2D layout, flat shading, showing node overlap view. Visualisation of (G) Betweenness centrality values, (H) Eccentricity values, (I) PageRank values. G–I are continuous (numerical) attributes, so a colour spectrum and size gradient is used for node display (2D, smooth shading). Betweenness and eccentricity are calculated for both nodes and edges, therefore visual encoding is applied to both.
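The idea of showing a layout live, rather than only after a fixed number of iterations, can be sketched with a generator that yields node positions after every step. This is a deliberately simple spring-and-repulsion scheme in 3D and is not the layout algorithm used by Graphia; the function name `live_layout` and the parameters `step`, `max_move` and `seed` are assumptions introduced for this example.

```python
import math
import random


def live_layout(nodes, edges, iterations=100, step=0.05, max_move=0.1, seed=0):
    """Yield 3D node positions after every iteration of a simple spring layout."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0) for _ in range(3)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0, 0.0] for n in nodes}
        # Inverse-square repulsion between every pair (O(n^2), fine for a sketch).
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                delta = [pa - pb for pa, pb in zip(pos[a], pos[b])]
                dist = math.sqrt(sum(d * d for d in delta)) or 1e-9
                for k in range(3):
                    push = delta[k] / dist ** 3
                    force[a][k] += push
                    force[b][k] -= push
        # Linear spring attraction pulls connected nodes together.
        for a, b in edges:
            for k in range(3):
                pull = pos[b][k] - pos[a][k]
                force[a][k] += pull
                force[b][k] -= pull
        # Move each node along its net force, capping the distance per step.
        for n in nodes:
            magnitude = math.sqrt(sum(f * f for f in force[n])) or 1e-9
            scale = min(step, max_move / magnitude)
            for k in range(3):
                pos[n][k] += scale * force[n][k]
        yield {n: tuple(p) for n, p in pos.items()}


# Each yielded frame could be handed to a renderer, so the graph 'unfolds' live.
frames = list(live_layout(["a", "b", "c"], [("a", "b"), ("b", "c")], iterations=50))
```

Because the generator yields after every iteration, a caller can redraw between steps, or change `nodes` and `edges` between runs and restart from the last frame. A tool designed for very large graphs would replace the all-pairs repulsion with a spatial approximation.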
Attribute-to-visual mapping.
Attributes are data values associated with nodes/edges. These can be user-defined or calculated by Graphia. For example, a node representing a person may be associated with knowledge of their gender, occupation, socioeconomic class, ethnicity, etc. (categorical attributes), as well as their height, age, weight, years in employment (numerical attributes). Colour can be used to represent categorical attributes, with nodes sharing the same attribute being assigned the same colour. In the case of numerical attributes, colour and size can be used to represent the value according to a spectrum, e.g. from small white nodes to big red nodes to represent low and high values, respectively. Both types of attribute may also be calculated from the graph itself, e.g. the assignment of nodes to clusters or calculation of node degree, PageRank values etc. (Fig 2G–2I). Visualising attribute values may help explain graph structure, for instance an area of a graph might be visibly associated with nodes of a given attribute. Attributes may also be used to analyse the statistical associations with graph topology, for example, graph clusters may be analysed for the enrichment of nodes with a given attribute. Attribute values may also be specific to only a single element, e.g. a unique node name.
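The two mappings described above (one colour per category, and a colour and size ramp for numerical values) can be sketched as follows. The white-to-red ramp follows the example in the text; the function names, the size range and the four-colour palette are assumptions for illustration and do not reflect Graphia's actual defaults.

```python
def numeric_to_visual(value, lo, hi, min_size=1.0, max_size=4.0):
    """Map a numerical attribute to (hex colour, node size) on a white-to-red ramp."""
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))  # clamp values that fall outside [lo, hi]
    fade = round(255 * (1.0 - t))  # green and blue fade out as the value rises
    colour = f"#ff{fade:02x}{fade:02x}"
    size = min_size + t * (max_size - min_size)
    return colour, size


def categorical_to_colour(values, palette=("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728")):
    """Assign one palette colour per distinct category, cycling if needed."""
    categories = sorted(set(values))
    return {c: palette[i % len(palette)] for i, c in enumerate(categories)}


low = numeric_to_visual(0.0, 0.0, 10.0)    # small white node
high = numeric_to_visual(10.0, 0.0, 10.0)  # big red node
by_cluster = categorical_to_colour(["cluster2", "cluster1", "cluster2"])
```

Nodes that share a category receive the same colour, while numerical values are clamped to the chosen range before being mapped, so a single extreme value does not distort the ramp.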
Results
Described below are a number of use cases for Graphia in the context of biological data.
Case Study 1: Visualisation of phylogenetic trees
Hierarchical data structures are often represented by tree graphs and used in biology to represent relationships between species, strains, samples or genes. While trees are an intuitive way of visualising such relationships, when the number of branches on the tree becomes large, displaying such graphs at a local or global level is challenging. Here we show two examples of taxonomic trees visualised by Graphia representing the different levels of phylogeny, from a central node representing the class of organisms, up through branches representing the order, family and genus, with species and subspecies being the leaves of the tree. The examples described are taxonomic trees for all mammals and insects (Fig 3), as defined by the NCBI Taxonomy database [25]. The taxonomic tree of mammals consists of 9,843 nodes and 9,862 edges and is shown in 2D with nodes coloured by type, i.e. what level of the taxonomic tree they represent (Fig 3A). A small section associated with apes is highlighted (Fig 3B). When the graph is loaded using the WebSearch plugin, selection of a single node automatically initiates a search for the name of the selected node (Fig 3C). Shown in Fig 3D is a taxonomic tree of all insect species, consisting of 275,328 nodes and 275,528 edges displayed in 2D. Graphia provides a unique environment to search, cluster and explore such data, and the third dimension greatly assists in visualising the structure of such large graphs.
(A) A taxonomic tree of all mammals was downloaded from the NCBI’s Taxonomy database, with nodes coloured according to type, i.e. subspecies (blue), species (pink), genus (orange), etc. The graph comprises 9,843 nodes and 9,862 edges and is shown with a 2D layout. (B) Zoomed-in view of the area in the square shown in A, with a single node selected (Western gorilla). (C) Right-clicking on a node provides the ability to search the web for the node identifier either via Google or through a predefined database selected in the Network tab of the Options dialogue. (D) Taxonomic tree of all insects from the NCBI’s Taxonomy database, nodes coloured by Louvain cluster. The graph consists of 275,328 nodes and 275,528 edges and in this respect represents a large graph where visualisation in 2D is challenging.
Case Study 2: Analysis of single cell transcriptomics data
Single cell RNA sequencing (scRNA-Seq) generates gene expression profiles for thousands of individual cells in a single assay. The approach is an unbiased way of identifying cell types in a mixed population of cells, both known and uncharacterised, in addition to the genes that define them. This has seen a surge of interest in dimensionality reduction methods, in particular t-SNE [26] and UMAP [27], as users seek to optimise the visualisation of results (Fig 4A and 4B). These methods are constrained by issues associated with representing the underlying data structure, i.e. the relationships between cell groupings. When tens of thousands of cells are analysed, a 2D plot space is limiting.
The structure of scRNA-Seq data is commonly represented using approaches such as (A) t-SNE and (B) UMAP as shown here for immune cells derived from the Tabula Muris dataset. However, the distance between data points and groups of data points is difficult to interpret. (C) Graphia enables the construction of cell-to-cell networks built on a similarity parameter. Here, the 48 most significant PCA values for each cell were first calculated and this PCA profile used to construct a correlation network. The plot at the bottom left of C shows the PCA profiles of cells in the two largest cell clusters. To better show graph structure, a k-NN (k = 10) transformation was applied and outlier cells removed (r < 0.85 and node degree < 10, nodes coloured white). The graph comprises 12,498 nodes (cells) and 143k edges. Cell clusters have been annotated as the cell types defined by the authors. (D) Shows a gene correlation network generated from these data by first calculating the average expression of genes within cell clusters and then calculating a correlation matrix from these values. (E) Plots show the average expression profile (y-axis) of a selection of gene-clusters across the aggregated cell-clusters (x-axis). The label gives the cluster number, e.g., C1, the number of genes within the cluster (966) and the association of the genes with a given biology or cell type.
An alternative approach is to treat single cell data as a graph, where nodes represent cells or genes, and edges the similarity between them. There are a number of measures that may be used to calculate the distance between cells or genes and here we discuss our currently favoured approach. The Tabula Muris dataset [28] includes scRNA-Seq data from 20 different mouse tissues. For the purpose of this study we selected only data from tissue immune cells, as annotated by the authors. Preprocessing and quality control were performed as per the Tabula Muris pipeline (https://github.com/czbiohub/tabula-muris), producing a normalized dataset of 14,466 cells from 12 tissues. Principal component analysis (PCA) was conducted to reduce the gene profile of a cell to principal components (PCs), with the 48 most significant PCs being considered (adj. P-value < 0.05), based on Jackstraw permutations [29]. This file (cells as rows, PCs as columns) was loaded into Graphia and a graph was generated based on the Pearson correlation coefficient between the PC profiles of cells. An initial network was generated applying the k-NN algorithm (k = 15), and outlier cells were removed. Outliers were defined as cells with poor correlations and connectivity with other cells of the same cell type (r < 0.85 and node degree < 10). Applying these filters produced a network that better separated known cell types. Subsequently, the filtered network of 12,498 cells was clustered using the Louvain clustering algorithm [24] with a granularity of 0.8, identifying 36 cell clusters (Fig 4C).
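The k-NN edge pruning step used above can be sketched on a weighted edge list. The text does not spell out the exact retention rule, so this example assumes that an edge is kept when it ranks among the k strongest edges of at least one of its endpoints; the function name `knn_prune` and the toy cell identifiers are hypothetical.

```python
from collections import defaultdict


def knn_prune(edges, k):
    """Keep an edge if it is among the k strongest edges of either endpoint."""
    by_node = defaultdict(list)
    for (a, b), weight in edges.items():
        by_node[a].append((weight, (a, b)))
        by_node[b].append((weight, (a, b)))
    kept = set()
    for incident in by_node.values():
        incident.sort(reverse=True)  # strongest correlation first
        kept.update(edge for _, edge in incident[:k])
    return {edge: edges[edge] for edge in edges if edge in kept}


# Hypothetical weighted edges (Pearson r between the PC profiles of cells).
edges = {
    ("c1", "c2"): 0.95,
    ("c1", "c3"): 0.90,
    ("c1", "c4"): 0.86,
    ("c2", "c3"): 0.97,
    ("c3", "c4"): 0.88,
}
pruned = knn_prune(edges, k=1)
```

With k = 1 each cell keeps only its single strongest relationship, so the five input edges reduce to three. Under this retention rule a node can still end up with more than k edges, because an edge that is weak for one endpoint may be among the strongest for the other.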
Following the identification of cell clusters, it is generally of interest to identify gene markers and expression modules associated with different cell types. Due to the inherent noise in scRNA-Seq data, it is of limited use to construct gene networks based on the correlation between expression values of individual cells. One alternative is to construct gene correlation networks based on an aggregated gene expression value across cells from each cluster, i.e. the similarity between clusters of cells instead of individual cells. Accordingly, a matrix of the average gene expression values across the clusters defined above was loaded into Graphia. A network of strongly correlated genes (r > 0.85) was generated and gene coexpression modules identified using the Markov clustering (MCL) algorithm [23] with a granularity setting of 1.7 (Fig 4D). Graphia enables a dynamic and rapid exploration of these clusters, allowing a user to understand where a given cluster sits within the context of the entire graph, the identity of genes present within a given cluster and the profile of all or some of those genes across samples, in this case cell clusters (Fig 4E).
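The aggregation step described above, in which per-cell expression values are replaced by their mean within each cell cluster before correlations are calculated, can be sketched as follows. The nested-dictionary data layout, the function name and the toy values are assumptions for illustration only.

```python
from collections import defaultdict


def average_by_cluster(expression, cell_cluster):
    """Collapse a gene x cell table to a gene x cluster table of mean values.

    expression: {gene: {cell: value}}; cell_cluster: {cell: cluster label}.
    """
    averaged = {}
    for gene, per_cell in expression.items():
        totals, counts = defaultdict(float), defaultdict(int)
        for cell, value in per_cell.items():
            cluster = cell_cluster[cell]
            totals[cluster] += value
            counts[cluster] += 1
        averaged[gene] = {c: totals[c] / counts[c] for c in sorted(totals)}
    return averaged


# Hypothetical toy data: three cells assigned to two clusters.
expression = {
    "gene1": {"cell1": 2.0, "cell2": 4.0, "cell3": 10.0},
    "gene2": {"cell1": 0.0, "cell2": 0.0, "cell3": 6.0},
}
cell_cluster = {"cell1": "A", "cell2": "A", "cell3": "B"}
profile = average_by_cluster(expression, cell_cluster)
```

Each gene is now described by one value per cluster rather than one per cell, which damps cell-level noise before the gene-to-gene correlation matrix is computed.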
Case Study 3: Exploration of bacterial pangenome structure
Whole genome sequencing is now routine in many fields. One common use is the characterisation of microbial species, and public databases already hold tens of thousands of genome sequences for the best studied organisms. Graph-based methods and tools that support the visualisation and analysis of such data are well established. For instance, Bandage is a software package now widely used to visualise de novo assembly graphs of bacterial genomes [8]. PPanGGOLiN [30] and Panaroo [31] are graph-based pangenome clustering tools for the analysis of genomic diversity within a bacterial species (i.e. its pangenome), which can then be used to statistically classify genes according to their occurrence in the genomes.
Comparative analyses of bacterial sequences have revealed a high degree of genetic diversity between isolates of the same organism, leading to the concept of “core” genes present in all isolates and “accessory” genes present only in some isolates. The distribution and organisation of accessory genes has a significant impact on an organism’s ability to adapt to different hosts and niches, and on its virulence and response to drugs. Fig 5 shows a graph generated from a previously published dataset of whole genome sequencing data of 778 Staphylococcus aureus isolates [32]. Genomes were annotated using Prokka (v1.13) [33] and their pangenome defined using the analysis pipeline PIRATE (v1.0.3, default settings used, 95% sequence similarity threshold) [34]. In this visualisation, nodes represent individual genes/gene variants and edges the syntenic relationships between them. When the visualisation is enhanced by making node and edge size and colour proportional to the number of genomes in which a gene is present, or in which two genes are syntenic, core regions of the genome become easy to identify (Fig 5A). These can also be collapsed to single nodes to simplify the graph. Similarly, areas of high variability within the pangenome are obvious (Fig 5B). Graphia can be used to identify specific nodes, for example the path of a single genome (RF122) can be shown in the context of the wider pangenome (Fig 5C), or to explore smaller local variations, e.g. the quorum sensing locus agrABCD, which shows four variants when defined at this similarity threshold (Fig 5D and 5E). In principle, displaying variation between sequences as a network is applicable to any such data. This has relevance to “pan-reference” genomes for more complex species such as humans, as well as for showing other variation, such as clustering repeat regions [35] or alternative isoforms of transcripts [36].
(A) The full pangenome of 778 isolates. Nodes represent individual orthologous genes as identified by PIRATE. Node size is determined by the number of genomes in which a gene has been identified. Edges denote where two genes are syntenic, and their thickness is determined by the number of times this syntenic connection is observed across isolates. Syntenic stretches of core genes have been collapsed for clarity using the “Contract Edges” transform, and low confidence nodes and edges (n < 3) have been removed. Coloured by Weighted Louvain Clustering (granularity = 1.0). (B) Highly variable region (boxed area in A) with a high density of “phage-like” genes. Nodes and edges are sized and coloured by frequency. (C) Genes highlighted are all found in single S. aureus isolate, RF122. (D) the agrABCD locus coloured by gene-association clustering. Frequently, an alternative allele is not identified as being the same gene, but their position is strongly indicative of shared function. (E) the same locus coloured by gene identity.
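The structure of such a pangenome graph, with nodes weighted by the number of genomes containing a gene and edges weighted by how often two genes are adjacent, can be sketched from ordered gene lists. This example does not parse PIRATE output; the input layout, the function name `synteny_graph`, the `min_count` filter and the toy isolates are assumptions, and each genome is treated as a linear rather than circular sequence for simplicity.

```python
from collections import Counter


def synteny_graph(genomes, min_count=1):
    """Count gene presence (nodes) and adjacent gene pairs (edges) across genomes."""
    node_count, edge_count = Counter(), Counter()
    for gene_order in genomes.values():
        node_count.update(set(gene_order))
        # Undirected adjacency: sort each pair so (a, b) and (b, a) coincide.
        pairs = {tuple(sorted(p)) for p in zip(gene_order, gene_order[1:])}
        edge_count.update(pairs)
    nodes = {g: n for g, n in node_count.items() if n >= min_count}
    edges = {e: n for e, n in edge_count.items()
             if n >= min_count and e[0] in nodes and e[1] in nodes}
    return nodes, edges


# Hypothetical isolates: two share a core path, one carries an inserted gene.
genomes = {
    "iso1": ["dnaA", "dnaN", "recF", "gyrB"],
    "iso2": ["dnaA", "dnaN", "phage1", "gyrB"],
    "iso3": ["dnaA", "dnaN", "recF", "gyrB"],
}
nodes, edges = synteny_graph(genomes, min_count=2)
```

The counts returned for nodes and edges are the values that would be mapped to size and colour in the visualisation, and raising `min_count` mimics the removal of low-confidence nodes and edges described in the figure legend.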
Case Study 4: Analysis of human genome variation
Genome variation also occurs at the level of individual DNA base pairs, so-called single nucleotide variants (SNVs). Genotypes can be scored in an individual in terms of their allele dosage, i.e. both alleles being the same as the reference nucleotide (0), heterozygous (1) or homozygous for the variant (2). Calculation of the correlation across a range of these positions in a population of individuals creates a relationship matrix.
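The allele dosage scoring described above amounts to counting the non-reference alleles in a diploid genotype. A minimal sketch follows, assuming genotypes are written as two alleles separated by "/" or "|"; the function name and this string format are assumptions made for illustration.

```python
def allele_dosage(genotype, ref):
    """Count non-reference alleles in a diploid genotype such as 'A/G' or 'A|G'."""
    alleles = genotype.replace("|", "/").split("/")
    if len(alleles) != 2:
        raise ValueError(f"expected a diploid genotype, got {genotype!r}")
    return sum(allele != ref for allele in alleles)


# One individual scored at three hypothetical SNV positions (reference allele 'A').
dosages = [allele_dosage(g, "A") for g in ("A/A", "A/G", "G|G")]
```

Scoring every individual at every SNV in this way yields a numerical table of 0, 1 and 2 values, from which correlations between individuals (rows) or, after transposition, between SNVs can be calculated.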
Fig 6 shows various views of data from the 1000 Genomes Project [37]. Here 23,675 SNVs from chromosome 22 were used to generate graphs of the relationships between individuals in this cohort and the SNVs themselves. These data represent the genomes of 2,504 individuals, who were selected to represent 26 distinct ethnic populations from five continents. The average correlation between the SNV profiles of individuals is low, and the graph shown in Fig 6A is constructed using a threshold of r = 0.238, a value at which most of the genomes in the cohort form one connected component. The k-NN algorithm [22] was applied to reduce the number of edges (from 400.4k to 5,996, opening up the local structure) such that only the strongest three relationships between individuals were maintained. The topology of the graph is clearly strongly influenced by the ethnicity of individuals, with discrete clusters being observed for all five continental populations and, in some cases, individuals from certain countries or ethnicities showing a local grouping within this overall structure. Also visible from the graph are a number of closely related individuals (Fig 6Ai) and a number of instances where an individual does not co-occur with their annotated population, for example there are a number of South Americans in with the Africans, and vice versa (Fig 6B). Transposing the matrix to analyse the similarity between the profiles of SNVs across the 2,504 individuals, at the threshold used here (r = 0.75), 11,600 SNVs formed 2,467 separate graph components of more than one node (Fig 6C). After clustering the graph using the Louvain algorithm [24], many of the clusters contain nearby SNVs, likely haplotype blocks, some of which were clearly associated with a given population. Compared with, for example, PCA plots, graph analysis represents an improved approach to visualising genetic associations between individuals and genetic variants.
(A) The graph shown was constructed from data from the 1000 Genomes Project based on the correlation (r threshold = 0.238) between the allele dosages at 23,675 SNVs from chromosome 22. Nodes represent the 2,504 individuals included in the study and edges the three most significant correlations with their neighbours (k-NN was applied where k = 3). In most cases, individuals group with others from the same continent, although there are instances where this does not appear to be the case. Visualisation of edge weights (Ai) also highlights cases where individuals would appear to be closely related. (B) Colouring of nodes by the attribute ‘population’ provides a higher resolution to the graph, and populations showing a high degree of homogeneity have been labelled. (C) Transposing the data upon import demonstrates SNVs whose pattern across the genome covaries. Clustering of these data shows many to represent haplotype blocks, and inspection of their profile across genomes demonstrates some SNV clusters to be associated with a given ethnic grouping, e.g. cluster 3 (Africans) and cluster 14 (East Asia), whilst others show little obvious association with ethnicity, e.g. cluster 6. Plots show the average score of SNVs within a cluster (y-axis, 0, 1, 2), across the 2,504 individuals ordered by continent and then population (x-axis).
Discussion
Data-driven research is now a foundation of modern biomedical and agricultural sciences, due to continued growth in the size and complexity of biological datasets. Network analysis provides a flexible toolbox combining visualisation with the algorithmic analysis of data structure, for testing a broad range of hypotheses and for hypothesis-free data exploration. Graphia is designed for the visualisation and analysis of large graphs. Originally, our interest in graph-based analysis was driven by our desire to analyse large correlation networks of transcriptomics data. The results of weighted gene coexpression network analysis (WGCNA) [3] are generally visualised as a tree diagram or heat-map. The precursor of Graphia, BioLayout Express3D [4,11], was developed specifically for the generation and display of networks of transcriptomic data and for pathway modelling [4,38,39]. BioLayout has been used in the analysis of many large transcriptomic datasets from multiple species [40–44]. It has also been applied to types of data that were not envisaged at the time, for example the relationship between symptoms of altitude sickness [45], the honey bee microbiome [46], comparing morphometric measurements of dog brains [47] and even naming patterns in historical birth records [48]. The addition of new functionality to BioLayout was however constrained by inherent limitations in the code structure and programming language (Java).
Graphia is an entirely new analytical platform developed using a modern UI framework (Qt) and programming language (C++). The correlation plugin reproduces and improves upon the functionality of BioLayout for the analysis of any high-dimensional numerical matrix. Data visualisation is core to the functionality of Graphia. Good visualisations make it easier for a user to recognise patterns, trends, and outlier groups within data. The next step in an analysis is determined by insights gained from interaction with the visualisation, whether that be the discovery of errors in the input data, technical artefacts, or new and interesting findings. Graphia is designed to make best use of the latest accelerated graphics hardware, so that graph visualisations scale while remaining responsive in real time. By default, graphs are rendered in 3D, where the visualisation and navigation of complex graph topologies is much enhanced; the additional dimension makes it possible to distinguish the distance between nodes that might appear closely connected in 2D. Another core aspect of the visualisation of data is the concept of graphs being ‘dynamic’, changing in real time as nodes and edges are added or removed. To achieve this, the layout algorithm runs continuously, unless manually paused. Dynamic transitions may become challenging when graph structure alters dramatically following a transformation, such as when a hub node is deleted from a tree graph or one graph component fragments into many. If such a transformation is executed quickly, a user’s ‘mental map’ can be lost [49]. For this reason, Graphia includes the option to slow down the transition between one state and the next, and in addition orientates components ‘in flight’ prior to their reconnection. Indeed, the way in which Graphia handles graph components dynamically is unique.
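The idea of a layout that runs continuously while the graph changes can be illustrated with a generic force-directed sketch in Python. It is not Graphia's layout algorithm: the function `layout_step`, the helper `distance` and the parameters `repulsion`, `spring` and `rest` are hypothetical choices made purely for illustration. Each call advances the 3D positions by one iteration, so the same loop simply carries on after an edge is removed.

```python
from math import sqrt


def distance(p, q):
    """Euclidean distance between two 3D points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def layout_step(pos, edges, repulsion=0.1, spring=0.05, rest=1.0):
    """Advance a 3D force-directed layout by one iteration, in place.

    pos: dict mapping node -> [x, y, z]; edges: iterable of (node, node).
    All parameter values are illustrative assumptions.
    """
    nodes = list(pos)
    force = {n: [0.0, 0.0, 0.0] for n in nodes}
    # Every pair of nodes repels with an inverse-square force.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dist = distance(pos[a], pos[b]) or 1e-9
            strength = repulsion / (dist * dist)
            for axis in range(3):
                push = strength * (pos[a][axis] - pos[b][axis]) / dist
                force[a][axis] += push
                force[b][axis] -= push
    # Each edge acts as a spring pulling its endpoints towards `rest`.
    for a, b in edges:
        dist = distance(pos[a], pos[b]) or 1e-9
        strength = spring * (dist - rest)
        for axis in range(3):
            pull = strength * (pos[b][axis] - pos[a][axis]) / dist
            force[a][axis] += pull
            force[b][axis] -= pull
    # Move every node along its net force.
    for n in nodes:
        pos[n] = [p + f for p, f in zip(pos[n], force[n])]


pos = {"a": [0.0, 0.0, 0.0], "b": [3.0, 0.0, 0.0]}
edges = {("a", "b")}
for _ in range(200):          # the layout settles while the edge exists
    layout_step(pos, edges)
connected = distance(pos["a"], pos["b"])

edges.clear()                 # the edge is removed; the layout keeps running
for _ in range(50):
    layout_step(pos, edges)
separated = distance(pos["a"], pos["b"])
```

Because the graph is re-read on every iteration, structural changes need no special handling in the loop itself; smoothing or slowing the resulting transition, as described above, would be a separate concern.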
The development of Graphia has been driven by the analytical challenges associated with data derived from the biological sciences, but it is designed as a general-purpose platform for the analysis of network data from any source. If the input data is in tabular form (continuous or discrete values), it can be used to build a graph. If data already exists in a graph format, Graphia provides a means to explore it. Graphia can load data from files or from remote web resources, and in theory it could also interface with a remote database. It is interesting to note the widespread adoption of graph databases. Not only do graph databases speed up and simplify the querying of data stores; storing data as a graph also makes it easier to visualise and analyse. Whilst there are a growing number of web-based tools that support the querying and visualisation of graph databases, none possess the power of Graphia in rendering large portions of the data they store.
Here we offer a high-level view of the functionality of Graphia and some examples of its many uses within the biomedical sciences. We provide installers that allow it to run on all common desktop operating systems, together with access to the source code so that users can develop new functionality to suit their needs.
Supporting information
S1 Fig. Comparison of Gephi, Cytoscape and Graphia in terms of large graph loading, layout and rendering performance.
Three graphs were used for these tests: a correlation graph (panels a, d, g) generated from the GNF mouse gene expression atlas using Graphia (r = 0.7) and then saved as a .gml file, and two graphs from an online repository (https://chriswalshaw.co.uk/partition/#graphs), finan512 (panels b, e, h) and fe_pwt (panels c, f, i). These were selected to represent large graphs of different node/edge counts and structure. Each graph was opened using the three tools, and the time taken to load and lay out the graph, as shown in the panel, was recorded. In each case, a force-directed layout algorithm similar to that employed by Graphia was used; for Gephi (v0.9.2) this was the Force Atlas 2 layout algorithm, and for Cytoscape (v3.9.0) the OpenCL Prefuse Layout algorithm. In the different views shown, we have not attempted to fully optimise graph layout in each case but show the layouts at 10x the standard number of iterations for Gephi and Cytoscape. While these layouts completed relatively quickly, they were far from ‘finished’ in establishing a stable layout. By contrast, we ran Graphia’s (v3.0) dynamic layout to a point where the layout was near optimal, which took considerably longer. A notable difference between the network tools was in the speed and fluidity of interaction with the graph visualisation, i.e. frames per second (fps) rates, with Graphia being an estimated 3–4 times quicker than Gephi or Cytoscape, and in 3D. The specifications of the computer used for these tests are: Intel Core i7-4930K, 16 GB memory, Nvidia GeForce RTX 2060 Super, Windows 10 Pro (10.0.19043).
https://doi.org/10.1371/journal.pcbi.1010310.s001
(DOCX)
Acknowledgments
Graphia was originally designed and built by Kajeka Ltd., a University of Edinburgh spin-out company (2015–2020).
References
- 1. Luck K, Sheynkman GM, Zhang I, Vidal M. Proteome-Scale Human Interactomics. Trends in Biochemical Sciences. 2017 p. 342–54. pmid:28284537
- 2. Vidal M, Cusick ME, Barabási AL. Interactome networks and human disease. Cell. 2011 p. 986–98. pmid:21414488
- 3. Langfelder P, Horvath S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics [Internet]. 2008;9:559. Available from: https://pubmed.ncbi.nlm.nih.gov/19114008/. pmid:19114008
- 4. Freeman TC, Goldovsky L, Brosch M, Van Dongen S, Mazière P, Grocock RJ, et al. Construction, visualisation, and clustering of transcription networks from microarray expression data. PLoS Comput Biol. 2007 Oct;3(10):2032–42. pmid:17967053
- 5. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003 Nov;13(11):2498–504. pmid:14597658
- 6. Bastian M, Heymann S, Jacomy M. Gephi: An Open Source Software for Exploring and Manipulating Networks. Int AAAI Conf Weblogs Soc Media. 2009;361–2.
- 7. Auber D, Archambault D, Bourqui R, Delest M, Dubois J, Lambert A, et al. TULIP 5. In: Alhajj R, Rokne J, editors. Encyclopedia of Social Network Analysis and Mining [Internet]. Springer; 2017 Aug p. 1–28. Available from: https://hal.archives-ouvertes.fr/hal-01654518.
- 8. Wick RR, Schultz MB, Zobel J, Holt KE. Bandage: Interactive visualization of de novo genome assemblies. Bioinformatics [Internet]. 2015 [cited 2020 Jun 7];31(20):3350–2. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4595904/?report=reader.
- 9. Ellson J, Gansner ER, Koutsofios E, North SC, Woodhull G. Graphviz and dynagraph–static and dynamic graph drawing tools. GRAPH Draw Softw [Internet]. 2003 [cited 2020 Jul 5];127–148. Available from: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.96.3776.
- 10. Batagelj V, Mrvar A. Pajek—Analysis and visualization of large networks. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) [Internet]. Springer, Berlin, Heidelberg; 2002 [cited 2020 Jul 5] p. 477–8. Available from: http://link.springer.com/10.1007/3-540-45848-4_54.
- 11. Theocharidis A, van Dongen S, Enright AJ, Freeman TC. Network visualization and analysis of gene expression data using BioLayout Express(3D). Nat Protoc [Internet]. 2009 [cited 2020 Jun 7];4(10):1535–50. Available from: http://www.ncbi.nlm.nih.gov/pubmed/19798086.
- 12. Social Network Visualiser [Internet]. Available from: https://socnetv.org/.
- 13. Smith M, Ceni A, Milic-Frayling N, Shneiderman B, Mendes Rodrigues E, Leskovec J, et al. NodeXL: a free and open network overview, discovery and exploration add-in for Excel 2007/2010/2013/2016. Social Media Research Foundation [Internet]. 2010. Available from: https://www.smrfoundation.org.
- 14. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, et al. STRING: Known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res. 2005 Jan 1;33(DATABASE ISS.):D433–D437. pmid:15608232
- 15. Mostafavi S, Ray D, Warde-Farley D, Grouios C, Morris Q. GeneMANIA: A real-time multiple association network integration algorithm for predicting gene function. Genome Biol [Internet]. 2008 Jun 27 [cited 2020 Jul 5];9(SUPPL. 1). Available from: https://pubmed.ncbi.nlm.nih.gov/18613948/.
- 16. Neo4J Bloom [Internet]. Available from: https://neo4j.com/bloom/.
- 17. Chimani M, Gutwenger C, Jünger M, Klau GW, Klein K, Mutzel P. The Open Graph Drawing Framework (OGDF). In Handbook of Graph Drawing and Visualization. CRC Press; 2014.
- 18. Pratt D, Chen J, Welker D, Rivas R, Pillich R, Rynkov V, et al. NDEx, the Network Data Exchange. Cell Syst [Internet]. Cell Press; 2015 Oct 28 [cited 2020 Jul 5];1(4):302–5. Available from: /pmc/articles/PMC4649937/?report=abstract.
- 19. Rossi RA, Ahmed NK. The Network Data Repository with Interactive Graph Analytics and Visualization [Internet]. [cited 2020 Jul 5]. Available from: http://snap.stanford.edu/data/index.html.
- 20. Leskovec J, Sosič R. SNAP: A general-purpose network analysis and graph-mining library. ACM Trans Intell Syst Technol. Association for Computing Machinery; 2016 Jul 1;8(1).
- 21. Miryala SK, Anbarasu A, Ramaiah S. Discerning molecular interactions: A comprehensive review on biomolecular interaction databases and network analysis tools. Gene. Elsevier B.V.; 2018 p. 84–94. pmid:29129810
- 22. Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat [Internet]. 1992 [cited 2020 Jul 5];46(3):175–85. Available from: https://ecommons.cornell.edu/bitstream/1813/31637/1/BU-1065-MA.pdf.
- 23. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research. Oxford University Press; 2002 p. 1575–84. pmid:11917018
- 24. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp. IOP Publishing; 2008 Oct 9;2008(10):P10008.
- 25. Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL, Schuler GD, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res [Internet]. 2000 Jan 1 [cited 2020 Jul 5];28(1):10–4. Available from: http://www.ncbi.nlm.nih.gov/pubmed/10592169. pmid:10592169
- 26. Van Der Maaten L, Hinton G. Visualizing Data using t-SNE. Journal of Machine Learning Research. 2008.
- 27. Becht E, McInnes L, Healy J, Dutertre CA, Kwok IWH, Ng LG, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. Nature Publishing Group; 2019 Jan 1;37(1):38–47.
- 28. Schaum N, Karkanias J, Neff NF, May AP, Quake SR, Wyss-Coray T, et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature. Nature Publishing Group; 2018 Oct 18;562(7727):367–72. pmid:30283141
- 29. Chung NC. Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics. NLM (Medline); 2020 May 1;36(10):3107–14. pmid:32142108
- 30. Gautreau G, Bazin A, Gachet M, Planel R, Burlot L, Dubois M, et al. PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLoS Comput Biol. Public Library of Science; 2020;16(3):e1007732.
- 31. Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol [Internet]. BioMed Central; 2020 Dec 22 [cited 2020 Jul 26];21(1):180. Available from: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02090-4.
- 32. Richardson EJ, Bacigalupe R, Harrison EM, Weinert LA, Lycett S, Vrieling M, et al. Gene exchange drives the ecological success of a multi-host bacterial pathogen. Nat Ecol Evol [Internet]. Nature Publishing Group; 2018 Sep 1 [cited 2020 Jul 26];2(9):1468–78. Available from: https://pubmed.ncbi.nlm.nih.gov/30038246/.
- 33. Seemann T. Prokka: Rapid prokaryotic genome annotation. Bioinformatics. Oxford University Press; 2014 Jul 15;30(14):2068–9.
- 34. Bayliss SC, Thorpe HA, Coyle NM, Sheppard SK, Feil EJ. PIRATE: A fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria. Gigascience [Internet]. 2019 [cited 2020 Jul 5];8:giz119. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6785682/.
- 35. Novák P, Neumann P, Macas J. Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics. 2010 Jul 15;11. pmid:20633259
- 36. Nazarie FW, Shih B, Angus T, Barnett MW, Chen SH, Summers KM, et al. Visualization and analysis of RNA-Seq assembly graphs. Nucleic Acids Res [Internet]. 2019 [cited 2020 Jul 5];47(14):7262–75. Available from: https://pubmed.ncbi.nlm.nih.gov/31305886/.
- 37. Auton A, Abecasis GR, Altshuler DM, Durbin RM, Bentley DR, Chakravarti A, et al. A global reference for human genetic variation. Nature. Nature Publishing Group; 2015 p. 68–74. pmid:26432245
- 38. Theocharidis A, van Dongen S, Enright AJ, Freeman TC. Network visualization and analysis of gene expression data using BioLayout Express(3D). Nat Protoc. 2009;4(10):1535–50. pmid:19798086
- 39. O’Hara L, Livigni A, Theo T, Boyer B, Angus T, Wright D, et al. Modelling the Structure and Dynamics of Biological Pathways. PLoS Biol. Public Library of Science; 2016 Aug 10;14(8). pmid:27509052
- 40. Freeman TC, Ivens A, Baillie JK, Beraldi D, Barnett MW, Dorward D, et al. A gene expression atlas of the domestic pig. BMC Biol. 2012 Nov 15;10. pmid:23153189
- 41. Xue J, Schmidt S V., Sander J, Draffehn A, Krebs W, Quester I, et al. Transcriptome-Based Network Analysis Reveals a Spectrum Model of Human Macrophage Activation. Immunity. 2014 Feb 20;40(2):274–88. pmid:24530056
- 42. Patir A, Fraser AM, Barnett MW, McTeir L, Rainger J, Davey MG, et al. The transcriptional signature associated with human motile cilia. Sci Rep [Internet]. 2020 Dec 2 [cited 2020 Jul 5];10(1):10814. Available from: http://www.nature.com/articles/s41598-020-66453-4.
- 43. Clark EL, Bush SJ, McCulloch MEB, Farquhar IL, Young R, Lefevre L, et al. A high resolution atlas of gene expression in the domestic sheep (Ovis aries). PLoS Genet. Public Library of Science; 2017 Sep 1;13(9).
- 44. Nirmal AJ, Regan T, Shih BB, Hume DA, Sims AH, Freeman TC. Immune cell gene signatures for profiling the microenvironment of solid tumors. Cancer Immunol Res. American Association for Cancer Research Inc.; 2018 Nov 1;6(11):1388–400. pmid:30266715
- 45. Hall DP, MacCormick IJC, Phythian-Adams AT, Rzechorzek NM, Hope-Jones D, Cosens S, et al. Network analysis reveals distinct clinical syndromes underlying acute mountain sickness. PLoS One [Internet]. Public Library of Science; 2014 Jan 22 [cited 2020 Jul 5];9(1). Available from: https://pubmed.ncbi.nlm.nih.gov/24465370/.
- 46. Regan T, Barnett MW, Laetsch DR, Bush SJ, Wragg D, Budge GE, et al. Characterisation of the British honey bee metagenome. Nat Commun. Nature Publishing Group; 2018 Dec 1;9(1).
- 47. Rzechorzek NM, Saunders OM, Hiscox L V., Schwarz T, Marioni-Henry K, Argyle DJ, et al. Network analysis of canine brain morphometry links tumour risk to oestrogen deficiency and accelerated brain ageing. Sci Rep. Nature Publishing Group; 2019 Dec 1;9(1). pmid:31467332
- 48. Bush SJ, Powell-Smith A, Freeman TC. Network analysis of the social and demographic influences on name choice within the UK (1838–2016). PLoS One. Public Library of Science; 2018 Oct 1;13(10).
- 49. Archambault D, Purchase H, Pinaud B. Animation, small multiples, and the effect of mental map preservation in dynamic graphs. IEEE Trans Vis Comput Graph [Internet]. 2011 [cited 2020 Jul 5];17(4):539–52. Available from: https://pubmed.ncbi.nlm.nih.gov/20498503/. pmid:20498503