Graphia: A platform for the graph-based visualisation and analysis of high dimensional data

Graphia is an open-source platform created for the graph-based analysis of the huge amounts of quantitative and qualitative data currently being generated from the study of genomes, genes, proteins, metabolites and cells. Core to Graphia’s functionality is support for the calculation of correlation matrices from any tabular matrix of continuous or discrete values, whereupon the software is designed to rapidly visualise the often very large graphs that result in 2D or 3D space. Following graph construction, an extensive range of measurement algorithms, routines for graph transformation, and options for the visualisation of node and edge attributes are available for graph exploration and analysis. Combined, these provide a powerful solution for the interpretation of high-dimensional data from many sources, or data already in the form of a network or equivalent adjacency matrix. Several use cases of Graphia are described, to showcase its wide range of applications in the analysis of biological data. Graphia runs on all major desktop operating systems, is extensible through the deployment of plugins and is freely available to download from https://graphia.app/.


Introduction
The study of interactions between entities is a cornerstone of modern analytics. In biology, efforts to map the 'interactome' (all the interactions between the components of a biological system) have been underway for some time, using a number of complementary approaches [1,2]. Networks of biological data may also be used to chart diverse phenomena such as the spread of disease, the interactions between drugs and their targets, and the evolutionary relationships between species. Many data from other sectors are also inherently graph-based in structure, for example interactions on social media platforms, customer/client relationships, communication and transport systems, computer networks and many other real-world systems. Matrices of numerical data that do not inherently possess a network structure can also be analysed using graph-based approaches. Wherever it is possible to calculate the distance between entities, a graph can be constructed in which entities are represented by nodes and high-confidence measures define the edges between them. In biology, such an approach is already widely used to analyse high dimensional data, in particular to construct and analyse gene coexpression networks [3,4], but the approach is applicable to any numerical or categorical data from any source.
Given the explosion in the availability of data in recent years and the potential to visualise and analyse it using graph-based approaches, a variety of software tools to support these activities have been developed. In biology, Cytoscape [5] is perhaps the most widely used software for performing graph analytics. It has a large user base and supports many 'apps' (plugins) created by the community for the performance of specific graph-based analysis tasks. Other network visualisation and analysis tools include Gephi [6], Tulip [7], Bandage [8], Graphviz [9], Pajek [10], yEd (yFiles, Tübingen, Germany), BioLayout [4,11], Social Network Visualiser [12] and NodeXL [13]. We have compared a number of the most commonly used network analysis tools in Table 1 and, in the case of Gephi and Cytoscape, compared some of the key aspects of graph visualisation (S1 Fig). There is also a range of web-based software tools exclusively designed to visualise portions of data, often from a designated database, such as String [14], GeneMania [15] and Neo4J Bloom [16]. Some of these tools are focused on supporting a particular community, whilst others possess functionality tailored towards specific tasks or data types, and they include a mix of open-source projects and commercial tools. Others provide open-source code repositories for graph visualisation and analysis algorithms [17], or share repositories of graph data [18][19][20]. For a more comprehensive review of network analysis tools and resources, see [21].
Despite the availability of a wide range of downloadable applications, web resources and code libraries to support graph-based analyses, there is a pressing need for easy-to-use software that supports the rapid visualisation and analysis of very large networks. To address this need, we developed Graphia, a general-purpose graph analysis tool that supports the integration, visualisation, analysis, and interpretation of a wide variety of data types. Here we provide an overview of Graphia's core functionality for the analysis of graphs and describe a number of case studies in which it is applied to solve problems associated with the analysis of data derived from the biological sciences.

Design criteria
The following features were considered core to the design of Graphia:
• Data and operating system agnostic. Import data from any source saved in standard file formats. The software should run on all major desktop operating systems and modern hardware configurations.
• Fast and scalable. Support the rapid loading of data, fast computation of graph layout and analysis algorithms, and high-quality data visualisations. Deliver smooth and responsive graphical rendering of millions of data points (nodes/edges) on standard desktop hardware.
• Dynamic rendering. Visualise in real time changes to the graph structure associated with alterations in input parameters or additional data.
• 3D graph visualisation. Provide a navigable and immersive environment in which to explore and interpret large and complex graph topologies.
• Correlation graphs as an essential function. Rapidly convert any numerical or categorical data table into a correlation graph, supporting pattern finding and data mining.
• Attribute handling and visualisation. Visualise attributes (metadata) associated with nodes and edges using colour, size and text to distinguish between attribute values.
• Advanced analysis capabilities. Support a wide range of analytical algorithms and approaches that empower a user to explore, query and interpret data, such as the k-NN algorithm [22] for edge pruning, and the MCL [23] and Louvain [24] graph clustering algorithms.
• Extensible. Provide an extensible architecture through use of a plugin system, allowing the core to be extended or adapted for specific application areas or data types.
• User Interface. Provide a simple and intuitive user interface (UI) that is easy to navigate, featuring a graph display area supplemented with a table listing selected nodes and associated attributes and data values. The UI should provide easy access to menus providing functionality and display active transformations and visualisations (Fig 1).
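The correlation-graph criterion above can be illustrated with a minimal Python sketch. It computes pairwise Pearson correlations between the rows of a small table and keeps pairs at or above a threshold as weighted edges. This is not Graphia's implementation (which is written in C++); the function names `pearson` and `correlation_graph`, the example row names and the threshold are assumptions made purely for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant row: correlation is undefined, treated as 0 here
    return cov / sqrt(var_x * var_y)


def correlation_graph(rows, threshold):
    """Return weighted edges (name_a, name_b, r) for row pairs with r >= threshold.

    rows maps a row name (hypothetical here) to its list of values.
    """
    names = sorted(rows)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(rows[a], rows[b])
            if r >= threshold:
                edges.append((a, b, r))
    return edges


# Hypothetical three-row table: geneA and geneB rise together, geneC falls.
table = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.2, 7.9],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
edges = correlation_graph(table, threshold=0.85)
```

Only the geneA/geneB pair passes the threshold in this toy table. The all-pairs loop is quadratic in the number of rows, which is why a production tool would parallelise this step.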

Code architecture
Graphia is written in C++17 and is built upon Qt version 5, the cross-platform widget toolkit. For graphics, the industry standard OpenGL is used. The minimum driver support required is version 3.3 core profile, but more modern extensions will be used if they are available. Various open source libraries are employed, mostly for loading external data formats. These libraries and their associated licenses are enumerated in the About dialog of the application, accessible from the Help menu. Graphia is architected so that data loading and data-type-specific user interfaces are confined to plugins. These modules are independent of the core application and can be removed or added without affecting any base functionality.
At the highest level the code is organised hierarchically into four separate directories:
• app: the core application code
• plugins: the existing bundled plugins
• shared: code used by both the core and plugins (this includes interface headers)
• thirdparty: any library code not authored locally
These are further divided into subdirectories dealing with specific areas of functionality. Graphia has been developed using standard object-orientated best practices. Continuous integration is employed to prevent portability build regressions, using recent versions of the compilers GCC, clang and MSVC. In addition, static analysis tools such as clang-tidy and cppcheck are used to identify potential problems early. CMake is used as a build system, and is set up for Linux, Windows and macOS compilation.

Implementation
Data import. Graphia has been designed to import data encoded in a variety of standard and non-standard file formats. These include standard graph-based file formats such as BioPAX OWL ontology (.owl), JSON graph (.json), GraphML (.graphml), Graph Modelling Language (.gml), Cytoscape Exchange Files (.cx/.cx2) and MATLAB data file (.mat) formats, but also unstandardised formats such as edge lists, adjacency matrices, and tabular data prepared for correlation analyses. Using these file formats, a wide variety of data may be imported into Graphia, not only the nodes and edges that define a graph, but also user-defined attributes or metadata.
Fast and scalable. Existing graph visualisation tools either fail to render very large graphs effectively, or the ability to interact with a graph once rendered is limited and slow. Therefore, all aspects of Graphia's functionality have been engineered to run quickly. Graphia can render graphs of millions of data points on relatively commonplace hardware, and interaction with them is fast and fluid. This has been achieved through the use of optimal coding practices and parallelisation of computationally intensive analysis routines, e.g. calculation of correlation matrices, graph layout and clustering.
Dynamic graph layout and rendering. Graph layout is an iterative process. Many programs only display the results of a layout algorithm after it has run a defined number of iterations. With Graphia, the layout is shown live, such that graphs 'unfold' in real time. However, the true power of dynamic graphs is realised when a transformation operation is performed or following the addition of new data. These changes are immediately reflected in the appearance of the graph. As graphs change dynamically, there is a need to identify the graph components and map how they interact when construction parameters are adjusted. The ability to quickly identify and move between components is a unique feature of Graphia. Components are rendered in a concentric pattern, arranged large-to-small. Smaller components may be filtered away using a transform. Most existing network analysis tools render graphs in 2D. The layout algorithm implemented in Graphia is innovative in that it applies current force-directed layout techniques, but in a dynamic fashion. Graphia renders graphs in 3D or 2D, making use of modern graphics hardware to display extremely large graphs efficiently with options for node/edge shading, relative node sizing and spacing (Fig 2A-2F). When graphs are relatively small or there is a need to share images by a conventional medium, e.g. a document, 2D graph visualisations have advantages. However, 2D visualisations are limiting when there is a need to display and interact with large graphs with complex topologies.

PLOS COMPUTATIONAL BIOLOGY
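The handling of graph components described above (identify components, order them large-to-small, optionally filter away the small ones) can be sketched in a few lines of Python. This is an illustrative breadth-first search, not Graphia's C++ implementation; the function names `connected_components` and `drop_small_components` and the `min_size` parameter are hypothetical.

```python
from collections import deque


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph, largest first."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))

    # Large-to-small ordering, as used when arranging components for display.
    components.sort(key=len, reverse=True)
    return components


def drop_small_components(components, min_size):
    """Keep only components with at least min_size nodes."""
    return [c for c in components if len(c) >= min_size]


components = connected_components(
    list("abcdef"), [("a", "b"), ("b", "c"), ("d", "e")]
)
large_only = drop_small_components(components, min_size=2)
```

In a dynamic setting this computation would be repeated whenever a transform adds or removes edges, since a single change can merge or fragment components.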
Attribute-to-visual mapping. Attributes are data values associated with nodes/edges. These can be user-defined or calculated by Graphia. For example, a node representing a person may be associated with knowledge of their gender, occupation, socioeconomic class, ethnicity, etc. (categorical attributes), as well as their height, age, weight, years in employment (numerical attributes). Colour can be used to represent categorical attributes, with nodes sharing the same attribute being assigned the same colour. In the case of numerical attributes, colour and size can be used to represent the value according to a spectrum, e.g. from small white nodes to big red nodes to represent low and high values, respectively. Both types of attribute may also be calculated from the graph itself, e.g. the assignment of nodes to clusters or calculation of node degree, PageRank values etc. (Fig 2G-2I). Visualising attribute values may help explain graph structure, for instance an area of a graph might be visibly associated with nodes of a given attribute. Attributes may also be used to analyse the statistical associations with graph topology, for example, graph clusters may be analysed for the enrichment of nodes with a given attribute. Attribute values may also be specific to only a single element, e.g. a unique node name.
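As a minimal sketch of the attribute-to-visual mapping just described, the Python below maps a numerical attribute onto a white-to-red colour gradient with increasing node size, and assigns categorical attribute values a shared colour. The function names, the size range and the palette are assumptions made for illustration and do not reflect Graphia's internal API.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def numeric_to_visual(value, low, high, min_size=1.0, max_size=4.0):
    """Map a numerical attribute to ((r, g, b), size).

    Low values give small white nodes, high values give large red nodes.
    Values outside [low, high] are clamped.
    """
    if high == low:
        t = 0.0
    else:
        t = min(1.0, max(0.0, (value - low) / (high - low)))
    fade = round(lerp(255, 0, t))  # green and blue channels fade out
    return (255, fade, fade), lerp(min_size, max_size, t)


def categorical_to_colour(values, palette):
    """Give each distinct value a palette colour, in order of first appearance.

    The palette is reused cyclically if there are more values than colours.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = palette[len(mapping) % len(palette)]
    return mapping


# Hypothetical node attributes: node degree (numerical) and cell type (categorical).
low_node = numeric_to_visual(0, low=0, high=10)
high_node = numeric_to_visual(10, low=0, high=10)
colours = categorical_to_colour(["T cell", "B cell", "T cell"], ["red", "blue"])
```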

Results
Described below are a number of use cases for Graphia in the context of biological data.

Case Study 1: Visualisation of phylogenetic trees
Hierarchical data structures are often represented by tree graphs and used in biology to represent relationships between species, strains, samples or genes. While trees are an intuitive way of visualising such relationships, when the number of branches on the tree becomes large, displaying such graphs at a local or global level is challenging. Here we show two examples of taxonomic trees visualised by Graphia representing the different levels of phylogeny, from a central node representing the class of organisms, up through branches representing the order, family and genus, with species and subspecies being the leaves of the tree. The examples described are taxonomic trees for all mammals and insects (Fig 3), as defined by the NCBI Taxonomy database [25]. The taxonomic tree of mammals consists of 9,843 nodes and 9,862 edges and is shown in 2D with nodes coloured by type, i.e. what level of the taxonomic tree they represent (Fig 3A). A small section associated with apes is highlighted (Fig 3B). When the graph is loaded using the WebSearch plugin, selection of a single node automatically initiates a search for the name of the selected node (Fig 3C). Shown in Fig 3D is a taxonomic tree of all insect species, consisting of 275,328 nodes and 275,528 edges displayed in 3D. Graphia provides a unique environment to search, cluster and explore such data; the third dimension greatly assists in visualising the structure of such large graphs.

Case Study 2: Analysis of single cell transcriptomics data
Single cell RNA sequencing (scRNA-Seq) generates gene expression profiles for thousands of individual cells in a single assay. The approach is an unbiased way of identifying cell types in a mixed population of cells, both known and uncharacterised, in addition to the genes that define them. This has led to a surge of interest in dimensionality reduction methods, in particular t-SNE [26] and UMAP [27], as users seek to optimise the visualisation of results (Fig 4A and 4B). These methods are constrained by issues associated with representing the underlying data structure, i.e. the relationships between cell groupings. When tens of thousands of cells are analysed, a 2D plot space is limiting.
An alternative approach is to treat single cell data as a graph, where nodes represent cells or genes, and edges the similarity between them. There are a number of measures that may be used to calculate the distance between cells or genes and here we discuss our currently favoured approach. The Tabula Muris dataset [28] includes scRNA-Seq data from 20 different mouse tissues. For the purpose of this study we selected only data from tissue immune cells, as annotated by the authors. Preprocessing and quality control were performed as per the Tabula Muris pipeline (https://github.com/czbiohub/tabula-muris), producing a normalised dataset of 14,466 cells from 12 tissues. Principal component analysis (PCA) was conducted to reduce the gene profile of a cell to principal components (PCs), with the 48 most significant PCs being considered (adj. P-value <0.05), based on Jackstraw permutations [29]. This file (cells as rows, PCs as columns) was loaded into Graphia and a graph was generated based on the Pearson correlation coefficient between the PC profiles of cells. An initial network was generated by applying the k-NN algorithm (k = 15), and outlier cells were removed. Outliers were defined as cells with poor correlations and connectivity with other cells of the same cell type (r < 0.85 and node degree < 10). Applying these filters produced a network that better separated known cell types. Subsequently, the filtered network of 12,498 cells was clustered using the Louvain clustering algorithm [24] with a granularity of 0.8, identifying 36 cell clusters (Fig 4C).
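The k-NN edge-pruning step used above can be sketched as follows. The sketch assumes that an edge is retained if it ranks among the k strongest edges of at least one of its two endpoints; this retention rule is an assumption for illustration, and Graphia's exact rule may differ. The function name `knn_prune` and the cell identifiers are hypothetical.

```python
def knn_prune(edges, k):
    """Keep each edge that is among the k highest-weight edges of either endpoint.

    edges is a list of (node_a, node_b, weight) tuples, where weight is a
    similarity such as a Pearson correlation coefficient.
    """
    # Collect the edges incident to every node.
    by_node = {}
    for edge in edges:
        a, b, _ = edge
        by_node.setdefault(a, []).append(edge)
        by_node.setdefault(b, []).append(edge)

    # For every node, mark its k strongest edges as kept.
    kept = set()
    for node_edges in by_node.values():
        strongest = sorted(node_edges, key=lambda e: e[2], reverse=True)[:k]
        kept.update(strongest)

    # Preserve the input order of the surviving edges.
    return [edge for edge in edges if edge in kept]


# Hypothetical cell-to-cell correlations.
cell_edges = [
    ("cell1", "cell2", 0.95),
    ("cell1", "cell3", 0.90),
    ("cell1", "cell4", 0.70),
    ("cell2", "cell3", 0.88),
    ("cell3", "cell4", 0.60),
]
pruned = knn_prune(cell_edges, k=1)
```

With k = 1, the edges cell2/cell3 and cell3/cell4 are dropped because neither is the strongest edge of either endpoint, while cell1/cell4 survives as the strongest edge of cell4. Under this rule a node can retain more than k edges when it is a top-k neighbour of other nodes.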
Following the identification of cell clusters, it is generally of interest to identify gene markers and expression modules associated with different cell types. Due to the inherent noise in scRNA-Seq data, it is of limited use to construct gene networks based on the correlation between expression values of individual cells. One alternative is to construct gene correlation networks based on an aggregated gene expression value across cells from each cluster, i.e. the similarity between clusters of cells instead of individual cells. Accordingly, a matrix of the average gene expression values across the clusters defined above was loaded into Graphia. A network of strongly correlated genes (r > 0.85) was generated and gene coexpression modules identified using the Markov clustering (MCL) algorithm [23] with a granularity setting of 1.7 (Fig 4D). Graphia enables a dynamic and rapid exploration of these clusters, allowing a user to understand where a given cluster sits within the context of the entire graph, the identity of genes present within a given cluster and the profile of all or some of those genes across samples, in this case cell clusters (Fig 4E).
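The aggregation step described above, in which per-cell expression values are averaged within each cell cluster before gene-gene correlations are calculated, might be sketched as below. The data layout (nested dictionaries), the function name `cluster_means` and the example gene and cell identifiers are assumptions made for illustration; the paper does not specify how the averaged matrix was prepared.

```python
def cluster_means(expression, cell_cluster):
    """Average each gene's expression over the cells of each cluster.

    expression:   {gene: {cell: value}}
    cell_cluster: {cell: cluster label}
    Returns       {gene: {cluster label: mean value}}
    """
    means = {}
    for gene, per_cell in expression.items():
        totals, counts = {}, {}
        for cell, value in per_cell.items():
            cluster = cell_cluster[cell]
            totals[cluster] = totals.get(cluster, 0.0) + value
            counts[cluster] = counts.get(cluster, 0) + 1
        means[gene] = {c: totals[c] / counts[c] for c in totals}
    return means


# Hypothetical input: two genes measured in three cells from two clusters.
expression = {
    "gene1": {"cell1": 2.0, "cell2": 4.0, "cell3": 0.0},
    "gene2": {"cell1": 0.0, "cell2": 0.0, "cell3": 6.0},
}
cell_cluster = {"cell1": "cluster1", "cell2": "cluster1", "cell3": "cluster2"}
averaged = cluster_means(expression, cell_cluster)
```

The resulting gene-by-cluster table has one column per cluster rather than one per cell, which smooths out per-cell noise before correlations between gene profiles are computed.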

Case Study 3: Exploration of bacterial pangenome structure
Whole genome sequencing is now routine in many fields. One common use is in the characterisation of microbial species and public databases already hold tens of thousands of genome sequences for the best studied organisms. Graph-based methods and tools that support the visualisation and analysis of such data are well established. For instance, Bandage is a software package now widely used to visualise de novo assembly graphs of bacterial genomes [8]. PPanGGOLiN [30] and Panaroo [31] are graph-based pangenome clustering tools for the analysis of genomic diversity within a bacterial species (i.e. its pangenome), which can then be used to statistically classify genes according to their occurrence in the genomes.
Comparative analyses of bacterial sequences have revealed a high degree of genetic diversity between isolates of the same organism, leading to the concept of "core" genes present in all isolates and "accessory" genes present only in some isolates. The distribution and organisation of accessory genes has a significant impact on an organism's virulence, its response to drugs and its ability to adapt to different hosts and niches. Fig 5 shows a graph generated from a previously published dataset of whole genome sequencing data of 778 Staphylococcus aureus isolates [32]. Genomes were annotated using Prokka (v1.13) [33] and their pangenome defined using the analysis pipeline PIRATE (v1.0.3, default settings used, 95% sequence similarity threshold) [34]. In this visualisation, nodes represent individual genes/gene variants and edges the syntenic relationships between them. When the visualisation is enhanced by making node/edge size and colour proportional to the number of genomes in which a gene is present or two genes are syntenic across the dataset, core regions of the genome become easy to identify (Fig 5A). These can also be collapsed to single nodes to simplify the graph. Similarly, areas of high variability within the pangenome are obvious (Fig 5B). Graphia can be used to identify specific nodes, for example the path of a single genome (RF122) can be shown in the context of the wider pangenome (Fig 5C), or to explore smaller local variations, e.g. the quorum sensing locus agrABCD, which shows four variants when defined at this similarity threshold (Fig 5D and 5E). In principle, displaying variation between sequences as a network is applicable to any such data. This has relevance to "pan-reference" genomes for more complex species such as humans, as well as for showing other variation, such as clustering repeat regions [35] or alternative isoforms of transcripts [36].
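To make the structure of such a pangenome graph concrete, the following Python sketch builds node and edge counts from per-isolate gene orders: a node's count is the number of genomes containing that gene, and an edge's count is the number of genomes in which two genes are adjacent. It is a simplified illustration, not the PIRATE or Graphia implementation. It treats each genome as a linear gene list (ignoring circular chromosomes and strand), and the function name `synteny_graph` and all isolate and gene identifiers are hypothetical.

```python
from collections import Counter


def synteny_graph(genomes):
    """Count gene presence and gene adjacency across genomes.

    genomes: {isolate: [gene family identifiers in chromosomal order]}
    Returns (node_counts, edge_counts), each counting genomes, not occurrences.
    """
    node_counts = Counter()
    edge_counts = Counter()
    for order in genomes.values():
        # A gene is counted once per genome in which it occurs.
        node_counts.update(set(order))
        # Neighbouring genes define undirected syntenic edges.
        adjacent = {tuple(sorted(pair)) for pair in zip(order, order[1:])}
        edge_counts.update(adjacent)
    return node_counts, edge_counts


genomes = {
    "isolate1": ["geneA", "geneB", "geneC"],
    "isolate2": ["geneA", "geneB", "geneX", "geneC"],
}
node_counts, edge_counts = synteny_graph(genomes)

# Genes present in every genome form the "core"; the rest are "accessory".
core = sorted(g for g, n in node_counts.items() if n == len(genomes))
```

Here the insertion of geneX in the second isolate opens a small "bubble" between geneB and geneC, which is the kind of local variation that becomes visible when node and edge size are scaled by these counts.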

Case Study 4: Analysis of human genome variation
Genome variation also occurs at the level of individual DNA base pairs, so-called single nucleotide variants (SNVs). Genotypes can be scored in an individual in terms of their allele dosage, i.e. both alleles being the same as the reference nucleotide (0), heterozygous (1) or homozygous for the variant (2). Calculation of the correlation across a range of these positions in a population of individuals creates a relationship matrix. Here, using data from the 1000 Genomes Project [37], 23,675 SNVs from chromosome 22 were used to generate graphs of the relationships between individuals in this cohort and the SNVs themselves. These data represent the genomes of 2,504 individuals, who were selected to represent 26 distinct ethnic populations from five continents. The average correlation between the SNV profiles of individuals is low and the graph shown in Fig 6A is constructed using a threshold of r = 0.238, a value at which most of the genomes in the cohort form one connected component. The k-NN algorithm [22] was applied to reduce the number of edges (from 400.4k to 5,996, opening up the local structure) such that only the strongest three relationships between individuals were maintained. The topology of the graph is clearly strongly influenced by the ethnicity of individuals, with discrete clusters being observed for all five continental populations and, in some cases, individuals from certain countries or ethnicities showing a local grouping within this overall structure. Also visible from the graph are a number of closely related individuals (Fig 6Ai) and a number of instances where an individual does not co-occur with their annotated population, for example there are a number of South Americans grouped with the Africans, and vice versa (Fig 6B). Transposing the matrix to analyse the similarity between the profiles of SNVs across the 2,504 individuals, at the threshold used here (r = 0.75), 11,600 SNVs formed 2,467 separate graph components of more than one node (Fig 6C).
After clustering the graph using the Louvain algorithm [24], many of the clusters contain nearby SNVs, likely haplotype blocks, some of which were clearly associated with a given population. Graph analysis represents an improved approach compared to e.g. PCA plots for visualising genetic associations between individuals and genetic variants.
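The allele dosage scoring and the transposition step described above can be illustrated with a short Python sketch. It assumes biallelic sites with genotypes written as two alleles separated by `|` or `/`, where 0 is the reference allele and 1 the variant; that input format, the function names and the individual identifiers are assumptions for illustration rather than a description of how the data were prepared for Graphia.

```python
def allele_dosage(genotype):
    """Convert a diploid genotype such as '0|1' or '1/1' to a dosage of 0, 1 or 2.

    Assumes a biallelic site coded 0 (reference) and 1 (variant).
    """
    alleles = genotype.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele == "1")


def dosage_rows(calls):
    """Map {individual: [genotype per SNV]} to {individual: [dosage per SNV]}."""
    return {ind: [allele_dosage(g) for g in gts] for ind, gts in calls.items()}


def transpose(rows):
    """Turn per-individual rows into per-SNV profiles across individuals.

    Individuals are taken in the dictionary's insertion order.
    """
    return [list(column) for column in zip(*rows.values())]


# Hypothetical genotype calls for two individuals at three SNVs.
calls = {
    "individual1": ["0|0", "0|1", "1|1"],
    "individual2": ["0|1", "0|1", "0|0"],
}
by_individual = dosage_rows(calls)   # rows correlate individuals
by_snv = transpose(by_individual)    # rows correlate SNVs
```

Correlating the rows of `by_individual` gives the individual-to-individual graph, while correlating the rows of `by_snv` gives the SNV-to-SNV graph in which covarying SNVs cluster together.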

Discussion
Data-driven research is now a foundation of modern biomedical and agricultural sciences, due to continued growth in the size and complexity of biological datasets. Network analysis provides a flexible toolbox combining visualisation with the algorithmic analysis of data structure, for testing a broad range of hypotheses and hypothesis-free data explorations. Graphia is designed for the visualisation and analysis of large graphs. Originally, our interest in graph-based analysis was driven by our desire to analyse large correlation networks of transcriptomics data. The results of weighted gene coexpression analysis (WGCNA) [3] are generally visualised as a tree diagram or heat-map. The precursor of Graphia, BioLayout Express 3D [4,11], was developed specifically to generate and display transcriptomic data and pathway modelling [4,38,39]. BioLayout has been used in the analysis of many large transcriptomic datasets from multiple species [40][41][42][43][44]. It has also been applied to datasets that were not envisaged at the time, for example the relationship between symptoms of altitude sickness [45], the honey bee microbiome [46], comparing morphometric measurements of dog brains [47] and even naming patterns in historical birth records [48]. The addition of new functionality to BioLayout was however constrained by inherent limitations in the code structure and programming language (Java).
Graphia is an entirely new analytical platform developed using a modern UI framework (Qt) and programming language (C++). The correlation plugin reproduces and improves upon the functionality of BioLayout for the analysis of any high dimensional numerical matrix. Data visualisation is core to the functionality of Graphia. Good visualisations make it easier for a user to recognise patterns, trends, and outlier groups within data. The next step in an analysis is determined by insights gained from the interaction with the visualisation, whether that be the discovery of errors in the input data, data effects due to technical reasons, or new and interesting discoveries. Graphia is designed to make best use of the latest accelerated graphics hardware, to make graph visualisations scalable but still responsive in real time. By default, graphs are rendered in 3D, where the visualisation and navigation of complex graph topologies is much enhanced; the additional dimension provides the ability to distinguish the distance between what might appear in 2D to be closely connected nodes. Another core aspect to the visualisation of data is the concept of graphs being 'dynamic', changing in real time as nodes/edges are added or removed. To achieve this, the layout algorithm runs continuously, unless manually paused. Dynamic transitions may become challenging when graph structure alters dramatically following a transformation, such as when a hub-node is deleted from a tree graph or one graph component fragments into many. If such a transformation is executed quickly, a user's 'mental map' can be lost [49]. For this reason, Graphia includes the option to slow down the transition between one state and the next, and in addition orientates components 'in flight' prior to their reconnection. Indeed, the way in which Graphia handles graph components dynamically is quite unique.

Fig 5. ... and their thickness is determined by the number of times this syntenic connection is observed across isolates. Syntenic stretches of core genes have been collapsed for clarity using the "Contract Edges" transform, and low confidence nodes and edges (n < 3) have been removed. Coloured by Weighted Louvain Clustering (granularity = 1.0). (B) Highly variable region (boxed area in A) with a high density of "phage-like" genes. Nodes and edges are sized and coloured by frequency. (C) Genes highlighted are all found in a single S. aureus isolate, RF122. (D) The agrABCD locus coloured by gene-association clustering. Frequently, an alternative allele is not identified as being the same gene, but their position is strongly indicative of shared function. (E) The same locus coloured by gene identity.
https://doi.org/10.1371/journal.pcbi.1010310.g005

Fig 6. The graph shown was constructed from data from the 1000 Genomes Project based on the correlation (r threshold = 0.238) between the allele dosages at 23,675 SNVs from chromosome 22. Nodes represent the 2,504 individuals included in the study and edges the three most significant correlations with their neighbours (k-NN was applied where k = 3). In most cases, individuals group with others from the same continent although there are instances where this does not appear to be the case. Visualisation of edge weights (Ai) also highlights cases where individuals would appear to be closely related. (B) Colouring of nodes by the attribute 'population' provides a higher resolution to the graph and populations showing a high degree of homogeneity have been labelled. (C) Transposing the data upon import demonstrates SNVs whose pattern across the genome covaries. Clustering of these data shows many to represent haplotype blocks and inspection of their profile across genomes demonstrates some SNV clusters to be associated with a given ethnic grouping, e.g. cluster 3 (Africans) and cluster 14 (East Asia), whilst others show little obvious association with ethnicity, e.g. cluster 6. Plots show the average score of SNVs within a cluster (y-axis, 0,1,2), across the 2,504 individuals ordered by continent and then population (x-axis).
https://doi.org/10.1371/journal.pcbi.1010310.g006
The development of Graphia has been driven by the analytical challenges associated with data derived from the biological sciences, but it is designed as a general-purpose platform for the analysis of network data from any source. If the input data is in tabular form (continuous or discrete values), it can be used to build a graph. If data already exists in a graph format, Graphia provides a means to explore it. Graphia can load data from files or from remote web resources, and in principle it could also interface with a remote database. It is interesting to note the widespread adoption of graph databases. Not only do graph databases speed up and simplify querying of data stores, but storage of data as a graph also makes it easier to visualise and analyse. Whilst there are a growing number of web-based tools that support the querying and visualisation of graph databases, none possess the power of Graphia in rendering large portions of the data they store.
Here we offer a high-level view of the functionality of Graphia and some examples of its many uses within the biomedical sciences. We provide installers to allow it to run on all common desktop operating systems, and access to the source code to allow users to develop new functionality tailored to their needs.
Supporting information
S1 Fig. Comparison of Gephi, Cytoscape and Graphia in terms of large graph loading, layout and rendering performance. Three graphs were used for these tests: a correlation graph (panels a, d, g) generated from the GNF mouse gene expression atlas using Graphia (r = 0.7), then saved as a .gml file, and two graphs from an online repository (https://chriswalshaw.co.uk/partition/#graphs), finan512 (panels b, e, h) and fe_pwt (panels c, f, i). These were selected to represent large graphs of different node/edge counts and structure. Each graph was opened using the three tools and the time taken to load and lay out the graph shown in the panel was recorded. In each case, a force-directed layout algorithm similar to that employed by Graphia was used; for Gephi (v0.9.2) this was the Force Atlas 2 layout algorithm, and for Cytoscape (v3.9.0) the OpenCL Prefuse Layout algorithm. In the different views shown, we have not attempted to fully optimise graph layout in each case but show the layouts at 10x the standard number of iterations for Gephi and Cytoscape. While these layouts completed relatively quickly, they were far from 'finished' in establishing a stable layout. By contrast, we ran Graphia's (v3.0) dynamic layout to a point where layout was near optimal, which took considerably longer. A notable difference between the network tools was in the speed and fluidity of interaction with the graph visualisation, i.e. fps (frames per second) rates, with Graphia being an estimated 3-4 times quicker than Gephi or Cytoscape, and in 3D. The specifications of the computer used for these tests are: Intel Core i7-4930K, 16Gb memory, Nvidia GeForce RTX 2060 Super,