ViralLink: An integrated workflow to investigate the effect of SARS-CoV-2 on intracellular signalling and regulatory pathways

The SARS-CoV-2 pandemic of 2020 has mobilised scientists around the globe to research all aspects of the coronavirus virus and its infection. For fruitful and rapid investigation of viral pathomechanisms, a collaborative and interdisciplinary approach is required. Therefore, we have developed ViralLink: a systems biology workflow which reconstructs and analyses networks representing the effect of viruses on intracellular signalling. These networks trace the flow of signal from intracellular viral proteins through their human binding proteins and downstream signalling pathways, ending with transcription factors regulating genes differentially expressed upon viral exposure. In this way, the workflow provides a mechanistic insight from previously identified knowledge of virally infected cells. By default, the workflow is set up to analyse the intracellular effects of SARS-CoV-2, requiring only transcriptomics counts data as input from the user: thus, encouraging and enabling rapid multidisciplinary research. However, the wide-ranging applicability and modularity of the workflow facilitates customisation of viral context, a priori interactions and analysis methods. Through a case study of SARS-CoV-2 infected bronchial/tracheal epithelial cells, we evidence the functionality of the workflow and its ability to identify key pathways and proteins in the cellular response to infection. The application of ViralLink to different viral infections in a context specific manner using different available transcriptomics datasets will uncover key mechanisms in viral pathogenesis.

Introduction By the end of June 2020 roughly 23,500 scientific publications were released relating to Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) and the disease it causes (COVID-19) [1]. This fast uptake in research efforts is vital to decrease the health and economic impacts of this new pandemic. However, many questions remain unanswered regarding the molecular processes driving host responses to this coronavirus. One key challenge to utilisation of new findings is that published datasets are mostly unlinked to each other (due to parallel efforts by multiple research groups) and not always connected to community standard resources. An integrated and reusable method to interactively capture new data and connect it to existing data sources is needed. Such a comprehensive approach that can be run regularly when relevant new data is available, will increase and update our understanding of the mechanistic details of the SARS-CoV-2 infection. Further, it will aid drug target discovery by enabling identification of high confidence mediators through which the virus is affecting host cells [2]. Studying the effect of the virus at molecular level may explain the variety of clinical manifestations of the infection and the differences in susceptibility between different populations, and together with soon available human genomics data, could be used for identifying risk factors.
Upon entry of a virus into a human cell via surface receptors, viral RNA is released and translated into proteins [3]. In addition to their role in direct viral replication, these proteins are able to bind to human proteins creating a host-virus interface [4]. This interaction can lead to downstream signalling changes in the host cell, either as a result of viral hijacking or through a defined viral immune response by the host cell [5]. Ultimately, this signal flow results in intracellular gene transcription changes, cell-cell signalling and systemic host responses which drive the tug-of-war between the host and the virus [6]. In order to understand and control this conflict, it is necessary to study each of these levels of host response in detail, including the intracellular response of the primarily infected cell.
Currently available data relating to intracellular SARS-CoV-2 infection includes human binding partners of viral proteins [4,7,8] and transcriptomics datasets from infected cell lines [9][10][11], infected patients [12,13] and other infected animals [11,14]. Interdisciplinary and collaborative science can maximise the value of each of these datasets through data integration and comparison combined with application of different computational analysis approaches. One such computational analysis method is the utilisation of network approaches to model molecular interactions between the virus and human proteins as well as within and between human cells [15]. Network approaches have already been applied to study SARS-CoV-2 pathogenesis and to predict drug repurposing candidates and master regulators based on proteins in proximity to human binding proteins (which physically associate with SARS-CoV-2 proteins) [16][17][18][19][20].
Here we present a systems biology workflow built on our previously published resource MicrobioLink, which studies the effect of viral infections on host cells [21]. ViralLink reconstructs and analyses a causal molecular interaction network whose signal starts with the binding of an intracellular viral protein to a human protein, travels via multiple signalling pathways, and ends at the transcriptional regulation of altered genes. Subsequently, the workflow investigates the causal network using betweenness centrality measures, cluster analysis, functional overrepresentation analysis and network visualisation. Using currently available datasets from SARS-CoV-2 infected bronchial epithelial cells we demonstrate that this workflow can identify biologically relevant signalling pathways and predict key proteins for potential drug interventions. The workflow is built in a modular, standardised and updateable fashion and requires only limited programming ability to run. ViralLink can be applied easily to new SARS-CoV-2 related datasets or datasets from other viruses, to study the effect of the virus on host signalling and regulation in diverse contexts-including different cell types, patients and viral strains.

ViralLink workflow overview
The ViralLink workflow investigates the effect of viral infection within cells by generating and analysing context-specific networks of intracellular signalling and regulatory molecular interactions. These networks link the intracellular binding of viral and human proteins to the transcriptional response of the infected cell (Fig 1). The context-specificity of the analysis is obtained through the choice of input transcriptomics datasets-it could refer to strain of virus, type of infected cell, severity of infection, age of host or any other context of interest. By default, the workflow is set up to analyse the intracellular effects of SARS-CoV-2, requiring only transcriptomics counts data as input and thus encouraging and enabling rapid multidisciplinary research. However, the wide-ranging applicability and modularity of the workflow facilitates customisation of viral context, a priori interactions and analysis methods. ViralLink contains three primary stages: 1) collection and input of data; 2) reconstruction of the network;

PLOS COMPUTATIONAL BIOLOGY
ViralLink: A workflow to study intracellular signalling response to viral infection and 3) investigation of results using functional analysis, clustering, centrality measures and visualisation.

ViralLink workflow methods
Collection and input of data. Reconstruction of causal networks using ViralLink requires four separate input datasets (Fig 1): viral protein-human binding protein interactions, a priori human protein-protein interactions (PPIs), a priori human transcription factor (TF)-target gene (TG) interactions and an unnormalised counts matrix from a gene expression experiment. By default, all data except the transcriptomics counts are provided automatically. However alternative input files can be provided if desired.
The default workflow uses SARS-CoV-2 protein-human binding protein interactions obtained from IntAct [22,23]. This data was reformatted to contain one row per molecular interaction with 2 columns of UniProt IDs: SARS-CoV-2 proteins and human binding proteins. Alternative viral-human PPIs can be provided using the same data format. The workflow assumes all viral-human interactions have an inhibiting action on the human protein, unless a third column named "sign" is present in the input file containing "+" for activatory and "-" for inhibitory interactions. In addition, data is provided with the workflow containing the gene names corresponding to each of the SARS-CoV-2 proteins, to enable easy interpretation of the reconstructed networks.
For a priori human interactions, the workflow obtains and uses integrated collections of PPI and TF-TG interactions from OmniPath and DoRothEA, respectively [24,25]. These interactions are obtained using the 'OmniPathR' R package [24,26] to download and filter signed and directed interactions. For DoRothEA, only high and medium confidence level interactions are used (confidence scores A-C). In contrast to importing static input files, this script enables the use of up-to-date interaction data. Alternative interaction data can be used with the workflow provided it has the same format: specifically, it must contain source and target UniProt IDs in the columns 'to' and 'from' and if the transcriptomics data uses gene symbols, the interaction data must additionally contain gene symbols in the columns 'source_genesymbol' and 'target_genesymbol'. Furthermore, the interactions must be directed and signed with the sign of the interaction given in the column 'consensus_stimulation' where the value '1' represents a stimulation and anything else represents an inhibition.
The aforementioned a priori interactions are contextualised using transcriptomics data from any study of interest which compares viral infected to uninfected human cells or tissues. Correspondingly, the workflow requires unnormalised counts data from a transcriptomics experiment (containing UniProt or gene symbols as IDs) and a corresponding mapping table which lists the sample IDs (from the headers of the counts table) in the 'sample_name' column and the 'test' or 'control' status of the sample in the 'condition' column. This mapping table is used to carry out differential expression of a test condition (e.g. infected) compared to a control condition (e.g. uninfected). An example expression dataset and mapping table are provided with the workflow.
To process the transcriptomics data, the workflow uses 'DESeq2' in R to normalise the counts and to carry out differential expression analysis [27]. Any genes passing the log2 fold change and adjusted p value cut-offs, based on the provided parameters (default 1 and 0.05, respectively), are classed as differentially expressed genes (DEGs). Following removal of all genes with count = 0, normalised log2 counts across all samples are fitted to a gaussian kernel [28]. All genes with expression values above mean minus three standard deviations are considered as expressed genes. Subsequently, context-specific human PPI and TF-TG interactions are generated by filtering only interactions where both interacting molecules are expressed.
File paths to all input datasets and associated parameters (such as desired log2 fold change cut off) are specified in the parameters text file which is read in by the workflow.
Network reconstruction. The reconstructed causal network contains three layers of interactions, which are obtained, by default, from the three a priori interaction resources: • Viral proteins interacting with human binding partners: from the SARS-CoV- 2  A list of all TFs targeting the differentially expressed genes are obtained from the contextspecific TF-TG interactions. The human binding proteins of viral proteins are connected to the listed TFs through the context-specific human PPIs using a network diffusion approach called Tied Diffusion Through Interacting Events (TieDIE) [29]. As inputs for the TieDIE tool, the following information is used: 1) The signed, directed and expression based filtered PPIs is used as the input network. 2) Human proteins which are interacting partners of the viral proteins are used as the start nodes. The number of viral proteins bound to each of the human proteins are assigned as the weights of the start nodes.
3) The TFs of the DEGs in the dataset are used as the stop nodes. The weights for each of the TFs in the set of stop nodes were calculated using the following formula (Eq 1) which considers both the log2 fold change of the DEGs as well as the sign (i.e. stimulatory or inhibitory) of the relationship between the TF and the DEG.
After running TieDIE, a custom R script is used to collate all the data into a final viral-initiated intracellular signalling network (causal network), outputting an edge table representation of the network, with a node table containing additional node annotations. Starting with the interactions output from TieDIE, viral protein-human binding protein interactions are added for each of the present human binding proteins. Similarly, TF-TG interactions (where the TG is a DEG) are added for each of the present TFs, creating a full network with three interaction types: SARS-CoV-2 protein-human binding protein, PPI and TF-DEG. All nodes of the network are added to a node table with annotations including heat values (output from TieDIE), Entrez IDs (obtained in R using the 'org.Hs.eg.db' package), gene symbols (obtained from UniProt [30]) and log2 fold change values from the differential expression analysis.
Network investigation. Following reconstruction of the causal network, ViralLink provides functionality to investigate the results using functional analysis, clustering, centrality measures and visualisation.
Centrality measures. To identify key molecules in the reconstructed network ViralLink uses a betweenness centrality measure-calculating the global importance of a node (in this case a protein) based on the number of shortest paths which pass through them when connecting all node pairs in the network [31]. Nodes with high betweenness centrality play a key role in transduction of signals through the network, and here represent proteins with biological importance in the cellular response to viral infection. Betweenness centrality is calculated for each node in the causal network using the R package 'igraph' and output as an annotation in the node table [32]. Alternative centrality measures are available using the 'igraph' package and can be integrated into the workflow by the user if required.
Functional analysis. To further investigate important cellular functions and signalling pathways directly affected by the virus of interest, ViralLink carries out functional overrepresentation analysis on different parts of the causal network: 1. The DEGs of the network 2. The upstream human proteins (including human binding proteins, intermediary signalling proteins and TFs)

Identified clusters (only those with � 15 nodes are investigated)
Functional overrepresentation analysis is carried out in R using packages 'ClusterProfiler' (for Gene Ontology annotations [33]) and 'ReactomePA' (for Reactome annotations [34][35][36]. For analysis of the upstream human signalling proteins and analysis of clusters, all proteins in the context-specific human PPI interactions are used as the background. For analysis of the DEGs, all target genes in the context-specific human TF-TG interactions are used as the background. For Gene Ontology (Biological Process) analysis (except when running the 'compareCluster' command), the 'simplify' command is used (select_fun = min) to remove redundant functions. All functions with q value � 0.05 are considered significantly overrepresented.
Furthermore, ViralLink also implements a network-based pathway enrichment analysis on the upstream human proteins of the network using the ANUBIX (Adaptive NUll distriButIon of X-talk) algorithm [37]. Here, limitations of overrepresentation analysis, such as the assumption of gene independency, are overcome by integrating knowledge of gene/protein associations using networks. Specifically, upstream human proteins of the reconstructed causal network are tested for enrichment of ANUBIX-provided Reactome and KEGG pathways using the context-specific human PPI interactions as the input network. All functions with q value � 0.05 are considered significantly overrepresented.
An additional R script is provided alongside the workflow which creates subnetworks of the causal network based on functions of interest. These function-specific subnetworks highlight how specific signalling pathways in the infected cell reach (and subsequently affect) specific functions of the DEGs. For example, the subnetwork could be created to show how viral proteins can affect different host toll-like receptor pathways, and how these pathways can ultimately affect DEGs associated with interleukins. In this network the DEG nodes would be replaced with nodes representing the interleukin functions (which must be overrepresented based on the functional analysis). This script requires the output files from the functional overrepresentation analysis, the node and edge tables of the causal network and a file of all UniProt IDs associated with all Reactome functions (which is provided with ViralLink, following download from the Reactome website in April 2020). In addition, the script requires a list of overrepresented DEG functions (Reactome) and a list of upstream signalling functions (Reactome) to visualise. The script outputs an edge table, a node table and a Cytoscape file.
Visualisation. Data visualisation is often an important part of biological network interpretation, providing new insights into the data and visually conveying analysis results [38]. As such, ViralLink has the capability to import reconstructed networks into the open-source Cytoscape network visualisation software [39,40]. Specifically, the workflow employs the 'RCy3' R package to interact with Cytoscape programmatically, importing the node and edge tables to create network visualisations and saving the data as a '.cys' file. The causal network, the network clusters (where containing � 15 nodes) and the function-specific networks are visualised in this way. If calculated previously, the causal network nodes are coloured based on their betweenness centrality, however further style and layout customisation must be carried out by the user directly based on the data.
Cluster analysis. Clustering algorithms are commonly used in network biology to investigate the complex structure of molecular interaction networks by extracting groups of densely connected molecules [41,42]. Depending on the number of molecules included, a cluster can represent a molecular complex or a group of molecules which function closely with each other. Cluster analysis can identify subsets of a large network with specific functions and indicate molecules that may have functional redundancy with each other-potentially having implications for drug targeting. ViralLink employs the MCODE clustering method to identify groups of densely connected nodes in PPI networks [41]. To carry out this analysis, ViralLink requires the Cytoscape software [39,40], which is controlled programmatically using the R package 'RCy3' with the Cytoscape 'MCODE' app (v1.6.1) [43]. MCODE is run using default parameters: degree cut off = 2, haircut = TRUE, node score cut off = 0.2, k-core = 2, max depth = 100. This analysis outputs the data as node annotations in the node table, which are used for the functional analysis and visualisation steps of the workflow.

Implementation
The workflow consists of modular R and Python scripts which can be run in three separate ways: through a Docker container, through the provided Python wrapper script or as separate scripts for each step of the workflow. The Docker implementation is recommended to users as it does not require installation of R or Python or any of the necessary packages. The Python wrapper script and separate scripts are provided for more advanced users to enable bespoke analyses. If running ViralLink for the study of SARS-CoV-2, the only required input files are related to the transcriptomics data of interest: a raw counts table (using gene symbols or Uni-Prot protein IDs) and a two-column metadata table specifying test and control sample IDs. One further script is provided to generate function-specific networks. This script is not included in the Docker container or Python wrapper because it requires the user to specify functions of interest from the output of the functional analysis. The only file the user needs to edit is the parameters text file where input file paths and parameters are specified.

Use case: SARS-CoV-2 infection of lung cells
Collection and input of data. To demonstrate the application of this workflow for the study of SARS-CoV-2, we created intracellular signalling networks of NHBE cells (from Normal Human Bronchial/tracheal Epithelial cell lines) upon infection with SARS-CoV-2 based on data published by Blanco Melo et al. [11] and viral-human binding protein interactions published by Gordon et al. [4].
Network reconstruction. The resulting causal network contains 804 nodes (molecules) and 5423 interactions (Fig 2A and S1 and S2 Tables and S1 Data).
Network investigation. Centrality measures. The 10 most central proteins of the reconstructed causal network (based on betweenness centrality) are involved in a wide range of cellular functions (Fig 2B). Taken together these proteins highlight the propensity for SARS-CoV-2 to affect cell proliferation, apoptosis, cell adhesion, exocytosis and proinflammatory immune responses. These functions are influenced through multiple cellular pathways, most notably MAPK/ERK and PI3K/AKT signalling pathways.
Functional analysis. Functional overrepresentation analysis identified an enrichment of interleukin and interferon related functions among the network DEGs, in line with previously published findings (S1 Fig and S2 Data) [12,46,47]. However, the primary value of the Viral-Link output, in comparison to a basic evaluation of SARS-CoV-2 induced DEGs, lies within the upstream signalling proteins of the reconstructed causal network. Overrepresented functions and pathways of the upstream signalling proteins (human binding proteins, intermediary signalling proteins and TFs) included innate immunity-related functions, platelet signaling, PI3K/AKT signalling, MAPK activation, estrogen receptor-mediated signalling, senescence and a number of growth factor receptor-associated functions (such as VEGF signalling, receptor tyrosine kinases, stem cell growth factor signalling (SCF-KIT) and neurotrophin receptor signaling) (S2 Data and S3 Data). Of these functions, only innate immunity-related functions and MAPK regulation were also identified in overrepresentation analyses within the original publication of the infected NHBE cells [11]-thus highlighting the added benefit of applying ViralLink to DEG lists.
Further, network-aware KEGG pathway analysis identified a number of viral-associated pathways (such as viral carcinogenesis and Epstein-Barr virus infection), indicating a similarity

PLOS COMPUTATIONAL BIOLOGY
ViralLink: A workflow to study intracellular signalling response to viral infection between cellular responses to SARS-CoV-2 and other viruses and validating the context-specificity of the reconstructed causal network (S3 Data). Moreover, network-aware analysis identified overrepresentation of the Reactome function 'negative regulators of DDX58/IFIH1 signalling'. Importantly, DDX58/IFIH1 signalling results from cellular sensing of cytoplasmic viral nucleic acids and leads to anti-viral innate immune responses such as the production of type I interferon. Many different viruses have been shown to repress this system through varying molecular mechanisms, including hepatitis C and influenza A and B [48]. Of the 11 proteins in the causal network associated with this function, two are theoretically capable of binding to SARS-CoV-2 proteins based on interactions published by [4] (TANK Binding Kinase 1, TBK1 and NLR family member X1, NLRX1). Although further investigation is required, this finding indicates a possible mechanistic explanation for the poor induction of type I interferon response seen in SARS-CoV-2 infections [11,49,50]. Taken together, we show that ViralLink can highlight additional pathways through which SARS-CoV-2 could be affecting the lung epithelial cells, which cannot be identified by looking at the transcriptomic results in isolation.
Visualisation. Based on functional overrepresentation analysis, we created a functionspecific network by subsetting the causal network. This visualisation was used to further explore the mechanisms of how specific signalling pathways are affecting the DEGs (S2A Fig  and S4 Data). Specifically, we generated an innate-immunity associated subnetwork containing all upstream human signalling proteins associated with Reactome functions cytokine signalling in immune system, signaling by interleukins and MyD88-independent TLR4 cascade and all overrepresented functions of the DEGs (in place of the DEG nodes). These pathways contain 9/10 of the top betweenness centrality nodes (all except RHOA), evidencing the centrality and importance of the innate immune response to viral infection. Inspecting the TF layer of this immune subnetwork, we find a number of key TFs including STAT proteins (3 and 4), IRF proteins (1 and 5) and NFKB-related proteins (NFKB1, NFKBIA).
Cluster analysis. Finally, we evidenced the application of MCODE clustering analysis using the reconstructed SARS-CoV-2-infected NHBE cell causal network. We identified four clusters containing 15 or more nodes, making up 19% of the network (154/804) (S2B Fig and S2 Table and S1 Data). Assuringly, 9/10 of the top betweenness centrality nodes were included in these four clusters, further confirming the high connectivity and importance of these nodes in the causal network. Functional overrepresentation analysis of the cluster nodes highlighted a functional similarity between all four of the clusters (S2C and S2D Fig and S2 Data). Likely this is due to the high number of inter-cluster molecular interactions and because of the functional similarities between the top central nodes.
Collectively, we show that our systems biology workflow, ViralLink, reconstructs a functionally relevant intracellular signalling network affected by SARS-CoV-2 infection. Investigation of the networks through functional analysis, centrality measures and cluster analysis, combined with network visualisations, enables detailed study of the key proteins and pathways involved in signal transduction.

Availability and future directions
Infection by SARS-CoV-2 can cause a complex and systemic response by the human body. As such, a better mechanistic understanding of the effects of SARS-CoV-2 will aid identification of effective drug treatments and help to explain the differences in susceptibilities across different populations [51]. This understanding can be gained using cross-disciplinary approaches which combine 'omics data generation, computational systems biology and validatory web lab experiments [52]. Here we present a computational workflow that can be used to model the cellular response to infection by integrating knowledge of human binding proteins of viral proteins with the transcriptional response of a cell/cell type. Whilst set up primarily to run analyses based on SARS-CoV-2, ViralLink can be applied to any viral infection, provided data is available describing possible interactions between the viral proteins and human proteins.
ViralLink builds on our previously published resource MicrobioLink, which reconstructs networks representing the effect of extracellular and intracellular microbial proteins on cellular processes [21]. Differing from MicrobioLink, ViralLink inputs a predetermined list of viralhost PPIs and focuses only on pathways ending in transcriptional regulation: thereby reducing the complexity of the workflow (for accessibility and speed purposes) and increasing its predictive confidence. Furthermore, ViralLink extends the functionality of MicrobioLink with more advanced network analysis (functional enrichment, clustering and centrality measures) and visualisation options.
By exploiting previously collated and comprehensive collections of molecular interactions [24,25], ViralLink predicts how signal flows from the initial interaction with a viral protein or protein fragment to the ultimate transcriptional changes induced by the virus. Through mapping the direct intracellular effect of viral infection (using a network approach), this workflow enables further investigation into specific signalling pathways and transcription factors which play a key role in signal transduction. Signalling pathways are primarily regulated through post-translational modifications and thus cannot be directly measured using transcriptomics datasets [53]. ViralLink overcomes this problem by predicting signalling using a priori molecular interactions, a diffusion algorithm and transcriptomics data. However, this approach is limited by a lack of proteomics or phosphoproteomics data to validate predicted signalling pathways and by possible bias of the a priori molecular interactions [54]. Moreover, some of the input genes to ViralLink are often excluded from the output intracellular networks due to a lack of identified upstream pathways. On the other hand, this approach permits identification of differentially regulated genes that are likely affected as a direct result of viral recognition by protein-protein signalling pathways, rather than by secondary signals such as elevated cytokine levels. This permits a more focused analysis of possible drug targets and adds to the understanding of viral pathomechanisms. Finally, the ViralLink workflow employs functional analysis and visualisation methods to aid interpretation of the generated intracellular networks, enabling detailed investigation of key proteins and signalling pathways. Regarding functional analysis, both overrepresentation analysis and a network-aware pathway algorithm called ANUBIX are employed, considering Reactome, Gene Ontology and KEGG annotations [37,[55][56][57]. This varied approach is taken to avoid biases due to variability in results output using different functional analysis tools and functional databases [58].
Due to the modularity of the workflow, it can be easily adjusted or extended. For example, different diffusion and propagation algorithms, such as HotNet2 [59,60], could be implemented as required. The implemented diffusion tool, TieDIE, adds mechanistic value by including only logically coherent paths calculated based on the signs of interactions and activity of source and target nodes. On the other hand, due to requiring signed interactions, this feature results in a reduced possible set of input a priori interactions for TieDIE. If desired, a diffusion tool which does not need signed a priori interactions can be implemented to increase the input dataset size. Alternatively, a different method, such as an integer linear programming approach which identifies paths based on an optimisation problem (as implemented in CAR-NIVAL), could be used for network reconstruction [61]. In addition, integration of CARNI-VAL could extend the workflow to permit network reconstruction without supplying upstream perturbations (in this case the viral-host protein interactions). Whilst not currently integrated due to data availability issues, the addition of phosphoproteomics data to the pathway propagation methods could improve the prediction of active pathways [62] Alternatively, methods to predict protein activity based on transcriptional signatures, such as VIPER and PROGENy [63,64] could be added to the workflow in addition to network diffusion methods to increase the confidence of pathway predictions. Finally, extension of the network to include additional regulatory molecule types (e.g. miRNAs) or to study non-human hosts, could uncover further mechanisms by which SARS-CoV-2 can affect host cells.
Accessible through GitHub, the workflow requires only the installation of Docker (www. docker.com) enabling user to run ViralLink with only a limited programming ability. At a minimum, only two user specified input files are required: a raw counts table from a transcriptomics study (using gene symbols or UniProt protein IDs) and a two-column metadata table specifying test and control sample IDs. All other files are provided or acquired directly within the workflow-but can be changed by the user if required. However, one limitation of the current workflow is that only basic visualisation is possible programmatically, due to challenges applying one visualisation strategy to all possible output networks, especially with regard to the function-based networks.
In addition to accessibility through a default emphasis on SARS-CoV-2, a key strength of this workflow is the ability to use different input datasets: including different a priori molecular interactions, viral-human binding protein interactions and expressed/differentially expressed gene lists. This allows extensive customisation and permits rapid implementation to the most cutting-edge data soon after publication. Running the workflow across different transcriptomics datasets will allow comparison of intracellular viral responses between different cell types, different species and across different conditions (such as severe vs asymptomatic infection). For example, application of the workflow to transcriptomics data from specific immune cell-types, such as macrophages, will likely uncover different host affected signalling pathways and key TFs based on the infected cell-type. This, in turn, could increase our understanding of the role of different immune populations in fighting the infection. In addition, the workflow can be run on data from other SARS-CoV-2 strains when and if they emerge, thereby aiding comparisons of mechanisms of action between the strains.
To evidence the use of this workflow, we applied it to study the effect of SARS-CoV-2 infection in lung epithelial (NHBE) cells using transcriptomics data published by Blanco-Melo et al. [11]. In the resulting causal network, DEGs directly affected by SARS-CoV-2 initiated signalling are associated with functions that are known responses to SARS-CoV-2 and other viral infections [65][66][67][68]. Upstream of these affected genes we identified a number of potentially important signalling pathways relating to classical viral-immune responses, cell survival and cytoskeletal rearrangements and cell adhesion. On the whole, such pathways and functions were not identified when investigating the DEG lists alone, highlighting the added value of ViralLink [11]. Previous investigation of the first SARS coronavirus (SARS-CoV) identified an inhibition of cell proliferation and an increase in apoptosis regulated to PI3K/AKT signalling [69,70]. Our network of SARS-CoV-2-initiated intracellular signalling suggests that the PI3K/ AKT signalling and the AKT1 protein itself are key mediators of SARS-CoV-2 initiated signal transduction and that apoptosis and cell proliferation pathways are affected by SARS-CoV-2, thus highlighting similarities between the two viruses. However, further experimentation and/ or data curation is required to confirm the direction of change of specific pathways (up-or downregulated) based on the results of the presented workflow. Together our results indicate that SARS-CoV-2 can affect NHBE cells through a variety of signalling pathways which have been previously associated with similar viruses, including growth factor signalling, MAPK/ ERK signalling and PI3K/AKT signalling [67,[69][70][71]. Furthermore, centrality measures and cluster analysis identified proteins which likely play a key role in transduction of these signals and could be good targets for drug treatments.
Several other network reconstruction methods exist which could be and have been applied to study SARS-CoV-2 infections. For example, Messina et al. and Gysi et al. use diffusion algorithms and other similar methods to investigate proteins in close proximity to human binding proteins based on PPI interactions and gene co-expression networks [16,19,20]. Our workflow builds on these approaches by linking viral proteins to DEGs. Through this method we can observe which signalling pathways mediate the effect of the virus on cellular transcription levels, creating a systems level view of cellular changes as a result of the virus. Using the functional analysis methods and network visualisation capabilities of the workflow, it is possible to predict which viral proteins and host signalling pathways can affect specific cellular functions, enabling more focused identification of drug targets. In addition to protein mediators, this method describes TFs which are involved in the cellular response and identifies which DEGs can be affected as a direct result of viral proteins hijacking host signalling and which are affected through a different mechanism. In addition to the presented workflow, at least one other method has been used to reconstruct SARS-CoV-2-initiated intracellular signalling networks corroborating the benefits of such analysis methods [72]. Differing from the here presented approach, this work uses an extended version of the Signaling Dynamic Regulatory Events Miner method to reconstruct the networks, resulting in a more mathematically complex but computationally heavy analysis [73]. Furthermore, the workflow by Ding et al. is a less reusable and accessible workflow because it was designed for a specific analysis.
In conclusion, ViralLink is an easily accessible, reproducible and scalable systems biology workflow to reconstruct and analyse molecular interaction networks representing the effect of the viruses on intracellular signalling. We believe it is the first available integrative workflow for analysing the downstream effects of viral proteins using viral host interactions and host response data. Application of this workflow to study COVID-19 based on a wide variety of conditions and datasets will uncover mechanistic details about SARS-CoV-2 infection of different cell types, providing valuable predictions for wet-lab and clinical validation.

S2 Fig. Function-specific network of SARS-CoV-2-infected NHBE cells and cluster analysis on SARS-CoV-2-infected NHBE causal network. A)
Function-specific subnetwork containing upstream signalling proteins related to the top overrepresented (q value � 0.05) innate immunity-related Reactome functions (cytokine signalling in immune system, signaling by interleukins and MyD88-independent TLR4 cascade) and all overrepresented functions of the DEGs (in place of the DEG nodes). Layers of the network and node shapes same as in