Multitask Learning of Signaling and Regulatory Networks with Application to Studying Human Response to Flu

Reconstructing regulatory and signaling response networks is one of the major goals of systems biology. While several successful methods have been suggested for this task, some integrating large and diverse datasets, these methods have so far been applied to reconstruct a single response network at a time, even when studying and modeling related conditions. To improve network reconstruction we developed MT-SDREM, a multi-task learning method which jointly models networks for several related conditions. In MT-SDREM, parameters are jointly constrained across the networks while still allowing for condition-specific pathways and regulation. We formulate the multi-task learning problem and discuss methods for optimizing the joint target function. We applied MT-SDREM to reconstruct dynamic human response networks for three flu strains: H1N1, H5N1 and H3N2. Our multi-task learning method was able to identify known and novel factors and genes, improving upon prior methods that model each condition independently. The MT-SDREM networks were also better at identifying proteins whose removal affects viral load indicating that joint learning can still lead to accurate, condition-specific, networks. Supporting website with MT-SDREM implementation: http://sb.cs.cmu.edu/mtsdrem


Supplementary Methods
Optimizing the MT-SDREM objective Using α we maximize the following objective: Here n t is the number of conditions the TF t is predicted to regulate, where T c is the list of TFs predicted to regulate the time series for condition c, P c t is the set of paths that start from a source of condition c and end in TF t, h p is the weight of the path which is defined as the multiplication of the probabilities of the edges in the path, s tc is the score of the TF t for condition c, and I(p) is an indicator function indicating whether path p is satisfied or not (a path is satisfied if all the edges in the path are oriented in a direction that links the source to the target).
For α ≥ 1 the objective above would prefer selecting joint TFs to equally explanatory TFs that are not shared. Thus α represents a trade-off between fitting individual networks (specifically, α = 1 means that we are back to our condition independent network learning) and learning a single joint network (very high values of α will lead to the selection of the same TFs for all networks). Note that the n α t factor implements the ρ function (for regularizing between tasks). The procedure to select an appropriate value for α is described later.
We use a greedy algorithm to optimize the objective. We randomly select a direction for every edge that has conflicting direction, i.e. it is present in opposite directions in two different pathways. We then do a local search to arrive at a local minimum. We flip the directions of the conflicting edges, always choosing the flip that increases our objective by the highest amount until we can't find any flip that would still improve the objective. This approach is similar to SDREM's approach which has been shown to work well, both on real and on simulated data [1].
After we optimize the above objective, we obtain a single oriented network for each condition. We then use that network to obtain new priors for TFs for DREM. First we compute the weights for the TF for each condition using the equation where t is the TF and P c t is the set of selected paths for condition c that end in TF t. To normalize these scores, we further run the above orientation procedure L number of times, each time with an additional set of randomly selected TFs which are not predicted to regulate any of the conditions. We use the random score to adjust the score for the predicted TF [1].

Detailed description of the algorithm
We constructed a probability for each protein-protein interaction (ppi) using the formula 1 − Π n i=1 (1 − p i ) where p i is our confidence that the ith experimental evidence for the ppi is a true positive. The network so constructed has in the HGNC symbol naming scheme, 228,159 edges and 16,671 proteins with maximum degree 9,368 and average degree 27.372.
1. First, we pre-process the data. Let S = ∪ c∈C S c where S c is the set of sources for condition c. We exhaustively search for all simple (non-cylic) paths from all sources in S to all TFs in our protein interaction network. The weight of a path is defined as the multiplication of the probabilities of the edges in the path (actual calculation done by summation in log space). We select and keep the top k paths by weight. Denote this set of paths P and let the weight of path p be h p .
2. We run DREM (Dynamic Regulatory Events Miner) on the time series data for every condition individually with default prior π tg = 0.5 for all TF-gene interactions. DREM is based on an input output Hidden Markov model and annotates the various time points of the time series with TFs that are supposed to be regulating the genes at those time points.
3. We extract a list of TFs from the results output by DREM and assign them a score in the same manner as done in SDREM. Let T c be the set of TFs for condition c and let s tc be the score of TF t in condition c. Let n t be the number of conditions TF t is present in. Let T = ∪ c T c and T all be the set of all possible TFs. In addition, the score for a TF is increased by multiplication with the n α t factor discussed in the previous section. As mentioned before, this factor is how we implement the ρ function of the objective. 4. We create the TF sets T i = T ∪ T r where T i r is a randomly selected set of TFs from T all \T such that |T r | = |T |. We create L such sets. In addition we also create the set T L+1 r = ∅ and thus T L+1 = T . 5. For every TF t ∈ T i , 1 ≤ i ≤ L + 1, we then compute f ti = c p h p · s tc where the path p ends in TF t and p has edges that are of the same orientation as those reached in optimization problem i.
6. We then create a list M consisting of all the f ti so computed and sort that list according to the f ti .
7. Then for a TF t ∈ T , if f t,L+1 is in the 80th percentile of list M , we increase it's prior via the formula π new tg = (π tg + 1)/2 for all genes g. In addition, if f t,L+1 is greater than the node threshold parameter, the prior is also increased similarly. If neither of the two conditions hold, we decrease the prior via the formula π new tg = max(0.01, π tg /2) 8. We run steps 2-7 for 10 iterations which is the default number of iterations of SDREM.
Learning parameters for the multi-task objective For handling parameters for the SDREM component, we refer the reader to [1]. We use the default provided parameters for SDREM inside of MT-SDREM. MT-SDREM adds the α parameter to the set of parameters. α encodes our prior on how much the given conditions are related. One way to choose α is to perform cross validation on say the number of RNAi hits one obtains in the top k ranked genes. Another, which is what we use here, is to look for an α which is in a stable region -i.e. perturbing it would not change the results in terms of the TFs that we extract. We found that α = 6 achieves such stability and we thus use that for all our experiments. Results for other α close to 6 are provided in the Supplementary Results.

Constructing the joint signaling network
To construct the joint signaling network in Figure 1 of the main text, we took the top 200 source, intermediate, and target (TFs predicted to regulated the condition's gene expression) proteins from each of the 3 conditions. We then looked at the set of paths from a source protein to an intermediate protein to a target protein and computing the condition-specific path flow for each protein (see Materials and Methods for details on how to compute it given a set of paths). We also computed the path flow for each edge between the selected proteins. Edge path flow was computed in a similar manner to node path flow by summing path scores for all paths containing that edge. We then only selected nodes which had a flow of at least 1000 and edges which had a flow of at least 200.

Kemeny-Young Method
The Kemeny-Young method is used to merge several different ranked lists r 1 , . . . , r n into a single ranked list R. R is constructed so as to minimize the total number of conflicts with the lists r 1 , . . . , r n . For example, if r i ranks gene A above gene B, but R ranks gene B above gene A, that counts as 1 conflict. The sum of such conflicts between R and r i , 1 ≤ i ≤ n is what is minimized by the Kemeny-Young method.

Regulatory networks for H3N2 and H5N1
The regulatory networks for H3N2 and H5N1 are presented in this section. The discussion of the figures is in the Results section of the main text. D E F G H Figure S1. The H3N2 regulatory network is presented. The paths represent the different gene sets which are coexpressed. The TF lists are the TFs predicted to regulate the path that they are connected to. Figure S2. The H5N1 regulatory network is presented. The paths represent the different gene sets which are coexpressed. The TF lists are the TFs predicted to regulate the path that they are connected to. Table S1. Strain-specific protein list

RNAi screen hits
We compiled primary hits from three genome-wide [2][3][4] and two targeted [5,6] RNAi screens. The studies varied in the type of human cells, viral strain, functional readout, and other experimental conditions as reviewed in [7]. Brass et al. [2] identified host genes required for viral replication by quantifying the surface expression of the viral hemagglutinin protein. Karlas et al. [3] measured both viral nucleoprotein levels and released viral particles to probe different stages of the viral replication cycle. König et al. [4] detected host factors that inhibit replication efficiency using a modified H1N1 virus that contained a luciferase reporter in place of the hemagglutinin gene. Shapira et al. [5] first identified human proteins that directly interact with viral proteins and their neighbors in the human PPI network as well as human genes that are differentially expressed following viral infection or related treatments. They expanded this set of candidate genes to include related pathway members, and performed three assays to quantify the impact on viral production and interferon beta levels in response to viral RNA transfection or infection by a mutant H1N1 strain. As in [7], we considered only the primary hits that inhibit viral replication upon knockdown. Bortz et al. [6] focused specifically on human genes that modulate viral polymerase activity, screening only a targeted list of human proteins known to interact with H1N1 viral polymerase, nucleoprotein, or ribonucleoprotein that was further refined based on interaction network properties and molecular function. The host genes' impact on viral polymerase activity was separately assessed in two screens: one that transfected human cells with viral polymerase alone and another that infected cells with the complete influenza virus. We included genes that significantly affected polymerase activity in either screen as RNAi hits and collected strain-specific hits for the independent H1N1 and H5N1 screens. In Figure S3, we present RNAi screen hits comparison between GeneMania, MT-SDREM, and, I-SDREM. GeneMania outputs a list of genes predicted to be associated with a set of seed genes. We again gave the list of source proteins as the seed genes. Since GeneMania doesn't include the seed genes in its resultant ranking, we compared the ranking output by MT-SDREM and I-SDREM with the source proteins removed, against the results output by GeneMania. The results are in Figure S3. GeneMania gets only 7 RNAi screen hits in its top 100 genes. I-SDREM and MT-SDREM do considerably better getting 21 and 23 genes respectively.
In Table S3, we give the the screen hits overlap for the top 100 genes as ranked by Endeavour for the various configurations we ran Endeavour in.

Further comparison to oPossum
we have also tested oPossum with the three different DE gene sets (one for each condition). Below we present the condition-specific comparison to oPossum. As highlighted, MT-SDREM finds far more immune response and flu response related TFs compared to oPossum. Table S13. oPossum vs MT-SDREM for H1N1      Table S19. GO comparison between the DE gene list and the MT-SDREM for H3N2 for the top 500 genes. The enrichment was performed using the FuncAssociate tool [9]. Only categories with DE genes adjusted p-value of ≤ 0.001 and MT-SDREM genes p-value of ≥ 0.01 are presented. If a p-value for MT-SDREM is NA, that means that that category was not enriched for in the MT-SDREM list. All immune response related categories are presented.         No GO enrichment at all was found for the top 1000 DE genes for H5N1.  Table S28. The number of strain-specific proteins and their overlap with RNAi screen hits is given.

GO comparison with I-SDREM
Below we present the condition-specific GO comparison with I-SDREM for the top 500 genes. The FuncAssociate tool was used for the GO comparison. While the I-SDREM list gives slightly better GO categories compared to the MT-SDREM list, we'd like to note that the opposite can be true when the DAVID tool [10,11] is used for comparison. There, the MT-SDREM list gives slightly better categories. Thus we conclude that the GO differences between the two tools are not significant.

H1N1 comparison
MT-SDREM finds 163 immune related categories while I-SDREM finds 182 such categories with Fun-cAssociate. Some of those categories are presented below -the full list is on the Supporting website. (By contrast, when using DAVID, MT-SDREM finds 296 immune related categories while I-SDREM finds only 282).