Host Cell Factors in HIV Replication: Meta-Analysis of Genome-Wide Studies

We have analyzed host cell genes linked to HIV replication that were identified in nine genome-wide studies, including three independent siRNA screens. Overlaps among the siRNA screens were very modest (<7% for any pairwise combination), and similarly, only modest overlaps were seen in pairwise comparisons with other types of genome-wide studies. Combining all genes from the genome-wide studies together with genes reported in the literature to affect HIV yields 2,410 protein-coding genes, or fully 9.5% of all human genes (though of course some of these are false positive calls). Here we report an “encyclopedia” of all overlaps between studies (available at http://www.hostpathogen.org), which yielded a more extensively corroborated set of host factors assisting HIV replication. We used these genes to calculate refined networks that specify cellular subsystems recruited by HIV to assist in replication, and present additional analysis specifying host cell genes that are attractive as potential therapeutic targets.


Introduction
Genome-wide screening technologies offer unprecedented opportunities for discovery [1], but each method is imperfect, so that correct calls will be mixed with false positives, and authentic functions will be missed at some frequency, yielding false negatives. For example, three small interfering RNA (siRNA) screens have been reported that interrogated most of the human genes for effects on HIV infection [2][3][4][5], but though these screens identified many cellular factors previously implicated in HIV replication, the overlap between any pair of screens was <7%. The siRNA method has many limitations [6,7]. For a gene to be detected as important during HIV infection, it must meet the following criteria: 1) It must be possible to achieve biologically meaningful reduction in mRNA levels with the siRNA(s) used. 2) The protein must be sufficiently unstable to allow functionally significant reduction over the time course tested. 3) The knockdown must not be toxic. 4) The function targeted cannot be provided by multiple redundant factors. In addition, genes may be called mistakenly due to experimental errors during high throughput analysis or off-target activities of the siRNAs used. Furthermore, siRNAs that do pass all of the above hurdles and affect viral infection may target factors that act only indirectly. Other screening technologies are also fraught with experimental limitations. However, genes identified independently in multiple studies should have a greater chance of being correctly called. Here, we report a meta-analysis of nine genome-wide screens for cellular factors associated with HIV replication.

Genome-Wide Surveys Used in the Meta-Analysis
We analyzed human gene products identified as important for HIV infection in the nine screens presented in Table 1. Three screens (lists 1-3) used transfection of siRNAs to knock down >20,000 human genes, then assessed the efficiency of HIV infection. The König et al. study [3] (list 1) used 293T cells as targets and only examined the steps of uncoating through viral gene expression. This study used a relatively large number of siRNAs per gene and included extensive mapping of the effects of knockdown to steps in the HIV replication cycle. The Brass et al. [2] (list 2) and Zhou et al. [4] (list 3) studies used HeLa cells as targets, and had the advantage of examining all the steps of HIV replication, though with less redundant siRNA coverage. List 4 contains genes near human polymorphisms identified by Fellay et al. as associated with different HIV viral loads in patients [8]. List 5 is composed of genes encoding proteins found in HIV particles that budded out of monocyte-derived macrophages [9]. Lists 6-9 contain raw screening data on binding interactions between HIV proteins and cellular proteins identified using a pull-down mass spectrometry approach (lists 6-8, targeting Nef, Tat, and Rev) or yeast two-hybrid analysis (list 9, targeting IN) [10]. List 10 summarizes interactions between HIV and cellular proteins from the published literature [11]; the depth and quality of the literature is quite variable among these proposed cellular factors, but for this analysis all calls were treated equally. Lists 11 and 12 contain siRNA screen data for two additional viruses (influenza virus studied in fly cells [12], and West Nile Virus studied in human cells [13]), allowing comparison with HIV. Report S1 presents more detailed descriptions of the 12 data sets along with extensive overlap analysis (also available at http://www.hostpathogen.org).

Overview of Genes Proposed to Be Associated with HIV Infection
A total of 1,254 genes were called as important during HIV infection in at least one genome-wide survey (lists 1-9 above), representing about 5% of all human protein-coding genes (using the RefSeq total number of 25,157). One measure of the accuracy of the genome-wide methods is assessing the overlap with genes previously identified in published peer-reviewed studies of HIV (list 10). Comparing the genes called in the HIV interaction database (list 10, 1,434 genes) to those identified in the nine genome-wide surveys (lists 1-9) yielded an overlap of only 257 genes. The union of all genes called in the genome-wide studies and the National Center for Biotechnology Information (NCBI) interaction database (lists 1-10) contains a remarkable 2,393 human protein-coding genes associated with HIV infection, or 9.5% of all human genes.
Were the genes identified in the genome-wide screens (lists 1-9) even enriched at all for previously identified HIV interacting genes (list 10)? The significance of this overlap was assessed by comparison to a random distribution. For the NCBI list of HIV-interacting factors (list 10), 1,434 randomly selected genes were drawn 1,000 times with replacement from the background of all human genes, simulating the NCBI list, and 1,254 random genes were drawn 1,000 times from all human genes as well, simulating the genome-wide list. The overlaps of the 1,000 repetitions of the random draws were quantified and plotted (Figure 1A). The modal number of overlapping genes was 71, and no simulations showed an overlap of 257 or more genes, yielding a highly significant p-value (<0.001). Thus the overlap, though modest, is highly significant. Genes identified in lists 1-10 were analyzed in all pairwise combinations to identify genes in common between each pair (detailed data are in Report S1, pp. 5-70) and the significance tested by simulation or calculated as in [14]. An example is presented in Figure 1B, and the numbers of overlapping genes and their significance for all pairwise combinations of gene lists is shown in Table 2.
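The random-draw comparison described above can be illustrated with a short simulation. This is a minimal sketch, not the code used for the analysis (which was carried out in R): the function names are hypothetical, the seed is arbitrary, and each simulated list is drawn without duplicates within the list, which is an assumption about how the draws were set up. The list sizes (1,434 and 1,254), the background size (25,157), and the observed overlap (257) are taken from the text.

```python
import random

def simulate_overlaps(n_background, size_a, size_b, n_draws=1000, seed=0):
    """Overlap sizes between two random gene lists, repeated n_draws times."""
    rng = random.Random(seed)          # fixed seed (arbitrary) for reproducibility
    background = range(n_background)   # genes represented as integer ids
    overlaps = []
    for _ in range(n_draws):
        # Assumption: no duplicate genes within a single simulated list.
        list_a = set(rng.sample(background, size_a))
        list_b = set(rng.sample(background, size_b))
        overlaps.append(len(list_a & list_b))
    return overlaps

def empirical_p_value(overlaps, observed):
    """Share of simulations with overlap >= observed, using the +1 correction
    so that the estimate is never exactly zero."""
    hits = sum(1 for o in overlaps if o >= observed)
    return (hits + 1) / (len(overlaps) + 1)

overlaps = simulate_overlaps(25157, 1434, 1254)
mean_overlap = sum(overlaps) / len(overlaps)
p = empirical_p_value(overlaps, observed=257)
print(f"mean simulated overlap: {mean_overlap:.1f}; empirical p: {p:.4f}")
```

The expected overlap under random draws is roughly 1,434 × 1,254 / 25,157, or about 71 genes, which agrees with the modal value reported for the simulation.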

Analysis of Overlap among the Three siRNA Screens for Genes Affecting HIV Replication
The three siRNA screens (lists 1-3) together called 842 genes as diminishing HIV replication when knocked down, or 3.3% of all human protein-coding genes (Report S1, pp. 98-120). A total of 34 genes were called in at least two siRNA screens (Table 3). Three genes were called in all three screens (MED6, MED7, and RELA). The pairwise overlaps were statistically significant (p<0.024 for all pairs of screens), but the percentages of shared genes were quite modest, ranging from 3% to 6%. The Brass et al. and Zhou et al. screens (lists 2 and 3) both surveyed the entire HIV life cycle and studied infection in HeLa cells, and these two share the greatest overlap (6%). The three siRNA screens identified the NCBI genes as 13.3%-18.3% of the total, indicating highly significant enrichment (p<0.001), as reported previously.
We then asked whether further enrichment relative to the NCBI HIV interaction database was achieved by examining human genes identified in at least two siRNA screens. Of the 34 genes on two or more lists, 11 were previously reported in the HIV interaction database (NUP153, CCNT1, CTDP1, CHST1, CD4, CXCR4, TCEB3, JAK1, AKT1, DDX3X, and RELA), or 30% of the total, substantially higher than the 13%-18% identified in each single list alone. From this we infer that the newly identified genes called in two or more siRNA screens (Table 3) are more likely to be authentic new cellular cofactors for HIV infection. Twenty-nine out of the 34 genes were found to be expressed in cells or tissues expressing CD4 and coreceptor by transcriptional profiling analysis, and so competent for HIV entry. Of the remaining five, CCNT1 (cyclinT1) is known to be expressed in T cells and represents a false negative call in the expression data used. A comprehensive table of all genes identified in pairwise combinations of lists 1-12 is provided at the end of Report S1 together with the expression analysis (pp. 72-98).
Figure 1 legend (continued). The p-value calculated using the hypergeometric distribution was slightly lower (p = 0.014). (C) Simulation of expected overlap between screens given the measured error between replicates. The standard deviation of infectivity measurements was calculated from the König et al. siRNA screens, and then simulated datasets were generated containing the measured error. For simulations, either two replicates (pink) or ten replicates (yellow) were generated and the overlap quantified. The y-axis: number of top-scoring genes considered in overlap analysis; x-axis: actual number of overlapping genes seen comparing simulated data sets. (D) Choices for toxicity threshold strongly influence the recovery of genes affecting HIV infection. The genes tested in the König et al. siRNA screen were ranked according to toxicity of knockdown, then sets containing 100% of genes, the least toxic 50%, or the least toxic 20% were extracted (top). From each of these, the 300 genes that when knocked down showed the strongest reduction in HIV infection were then selected, and the overlap between gene sets calculated (bottom). doi:10.1371/journal.ppat.1000437.g001
Why did the three different siRNA screens yield such different gene lists? One possible explanation could be that the expression of host cell factors differed between the HeLa and 293T cells studied. However, analysis of transcriptional profiling data showed that >93% of the genes called as important for HIV infection in any one of the three studies were expressed in both cell types.
However, variation due to 1) experimental noise, 2) timing of sampling, and 3) different filtering criteria likely does explain some of the differences. Two replicates were available for analysis from the König et al. screen, allowing estimation of the variance. From this, the expected overlap of the top 300 genes in replicate screens could be simulated. A test of two replicates or ten replicates per screen (Figure 1C) yielded 150 or 240 overlapping genes, respectively, illustrating that high variance reduces the overlap and that replication improves it.
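The effect of replicate number on the overlap between two noisy screens can be sketched as follows. This is an illustration under stated assumptions rather than a reconstruction of the published simulation: the number of genes, the distribution of true effects, and the noise level are all hypothetical values chosen for the example, whereas the original analysis used the standard deviation measured from the König et al. replicates. Only the top-300 cutoff and the choice of two versus ten replicates come from the text.

```python
import random

def simulated_screen(true_effects, noise_sd, n_replicates, rng):
    """Average n_replicates noisy measurements of each gene's true effect."""
    return [
        sum(effect + rng.gauss(0.0, noise_sd) for _ in range(n_replicates)) / n_replicates
        for effect in true_effects
    ]

def top_hits(scores, n_top):
    """Indices of the n_top genes with the lowest score (strongest reduction)."""
    return set(sorted(range(len(scores)), key=scores.__getitem__)[:n_top])

def replicate_overlap(n_genes=5000, n_top=300, noise_sd=1.0, n_replicates=2, seed=1):
    """Overlap of the top hits from two independent simulated screens.
    n_genes, noise_sd, and seed are hypothetical illustration values."""
    rng = random.Random(seed)
    true_effects = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
    screen_a = simulated_screen(true_effects, noise_sd, n_replicates, rng)
    screen_b = simulated_screen(true_effects, noise_sd, n_replicates, rng)
    return len(top_hits(screen_a, n_top) & top_hits(screen_b, n_top))

overlap_2 = replicate_overlap(n_replicates=2)
overlap_10 = replicate_overlap(n_replicates=10)
print(f"2 replicates: {overlap_2} shared; 10 replicates: {overlap_10} shared")
```

Averaging more replicates shrinks the measurement error on each gene, so the two simulated screens rank genes more similarly and share more of their top hits.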
A second source of variation was differences between time points analyzed, which varied among the published siRNA screens. Although data were not available for multiple time points for the HIV screens, data were available for a screen of influenza virus infection at three time points (S. Chanda, unpublished data). Analysis demonstrated that variation between time points was of the same magnitude as variation within time points and partially independent.
A third source of variation is likely to be differences in the filtering thresholds used. We investigated the effects of different choices for the toxicity filter by reanalyzing the data of the König et al. screen using three different toxicity thresholds. In the first, no filter was applied (100% of genes were accepted for further analysis); in the second, only genes in the 50% least toxic group were considered; in the third, only the 20% least toxic genes were considered. For each set, the 300 genes with the strongest reduction in HIV infection after knockdown were extracted and the overlap among sets compared (Figure 1D). Fewer than 150 genes out of 300 overlapped between the 100% and 20% sets, and the maximum between any pair was 222 genes, indicating that the final gene set called is very sensitive to the toxicity threshold chosen.
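The toxicity-threshold reanalysis can be sketched on synthetic data. The gene records below are randomly generated placeholders, and the assumed link between toxicity and the infection readout is a hypothetical choice made only so that the filter has a visible effect; neither comes from the König et al. data. The three thresholds (100%, 50%, 20%) and the top-300 cutoff follow the text.

```python
import random

def top_hits_after_filter(genes, keep_fraction, n_top):
    """genes: list of (gene_id, toxicity, infection_score) tuples.
    Keep the least toxic keep_fraction of genes, then return the ids of the
    n_top genes with the lowest infection score among those kept."""
    by_toxicity = sorted(genes, key=lambda g: g[1])
    kept = by_toxicity[: int(len(by_toxicity) * keep_fraction)]
    strongest = sorted(kept, key=lambda g: g[2])[:n_top]
    return {g[0] for g in strongest}

rng = random.Random(2)  # arbitrary seed
genes = []
for gene_id in range(5000):  # hypothetical library size
    toxicity = rng.random()
    # Assumption: toxic knockdowns also tend to lower the infection readout.
    infection = rng.gauss(0.0, 1.0) - toxicity
    genes.append((gene_id, toxicity, infection))

hit_sets = {f: top_hits_after_filter(genes, f, 300) for f in (1.0, 0.5, 0.2)}
for a, b in ((1.0, 0.5), (0.5, 0.2), (1.0, 0.2)):
    print(f"{a:.0%} vs {b:.0%}: {len(hit_sets[a] & hit_sets[b])} shared genes")
```

Any gene that ranks in the top 300 both without a filter and under the strictest filter must also rank in the top 300 under the intermediate filter, so the 100% versus 20% comparison can never share more genes than the 100% versus 50% comparison.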
Thus variations between replicates, between time points, and in filtering thresholds all likely contributed to the differences between siRNA screens. Further differences also likely arose from use of different siRNA libraries, cell types, and viral strains [5].

Recovery of Well-Documented Host Genes Affecting HIV in Genome-Wide Studies
A variety of well-documented cellular cofactors for infection were identified in two out of three screens, including i) the binding and entry factors CD4 and CXCR4; ii) the NFkappaB subunit RELA; iii) the activating kinases AKT1 and JAK1; iv) the Vpr and Vif cofactor TCEB3/elonginB; and v) the Tat cofactor CCNT1/cyclinT1 [16] (which was also in the mass spectrometry study of Tat-associated proteins). The Rev cofactor DDX3X [17], an RNA helicase, was also identified in two out of three screens and in addition was found by mass spectrometry to bind to both Tat and Rev. The DNA repair factor MRE11 was also identified, which was previously implicated in HIV DNA circularization [18], though effects on HIV infection efficiency have not been reported previously.
A variety of further well-established factors were identified in one siRNA screen only. The well-studied viral budding factor TSG101 [19] was called in the Zhou et al. siRNA screen (list 3), and also identified as associated with HIV particles after release. The Rev cofactor XPO1/CRM1 [20] was called in the Zhou et al. siRNA screen but not the others.
Also instructive is analyzing the known HIV cofactors that were not identified. HLA-B57 and HLA-C have well-documented effects on viral set point and HIV disease progression [8,21], but these were not detected in the siRNA screens, probably because the HLA proteins affect the immune response to HIV and not replication at the cellular level. The integration cofactor PSIP1/LEDGF/p75 was not identified, probably because only very complete knockdowns diminish HIV replication [22][23][24][25][26][27][28][29][30]. Several genes known to encode products important for HIV replication were identified in the initial screen of König et al., which yielded 4,019 candidates, but were not further validated in the filtered data set of 293 proteins. These included Sp1, a transcription factor known to bind the HIV LTR; the HIV Gag binding protein cyclophilin A (PPIA) [31]; and several integrin proteins, believed to assist in virus binding to cells (ITGB1, ITGB2, ITGB3). The ESCRT proteins are known to be important in HIV budding [32], but only VPS24 was identified. Another member of this complex, VPS53, was called in the Brass et al. study, and the initial unfiltered König et al. screen, but not in other studies. The RNA lariat debranching enzyme DBR1 was used as a positive control in the König et al. study, and is well known to affect reverse transcription [33], but DBR1 was not identified in any of the other studies. Thus, the recovery of already implicated host factors was generally good in the overlap analysis, providing confidence about the authenticity of the newly called genes. However, some well-documented factors were missed, indicating that other factors important for HIV replication were probably missed by the analysis.

Network Analysis of the HIV-Host Interactome
To identify the cellular subsystems recruited by HIV in more detail, we assembled a host-pathogen protein interaction network based on the gene products in lists 1-10 (Table 1). The network interaction map took advantage of binary protein binding relationships cataloged in curated literature-based protein-protein interaction databases (e.g., BIND, HPRD, MINT, and Reactome). HIV-host interactions were predicted based on evidence compiled in the NCBI HIV interaction database. The resulting HIV-host network comprised 1,657 cellular proteins that formed interactions with other host cell factors or HIV-encoded proteins (p<10^-5). Two hundred and ninety of these host proteins ("nodes") were supported by experimental evidence from two or more datasets, reflecting a 35% enrichment of proteins that are called by multiple datasets in this analysis. We performed a further analysis to identify unusually dense network neighborhoods within this interactome map using a graph theoretic clustering algorithm (MCODE) [34]. This revealed 11 putative molecular clusters, ten of which could be associated with distinct biochemical or cellular functions.
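The idea of extracting dense neighborhoods from a binary interaction list can be illustrated with a k-core decomposition. This is a deliberately simplified stand-in and not the MCODE algorithm, which additionally weights vertices by local density and grows clusters outward from seed nodes. The node labels and edges below are made-up placeholders, not real protein interactions, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (a, b) binary interactions."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def k_core(adjacency, k):
    """Repeatedly remove nodes with fewer than k surviving neighbours."""
    alive = set(adjacency)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            if len(adjacency[node] & alive) < k:
                alive.discard(node)
                changed = True
    return alive

def connected_components(adjacency, nodes):
    """Split a node set into connected groups, using only edges within the set."""
    remaining, components = set(nodes), []
    while remaining:
        stack = [remaining.pop()]
        component = set(stack)
        while stack:
            for neighbour in adjacency[stack.pop()] & remaining:
                remaining.discard(neighbour)
                component.add(neighbour)
                stack.append(neighbour)
        components.append(component)
    return components

# Toy network: two fully connected groups joined through a sparse linker X,
# plus a pendant node Y. All labels are placeholders.
group_p = ["P1", "P2", "P3", "P4"]
group_q = ["Q1", "Q2", "Q3", "Q4"]
edges = list(combinations(group_p, 2)) + list(combinations(group_q, 2))
edges += [("P4", "X"), ("X", "Q1"), ("Q4", "Y")]

adjacency = build_adjacency(edges)
clusters = connected_components(adjacency, k_core(adjacency, 3))
print(sorted(sorted(c) for c in clusters))
```

In this toy case the sparsely connected linker and pendant nodes are pruned, leaving the two fully connected groups as separate clusters.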

Proteasome
A densely connected network of proteasome subunits was identified by the MCODE analysis (Figure 3A). The proteasome was prominent in the published siRNA screening data, and implicated in probable early steps of viral infection. In previous literature, the proteasome was shown to act negatively on HIV infection by destroying replication intermediates [35,36]. The siRNA data indicate that the proteasome may also facilitate HIV infection. The mechanism is unclear and could be indirect; for example, reducing proteasome activity may alter cellular ubiquitin levels, and so affect HIV replication by altering the free ubiquitin pool.
Figure 2 legend (continued). The color code indicates the number of genes in each functional group from each screen derived using DAVID Functional Annotation Clustering. Annotations for each function group were based on the assessment of GO categories that comprised each group, which can be found in Table S1. doi:10.1371/journal.ppat.1000437.g002

Transcription/RNA Polymerase
Genes for subunits of RNA polymerase II and associated factors were identified in several different screens, yielding a densely connected network (Figure 3B). In some of the siRNA screens, the knockdown of Pol II subunits was mapped to the step of Tat transactivation. Because so many subunits were identified, the simplest interpretation is that reduced dosage of the full complex is responsible for the deficit in HIV replication.

Mediator Complex
Multiple subunits of the mediator complex were identified in two or more siRNA screens (Table 3 and Figure 3C). The mediator complex links transcriptional activator proteins to the RNA polymerase II basal transcription apparatus, thereby allowing transcriptional activation [37,38]. The observation that so many subunits were identified suggests that activity of the complex as a whole is the target of siRNA modulation. Viral replication cycle mapping by Zhou et al. indicated that some of the mediator proteins were needed to support Tat-activated transcription, though studies in König et al. suggest a possible further role in reverse transcription. The data can be accommodated in a model where changes in dosage in the mediator complex are not toxic to cells, but where Tat-activated transcription is extremely sensitive to mediator dosage. Previously, mediator was shown to be important for Sp1-driven transcription, and Sp1 is required for transcription from the HIV LTR, suggesting possible involvement of Sp1 as well.

Tat Activation/Transcriptional Elongation
A dense network was formed containing the Tat cofactor cyclin T1 (CCNT1) [16], which was identified in two out of three siRNA screens and by mass spectrometry (Figure 3D). Together with its binding partner CDK9, which was identified as a Tat binding protein, cyclinT1 forms positive transcription elongation factor b (P-TEFb). The MCODE analysis links the P-TEFb complex and the elongin complex involved in transcriptional elongation. Another factor, the RNA Pol II carboxyl-terminal domain (CTD) phosphatase CTDP1, was also identified and was also previously associated with Tat activation. In addition, two STAT proteins, also involved in transcription and NFkappaB signaling and implicated in lentiviral infection [39], were identified (Figure 3E).

RNA Binding/Splicing
A large cluster of RNA binding and splicing proteins was identified in the MCODE analysis. Eleven of the cellular genes encode protein components of hnRNP complexes (HNR factors) that form on pre-mRNA and direct splicing and other activities. HNRNPU contains both an RNA binding domain and a DNA binding domain that mediates attachment to the nuclear scaffold, potentially linking sites of mRNA synthesis to specific sub-nuclear locations. Six further genes (SF3 factors) encode components of the splicing factor 3 a/b complex, which is involved in activating the U2 snRNP and promoting splicing. Three SNR proteins and two SF proteins were also identified and are implicated in RNA splicing and RNP formation. Several of these proteins were implicated in the literature to modulate Tat or Rev function (e.g., [40][41][42]), and seven direct binding interactions to these viral proteins were identified in the mass spectrometry data reported here.
Several observations also suggest possible connections of RNAP/splicing factors to the viral DNA integration step. Two components of the splicing factor SF3 bound integrase in the yeast two-hybrid data (list 9) [10]. The splicing protein SNW1/SKIIP1 was found by König et al. to be selectively important at the integration step [3]. The integrase-interacting protein PSIP1/LEDGF/p75 appears to tether integrase to active transcription units [23,29,43,44], and an alternatively spliced variant of this protein (p52) is involved in RNA metabolism [45]. Though indirect, these observations suggest a model in which splicing factors may help recruit integrase to active transcription units, which are favored for integration [46][47][48].
Another possible role of splicing factors is in maintaining the proper balance between spliced and incompletely spliced HIV RNAs. HIV replication requires multiply spliced messages (encoding Tat, Rev, and Nef), singly spliced messages (encoding Vif, Vpr, Env/Vpu, and a second form of Tat), and unspliced messages (encoding Gag and Gag-Pol). The unspliced RNA also serves as the genomic RNA. Alterations in dosage of splicing factors by siRNA knockdown may well diminish HIV replication by altering the ratios of the different HIV mRNA forms.

The BiP/GRP78/HSPA5 Chaperone
The BiP/GRP78/HSPA5 protein chaperone was identified in the network analysis (Figure 3G). BiP/GRP78/HSPA5 is a member of the HSP70 family that is involved in the folding and assembly of proteins in the endoplasmic reticulum. BiP has been implicated in interacting with newly synthesized HIV gp160 SU/TM precursor [49], and HSP70 family members have been proposed to interact with Gag, Tat, Vpr, and MA. The MCODE analysis connected BiP/GRP78/HSPA5 to a collection of nuclear proteins involved in splicing (PRPF8, SFPQ, and SNW1), nuclear matrix architecture (MATR3), and ubiquitylation (UBR5). Determining how these cellular proteins modulate the interactions of BiP/GRP78/HSPA5 with HIV proteins offers a potential route to better understanding protein folding and sorting during HIV replication.

The CCT Chaperone
The MCODE analysis identified subunits of the chaperone containing TCP1 (CCT) complex (Figure 3H). Subunits were identified in siRNA screens, in HIV particles after budding, and also as Tat binding proteins. This complex consists of two identical stacked rings of eight subunits. Unfolded proteins are thought to pass through the central cavity, and become folded in an ATP-dependent manner. The CCT chaperone has not previously been associated with HIV replication, and represents a new candidate for involvement in Tat activation and HIV budding.

Additional Densely Connected Clusters
Several further functions were identified, including proteins involved in tRNA synthetase function, transport, and one of unknown function (Figure 3I-3K). The tRNA synthetase and transport complexes contained members that associated with Tat according to the mass spectrometry study, and the unknown complex contained a member binding to Rev, suggesting specific links to HIV replication.

Other Newly Identified Functions
Several sets of proteins were identified that were not called as densely connected networks but appear to be functionally related. The nuclear pore and associated factors were clustered in the initial MCODE network but were not sufficiently densely connected to emerge as a separate cluster. Proteins involved in nuclear import identified in two out of three siRNA screens included products of NUP153, RANBP2, TNPO3, and RGPD8. NUP153 and TNPO3 have been associated with the trafficking of HIV proteins previously [2,3,50,51,52]. RANBP2 is a giant gene encoding a product that accumulates at nuclear pores and binds to RAN, which is a small GTP-binding protein of the RAS superfamily. RANBP2 also contains FG repeats, a cyclophilin-related nucleoporin, and a domain that binds UBC9, the E2 for SUMO1 transfer. RGPD8 is named for "RANBP2-like and GRIP domain containing 8". It too accumulates at the nuclear pore and is believed to assist in RNA and protein transport. The actions of NUP153 and TNPO3 have been mapped to nuclear import of the HIV preintegration complex in [2,3,51], and NUP153 has also been proposed to be involved in export of HIV Rev [53].
Three genes were identified that affect the microtubule system. MAP4 is a microtubule-associated protein that has not previously been studied in detail. MID1IP1 is a regulator of microtubule polymerization. CAV2 (caveolin 2) is involved in the formation of plasma membrane invaginations involved in a variety of cellular functions including signal transduction, cell growth, and apoptosis. Caveolae have also been implicated as interacting with the microtubule network [54]. Previous studies have suggested that HIV particles may traffic along microtubules to reach the nucleus [55]. Thus MAP4, MID1IP1, and possibly CAV2 are candidates for cofactors in this process. Other proteins were also called in two siRNA screens (ANAPC2, DMXL1, HMCN2, and IDH1) but are of unknown function (Table 3).

Identifying New Drug Targets
One of the main reasons for carrying out the screens for host factors is the hope of identifying new targets for HIV therapeutics. Several studies have identified potentially "druggable" human proteins by cataloging families of InterPro domains where one member is the target of one or more small molecule inhibitors with drug-like properties. All members of the family are then proposed as potential drug targets [56,57] (John Hogenesch, data available at http://www1.qiagen.com/Products/GeneSilencing/LibrarySiRna/SiRnaSets/HumanDruggableGenomesiRNASetV30.aspx?ShowInfo=1). In Report S1 (pp. 72-120), we annotate our overlap study for "druggable" targets by these criteria. Focusing on an updated version of the list from Hopkins and Groom [56], we found that eight of the 34 genes called in at least two siRNA screens were called as potential drug targets (Table 3, column labeled "druggable"). Two of these eight are in fact known to be the targets of small molecules with activity against HIV. CXCR4 is the target of AMD3100 and related molecules [58,59], and AKT1 is the target of miltefosine [60]. The fact that two out of eight genes called as druggable are known antiviral drug targets (at least in tissue culture) suggests that this analysis is yielding viable new targets. Of the additional genes encoding candidate drug targets, inhibitors have been reported for the kinases ADRBK1/GRK2 and JAK1. These can be tested for activity against HIV in cell culture. Annotation of the larger collection of genes that were found on two or more lists (lists 1-9) yielded a further 56 genes encoding potentially druggable cellular factors.

Summary
Analysis of genes called as important for HIV replication in multiple genome-wide screens yielded a list rich in well-known factors and also intriguing new candidates. Many important factors were surely missed by this approach, but at least some of the most promising new genes can be distilled from among the 9.5% of all human protein-coding genes now proposed to affect HIV infection. Many of the new genes can be linked into clusters, specifying cellular subsystems associated with HIV replication. Promising drug targets could be discerned among the best-documented new factors.

Methods
Overlap analysis and comparisons to random distributions were carried out using R [61]. The p-values for overlaps between lists were generated by comparison to results of random simulation and by calculation based on the hypergeometric distribution as in [14]. No correction was applied for multiple comparisons. Networks were generated using MCODE analysis on the binary interaction file and plotted in Cytoscape. The global interaction network was judged to be statistically significant by comparison to random simulation (p<10^-5). Subnetworks were selected that 1) contained at least two proteins from different studies and 2) showed high connectivity. The protein-protein interaction data for Nef, Tat, and Rev from the HARC Center were derived using previously described methods (LC MS-MS followed by database matching) [62]. A more thorough analysis of these (early stage) data will be reported elsewhere.
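For readers who want to see the hypergeometric calculation spelled out, the upper-tail probability of an overlap between two gene lists can be written with exact integer arithmetic. This is a minimal sketch in Python rather than the R code used for the analysis; the function name is hypothetical and the numbers in the usage line are a small toy case, not values from the screens.

```python
from math import comb

def hypergeom_tail(n_background, size_a, size_b, observed):
    """P(overlap >= observed) when size_b genes are drawn without replacement
    from n_background genes, of which size_a belong to the other list."""
    total = comb(n_background, size_b)
    upper = min(size_a, size_b)  # the overlap cannot exceed the smaller list
    favourable = sum(
        comb(size_a, k) * comb(n_background - size_a, size_b - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total

# Toy case: 20 background genes, lists of 5 and 6 genes, 3 genes shared.
p_toy = hypergeom_tail(20, 5, 6, 3)
print(f"P(overlap >= 3) = {p_toy:.4f}")
```

Because the calculation assumes that every gene is equally likely to be called, it tests only whether two lists overlap more than random draws would; it does not account for biases shared between screens.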
Updated versions of Report S1 documenting the overlap among genome-wide screens can be found at http://www.hostpathogen.org.