The HIVToolbox 2 Web System Integrates Sequence, Structure, Function and Mutation Analysis

There is enormous interest in studying HIV pathogenesis to improve the treatment of patients with HIV infection. HIV infection has become one of the best-studied systems for understanding how a virus can hijack a cell. To help facilitate discovery, we previously built HIVToolbox, a web system for visual data mining. The original HIVToolbox integrated information for HIV protein sequence, structure, functional sites, and sequence conservation. This web system has been used for almost 40,000 searches. We report improvements to HIVToolbox including new functions and workflows, data updates, and updates for ease of use. HIVToolbox2 is an improvement over HIVToolbox, with new functionalities focused on HIV pathogenesis including drug-binding sites, drug-resistance mutations, and immune epitopes. The integrated, interactive view enables visual mining to generate hypotheses that are not readily revealed by other approaches. Most HIV proteins form multimers, and there are posttranslational modification and protein-protein interaction sites at many of these multimerization interfaces. Analysis of protease drug binding sites reveals an anatomy of drug resistance with different types of drug-resistance mutations regionally localized on the surface of protease. Some of these drug-resistance mutations have a high prevalence in specific HIV-1 M subtypes. Finally, consolidation of Tat functional sites reveals a hotspot region where there appear to be 30 interactions or posttranslational modifications. A cursory analysis with HIVToolbox2 has helped to identify several global patterns for HIV proteins: homomultimerization of almost all HIV proteins, functional sites that overlap with multimerization sites, a global drug resistance anatomy for HIV protease, and specific distributions of some drug-resistance mutations (DRMs) in specific HIV M subtypes.
HIVToolbox2 is an open-access web application available at [http://hivtoolbox2.bio-toolkit.com].


Introduction
There is enormous interest in studying HIV pathogenesis for improving treatment of HIV patients. Currently, most drug therapies specifically target HIV proteins. In fact, HIV infection and replication involve ~24 processed HIV proteins and thousands of host proteins [1][2][3][4][5][6][7][8][9]. As the study of HIV enters its fourth decade, HIV infection has become one of the best-studied systems for understanding how a virus can hijack a cell.
There is now abundant information about HIV protein sequence, structure, function, and evolution. Several databases have emerged that focus on select specific domains of HIV knowledge. From the sequence perspective, the use of sequencing and genotyping as a clinical diagnostic has driven the sequencing of tens of thousands of HIV variants, many of which are collected into databases including the Los Alamos HIV Sequence Database [10,11]. The Protein Data Bank contains more than 1,300 HIV protein structures. And the National Institute of Standards and Technology (NIST) HIV structural database provides several tools for searching HIV drugs and their interactions with proteins [12,13]. These tools allow investigation of drug binding sites. Since HIV has a high mutation rate, many known mutations result in drug-resistant HIV strains. These mutations have been collected into several databases updated in annual reports by the International AIDS Society [14][15][16][17][18].
Several data sources focus on a functional perspective. The HIV Human Protein Interaction Database lists many protein-protein interactions with, and posttranslational modifications of, HIV proteins. More interactions have been identified in affinity capture mass spectrometry experiments [19][20][21]. Multiple high-throughput RNAi screens have identified more than 2,400 host dependency factors (HDFs) involved in HIV replication [2][3][4][5][6][7][8][9]. BioAfrica and the Los Alamos HIV Sequence Database also have several additional tools for assessing different aspects of HIV function [1,10].
Although scientists have accumulated a large amount of data regarding HIV proteins, the use of this data by researchers is limited by graphical user interfaces generally geared toward a focused facet of HIV virology. To address this issue, our laboratory recently released HIVToolbox, a database featuring integrated information about HIV proteins and a web system that presents a unified view of this information to facilitate the study of HIV sequence, structure and function [22]. In several example analyses of HIV-1 Integrase, we demonstrated that broad scale integration of sequence, structure, and functional information into a graphical mining tool can be used to identify new HIV biology [22]. Since publication of HIVToolbox, >37,000 searches have been performed. Here, we report a number of significant updates to HIVToolbox that provide new functionality, with a general focus on antiretroviral (ARV) drugs and immune tolerance. These functions enable many new types of comparisons, which may lead to some novel global perspectives about HIV pathogenesis. Our observations include an anatomy of drug resistance in HIV protease where specific types of drug resistance mutations are localized to specific regions, and many posttranslational modification and protein-protein interaction sites overlapping with multimerization interfaces in HIV proteins. Because Tat has so many overlapping functional sites, HIVToolbox2 can assist with experimental design and interpretation of experiments related to this protein.

Classification of HIV drug resistance
We added a number of new functions in HIVToolbox2. Several are based upon HIV drug-resistance mutations (DRMs). In order to compare functional data for HIV proteins to HIV drugs, we first needed a source of drug-resistance mutations. We obtained 1,571 known HIV-1 DRMs (872 for FDA-approved drugs) from the Los Alamos HIV sequence and Stanford HIV databases, the World Health Organization website, and primary literature [10,23]. Drug-resistance mutations were then consolidated into a SQL database. The literature for each mutation was re-evaluated to classify each mutation into one of seven categories (the names and summary descriptions of the seven categories are shown in Table 1). We implemented this new scheme because, as we annotated DRMs from the literature and other databases, we observed DRMs that did not fit the standard categories of major and minor [24]. Briefly, DRM types designated beneficial or beneficial set (for decreasing drug resistance) are colored different shades of green. Those that cause resistance, primary and primary set, are colored red and pink, respectively. Those that amplify resistance are called secondary set and are colored purple. The few mutations that do not affect resistance directly, but which are precursors to other DRMs, are called precursors and are colored light blue. There is a checkbox option to view ambiguous mutations, which are colored white. Ambiguous mutations are those DRMs identified from another database for which a published peer-reviewed source could not be identified.
The combined information from the Stanford Drug Resistance database and the 2011 update from the International AIDS Society contains 188 DRMs that were classified as major or minor and had an identifiable published reference in a peer-reviewed paper (Table 1) [15,16]. Review of the drug resistance literature identified a number of mutations in these databases that did not have an identifiable peer-reviewed paper; these were classified as ambiguous and not used. We also identified mutations that were published and not present in these databases. Our refactored database contained 671 unique DRMs in the seven categories discussed above (Table 1). Our new classification scheme is used in several new features added in the HIVToolbox2 application, and has helped to identify an anatomy of drug resistance patterns for protease and reverse transcriptase addressed later herein.

[Figure 3 caption, beginning not recovered] ... Fig. 2 with an additional color for binding site residues that do not have a known DRM (orange). B. Information for each Drug Binding Site Residue is shown in a table that is color-coded using the same coloring scheme. A distance threshold between atoms of the drug and atoms of the protein (2.5-4.0 Å) can be set using a pulldown menu; 4.0 Å was set in this figure. This table provides the chain:position of the amino acid, distance, whether it is a DRM, and the type of DRM. The first column of this sortable table is interactive, where a mouse click identifies the amino acid in the structure of the Drug Binding Site window (A). doi:10.1371/journal.pone.0098810.g003

Enhancements to the HIVToolbox2 program
HIVToolbox2 boasts many improvements over the original HIVToolbox [22]. The introduction page contains new HIV protein and drug-selection menus. The Drug menu enables direct loading of structures of HIV protein:ARV drug complexes. The HIVToolbox2 interface can also be accessed from hyperlinks from structures of HIV proteins in the Protein Data Bank website [12].
Selecting a protein or drug directs the user to an interactive results page containing a set of windows. HIVToolbox2 has Sequence and Log windows that are similar to the original HIVToolbox with minor modifications to improve usage (Fig. 1). The Sequence window has been widened to show rows of 100 residues (Fig. 1A). The lines above the protein sequence are used to identify (hover mouse over the line) and load different structures into the structure windows. This is necessary, since many different structures and chains are available for certain HIV proteins. Two options for viewing chains are now available. The default view is visible when the "Display individual chains" checkbox is checked. This view shows all chains available for a particular structure for the selected HIV protein. When this checkbox is deselected, only the structures of HIV:ARV complexes are shown, with the longest version of the chain for each structure and no chain redundancy (the lines are thicker to distinguish between the two displays). Other interactive functions of the Sequence window have not changed.
When selections are made in the Sequence window, relevant information is output to a modified Log window with two tabs. The Color Key and Motif Key log windows from the original HIVToolbox have been combined into separate tabs of a consolidated Log window (Fig. 1B and 1C). All minimotifs, functional sites, and protein-protein interactions in the Log window are hyperlinked to PubMed abstracts for the reference sources.
A signature feature of the original HIVToolbox was three synchronized interactive protein structure displays, each showing different information about protein multimerization, domains, minimotifs, protein-protein interaction sites, functional sites, and protein sequence conservation. These windows still have the same function with some minor modifications. Protein chains are now selected from a pulldown menu in the Structure Windows title bar. This also made it possible to choose among chains and, for structures of a protein:ARV drug complex, to display the drug as a wireframe model.
In HIVToolbox2, we have added three synchronized interactive structure displays for viewing drug resistance mutations (DRMs), drug binding sites, and immune epitopes. As with the other three structural displays, the mouse can be used to rotate or zoom, and hovering the cursor over any region of the protein structure identifies the atom beneath it. A right click reveals a menu with JSmol commands and the option to open a JSmol console. All six structure displays are synchronized and interactive using JSmol commands.

[Figure 5 caption, beginning not recovered] ... Interactions/Sites window. C. Conservation of the residues is shown in the Homology window. The conservation slide threshold is set to 99% amino acid identity, and yellow residues are conserved among the 50,017 viral sequences shown here. D. DRM window with DRMs for Saquinavir colored. The coloring scheme for the DRMs is beneficial (green), beneficial set (light green), primary (red), primary set (pink), and secondary set (purple). G. Information for each DRM is shown in a table that is color-coded using the same DRM coloring scheme. DRMs for different drugs can be loaded using the pulldown menu at the bottom of the table. This table also provides the original amino acid, position, mutated amino acid, and links to the abstracts of PubMed papers supporting the DRM. The first column of this table is interactive, where a mouse click identifies the amino acid in the structure of the DRM window (D). E. Drug Binding Site window showing the structure of protease with the binding site for Saquinavir colored. The coloring scheme for the DRMs is as in Fig. 2, with an additional color (orange) for binding site residues that do not have a known DRM. H. Information for each Drug Binding Site Residue is shown in a table that is color-coded using the same coloring scheme as in E. A distance threshold between atoms of the drug and atoms of the protein (2.5-4.0 Å) can be set using a pulldown menu; 4.0 Å was set in this figure. This table provides the amino acid position, shortest distance to a drug atom, whether it is a DRM, and the type of DRM. The first column of this table is interactive, where a mouse click identifies the amino acid in the structure of the Drug Binding Site window (E). F. Epitope window showing protease with the immune epitope KMIGGIGGFI colored green. Different positive immune epitopes for the loaded HIV protein from the IEDB can be selected using a pulldown menu on the top of the window that shows the IEDB id number and peptide sequence. doi:10.1371/journal.pone.0098810.g005
The new Drug Resistance Structure window (Fig. 2A) is initially loaded with a default structure for each protein:ARV complex, if one exists in the PDB. The DRMs in the drug resistance display are colored by a new DRM classification scheme (Table 1) where red = primary (a DRM that can cause observable resistance by itself), pink = primary set (a group of mutations that can cause resistance when they occur together), green = beneficial (a mutation that increases drug susceptibility), dark green = beneficial set (a set of mutations that together increase drug susceptibility), and purple = secondary set (one or more mutations that can enhance resistance when combined with a primary or primary set of mutations).
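The classification scheme lends itself to a simple lookup table. The following is a minimal sketch, not the HIVToolbox2 implementation: the names `DrmClass`, `DRM_COLORS`, and `example` are hypothetical, the precursor and ambiguous colors follow the description given with Table 1, and the class assigned to position 50 is an illustrative placeholder.

```python
from enum import Enum


class DrmClass(Enum):
    """The seven DRM categories described in the text (names are illustrative)."""
    PRIMARY = "primary"
    PRIMARY_SET = "primary set"
    SECONDARY_SET = "secondary set"
    BENEFICIAL = "beneficial"
    BENEFICIAL_SET = "beneficial set"
    PRECURSOR = "precursor"
    AMBIGUOUS = "ambiguous"


# Colors as stated in the text. The beneficial set is called "dark green" here
# and "light green" in the Fig. 5 caption; this sketch follows the main text.
DRM_COLORS = {
    DrmClass.PRIMARY: "red",
    DrmClass.PRIMARY_SET: "pink",
    DrmClass.SECONDARY_SET: "purple",
    DrmClass.BENEFICIAL: "green",
    DrmClass.BENEFICIAL_SET: "dark green",
    DrmClass.PRECURSOR: "light blue",
    DrmClass.AMBIGUOUS: "white",
}

# Hypothetical example: residue position -> DRM class for one drug.
example = {82: DrmClass.BENEFICIAL, 50: DrmClass.PRIMARY}

# Map each position to the color used in the structure display and log table.
colors = {position: DRM_COLORS[drm_class] for position, drm_class in example.items()}
print(colors)
```

Keeping the mapping in one place would let the structure display and the log table share a single color definition.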
The Drug Resistance Mutation display also has a drop-down selection menu that restricts the display to the DRMs for a single drug (Fig. 2A). The known DRMs are listed in the Drug Resistance Mutation log window with their position, drug, mutation, classification type, and hyperlink(s) to primary reference(s); rows are colored by resistance classification type. The table is interactive, where selecting the DRM identifies the location of the mutation in the Drug Resistance Mutation window with a temporary flash. Concurrently, the structure is centered and zoomed to show the DRM (Fig. 2A). The DRMs for all ARV drugs are shown upon the initial loading of the protein selected from the menu. A menu selector can be used to choose a specific drug, and the Load DRM button at the bottom of the table loads the DRMs for the selected ARV drug.
The new Drug Binding Sites structure window shows a surface plot with drug-binding site residues (Fig. 3A). The residues are colored like the DRMs, except that contact residues, for which there are no known drug resistance mutations, are colored orange. The drug is shown as a wireframe figure. A distance threshold can be selected from a pulldown menu below the Drug Binding Site Log window and then loaded (Fig. 3B). This threshold is for residues with an atom that makes contact with an atom of a bound drug within a specific distance. The distance threshold can be varied between 2.75 Å and 4.0 Å in 0.25 Å increments. The Drug Binding Site Log window shows the protein chain and position, distance to the closest atom in the drug, whether it is a known DRM, and the DRM classification type. Each row is colored by the class of DRM. Selection of the residue in the table shows the location of the residue in the structure window with a temporary flash, and also re-centers and zooms the structure to show the binding site residue.
The new Immune Epitope structure window has positive immune epitopes colored on the surface of an HIV protein structure (Fig. 4A). Immune epitopes and their identifiers from the HIV Immune Epitope database 2.0 can be selected from a pulldown menu above the window or by selecting the epitope from the Epitopes Log window (Fig. 4B) [25]. If the shift key is held down while selecting multiple epitopes from the log window, multiple epitopes can be shown concurrently. The table also has the epitope ID and hyperlink to the entry in the Immune Epitope Database.
The six interactive structural displays are organized for direct comparison (Fig. 5A-F). These are interactive with the three adjacent log windows (Fig. 5G,H; the Epitopes Log window is not shown here). This layout facilitates interpretation of data in the context of structure, function and sequence conservation. The new structure windows in HIVToolbox2 provide a new means to study HIV pathogenesis and its relation to immunology.
Several data items in the HIVToolbox2 database have been updated (Table 2). We have added additional sequences from the 2012 Los Alamos HIV Sequence database [10]. The HIVToolbox2 database now contains ~502,000 HIV protein sequences from different patient blood samples. The database was also updated and now contains ~1,200 structures of HIV proteins, including several new structures of protein:ARV drug complexes from the PDB [12]. We calculated all residues in HIV proteins that were within 3.5 Å of an atom in the complexed molecule to create binding sites that were entered in the HIVToolbox2 database as new protein-protein interactions or, for non-protein molecules, as new sequence features. Some additional functions associated with sequence elements, which were identified in the literature, were added to the database. For all annotations, we now provide a hyperlink to a PubMed abstract that identified the interaction. The HIVToolbox database is updated at least annually, and we plan to continue this schedule.

New workflows enabled in HIVToolbox2
Workflows #1-16. Six integrated structural viewers make it easy to compare different types of data with regard to sequence, structure, function, sequence conservation, drug resistance and immune epitopes. The 16 different types of pairwise comparisons enabled are shown in Table 3. Workflows 4-16 are now enabled in HIVToolbox2. One example from these 16 workflows is shown for an HIV protease:Saquinavir complex in Fig. 5. This example of multiple comparisons shows that the T82 residue (arrows) is in a region that is not conserved (panel C; blue residues are not conserved) and lies outside the active site (panel B). It is a beneficial mutation (panels D, G; green), makes contact with the drug (panels E, H), and is part of immune epitope #40375 (panel F). Different aspects of workflows #17-21 described below are enabled in HIVToolbox2 and were not possible with HIVToolbox.
Workflow #17: Predicted effectors of HIV protein multimerization. Most HIV proteins form multimers required for their activity (Table 4). We considered that multimerization could potentially be regulated by other functional sites in proteins. Therefore, we looked for functional sites within the multimerization interface in different structures of HIV proteins. We noticed a common pattern where phosphorylation sites were present at sites of subunit interactions in structures of Vif, Rev, Tat, and Matrix multimers [26][27][28][29]. We identified some protein-protein interaction sites in Nef, Rev, Vif, and Vpr that overlap with the multimerization interface. Thus, they may be involved in HIV protein oligomerization and activity [26,27,30,31]. The Protein Sequence window can be used to investigate known and predicted minimotifs that overlap with HIV protein oligomerization sites.
Workflow #18: Identification of overlapping or nonoverlapping functionalities to generate new hypotheses. Consolidation and integration of the functional information in HIVToolbox2 can facilitate experimental design and interpretation.
One of the best examples of how coordination of data can be used to generate new hypotheses comes from examination of Tat with HIVToolbox2 (Fig. 6). The HIV Tat transcription factor is a potential drug target [32]. Examination of the Tat sequence shows a functional hotspot between residues 15-57 (Fig. 6C, blue shaded box). In this region, there are binding sites for ~30 different proteins and multiple types and sites of posttranslational modifications (PTMs). These residues are some of the most highly conserved in Tat (Fig. 6B). There are several examples in this region of Tat where functional sites are known to compete with each other [33].
Structure mapping of sites on Tat with HIVToolbox2 (Fig. 6A) allows evaluation of which proteins or PTMs have residues that overlap other sites. These are expected to be competitive functions, in many cases. Several previously unknown examples of such functional overlaps are easily recognized. The Cyclin T1 and CDK9 binding sites overlap with an ADP ribosylation site. Tat also binds p53 at a site that overlaps with several others (the Karyopherin beta, Proteasome alpha 1, and DNA directed RNA polymerase II binding sites, as well as the RNA binding site, protein methylation sites, and acetylation sites). From a compatibility perspective, the p53 and TBP associated factor 1 binding sites are adjacent to, but don't overlap with, the Tat dimerization site and Cyclin T binding sites. However, the TBP and p53 binding sites do have overlapping residues. There are far too many combinations to discuss here, but this tool is clearly a source for better understanding the multiple roles of Tat. HIVToolbox2 helps interpret results, as demonstrated by examining the hotspot region of Tat.
Workflow #19: Known and predicted minimotifs in HIV proteins. HIV Rev binds the Rev Response Element (RRE) in the HIV RNA genome and facilitates transport of the genomic RNA from the nucleus to the cytosol. Rev has known sequence elements associated with dimerization, phosphorylation, methylation, RNA binding, and ubiquitination. We examined Rev for minimotifs to demonstrate the utility of this type of workflow. The region of Rev between P76-L83 seems to be multifunctional, binding four different proteins. This region is not in the dimerization site or other functional sites. This region of Rev binds ArfGAP, a protein involved in nuclear export [34]. The nuclear export function seems to have redundancy with an overlapping NLP1 binding site, which serves as a bridge protein to bind Exportin 1 for nuclear export [35]. These are consistent with the known roles of Rev in export of the genomic HIV RNA. This region also binds to prothymosin α, a protein involved in transcription, and Sam68, another RNA binding protein that is involved in HIV genomic RNA export, as well as in translational regulation of HIV RNA [36]. Given that there are four different binding proteins for this site, and that Rev forms dimers, it is currently unclear if Rev forms heterotetramers with two of its binding partners, and, if so, with which pairs of proteins. This may be an important facet of Rev function.
Workflow #20: Global resistance landscapes. As an example of a global resistance landscape, we examined HIV protease inhibitors using HIVToolbox2 (Fig. 7). This type of analysis demonstrates the utility of both the new DRM classification scheme and the HIVToolbox2 tool. When we examine the distribution of the DRMs on the protease surface plots for all FDA-approved drugs that target HIV protease, several resistance patterns become apparent. All known primary mutations are in the drug-binding pockets. Primary set mutations contain residues that are either in the binding pocket or immediately juxtaposed, but only on one face of the protease. Beneficial or beneficial set mutations are clustered near the active site but in a region overlapping with the primary set mutations. Secondary-set mutations generally overlap with a region containing primary set mutations. Mutations are observed in the active site and in residues that form a flap covering the active site, but never in the dimerization residues. The active site, flap, and dimerization site residues are highly conserved, whereas many residues in the primary set and beneficial regions have lower conservation levels (as little as 85% in ~50,000 HIV-1 protease sequences).
Workflow #21: Examining amino acid frequencies by HIV subtype. A useful feature of HIVToolbox2 is that it enables viewing mutations and their frequencies in specific viral subtypes. This can be accomplished for any known amino acid in an HIV protein by using the pulldown menus at the bottom of the Sequence window, selecting the Clustal Alignment in the Sequence Alignment section, and then selecting the PSSM. The frequencies are calculated from the data in the Los Alamos HIV Sequence database, which were not collected in a single standardized epidemiological study, but do provide a rough snapshot of mutation prevalence in each subtype.
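The per-subtype frequency calculation can be illustrated with a short sketch. This is not the HIVToolbox2 code: the function name `residue_frequencies`, the input layout (pairs of subtype label and aligned sequence), and the toy three-column alignment are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def residue_frequencies(aligned, position):
    """Return {subtype: {amino_acid: fraction}} for one alignment column.

    aligned  -- iterable of (subtype, aligned_sequence) pairs
    position -- 1-based column in the alignment; gap characters are skipped
    """
    counts = defaultdict(Counter)
    for subtype, sequence in aligned:
        residue = sequence[position - 1]
        if residue != "-":
            counts[subtype][residue] += 1
    return {
        subtype: {aa: n / sum(counter.values()) for aa, n in counter.items()}
        for subtype, counter in counts.items()
    }


# Toy alignment (hypothetical data, not taken from the Los Alamos database).
toy_alignment = [
    ("B", "PQV"), ("B", "PQA"), ("B", "PQV"), ("B", "PQV"),
    ("G", "PQV"), ("G", "P-V"),
]
frequencies = residue_frequencies(toy_alignment, 3)
print(frequencies)
```

Normalizing within each subtype, rather than across the whole alignment, is what allows a mutation that is rare overall to stand out as common in one subtype.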
To show the utility of this tool, we examined the beneficial and primary DRMs for HIV drug resistance in protease (Table 5). In this analysis, we used NP_705926 as the reference sequence. Some interesting patterns were apparent. The L10V beneficial set DRM for Atazanavir is prevalent in the F1 subtype, but this must occur with L24I, which is only in 4% of the subtype F1 sequences. The K20I beneficial DRM for Darunavir is in most of the 612 subtype G sequences. Although this was previously known as a beneficial mutation, it was not known to be prevalent in subtype G viruses [37]. The V82A beneficial DRM for Darunavir and beneficial set for Atazanavir [37][38][39] is prevalent in the B and F1 subtypes (19-25% of sequences). The M46L mutation is also abundant in subtype B. This type of subtype analysis can also be performed for any minimotif, functional site, immune epitope, protein-protein interaction, and drug binding site residue with HIVToolbox2.

Availability, video tutorials and user guide
HIVToolbox2 is an open-access web application available at http://hivtoolbox2.bio-toolkit.com. The application has been tested on all major web browsers and operating systems. A Help page for HIVToolbox2, with a summary, funding, video tutorials, user guide, research papers and contact information, is at http://www.bio-toolkit.com/HIVToolbox/project. The SQL database of drug-resistance mutations is available upon request.

Discussion
Our second release of the HIVToolbox provides both data updates and new functions enabling 21 different types of workflows; only three were possible with the original HIVToolbox. In addition to our previous focus on sequence, structure, function and conservation, we have added information related to HIV pathogenesis: HIV drugs, drug resistance and immune epitopes. By using HIVToolbox2 to explore some of these workflows, we have identified some interesting aspects of HIV proteins that become more obvious once all the data is integrated and visualized. These include the following findings: (1) almost all HIV proteins form homomultimers; (2) host proteins bind or covalently modify interfaces of HIV protein homomultimerization; (3) HIVToolbox2 helps with interpretation of complex interaction interfaces in proteins like Nef and Tat; (4) a protease drug resistance landscape reveals a distinct resistance anatomy; and (5) some DRMs are much more prevalent in some subtypes.

HIV protein multimers
Although multimerization has been studied for individual HIV proteins, our consolidation of data for HIV structures has helped emphasize that most HIV proteins form some type of homo-oligomer. To our knowledge, this has not been previously reviewed. Protease, RT, Nef, Rev, Tat and Vif can form dimers. Env, GP120, GP41, Capsid, and Vif can form trimers, and Capsid and Matrix can form hexamers (Table 4). Nucleocapsid, p6, and Vpu are not known to multimerize. The HIV homomultimers are, in most cases, essential for activity of the protein, and multimerization has been extensively investigated as a mechanism of inhibition of replication [40][41][42][43][44][45][46][47].
The other interesting aspect of HIV protein multimerization is that several posttranslational modifications and interactions with host proteins are within HIV homomultimerization interfaces and are expected to compete with multimerization (Table 4). This observation suggests that host factors may play an important role in controlling where and when HIV proteins multimerize, thus controlling their activity. This is interesting because one general approach in inhibiting HIV replication has been to generate peptides or compounds that block multimerization of key HIV proteins [40][41][42][43][44][45][46][47].

Tat interpretation
As knowledge of protein function grows, it becomes clearer that some regions of proteins are very complex. For example, a hotspot of interaction has been identified in HIV Nef [48]. Integrating the data makes a similar hotspot apparent for Tat, where there are over 30 protein-protein interactions and posttranslational modifications in a 32 amino acid region. Many scientists model highly complex proteins in networks, where Tat and other proteins with many interactions are considered hubs. HIVToolbox2 advances the analysis of Tat as a hub protein by enabling rapid interpretation in the context of structure. The structure can be used to derive sets of testable rules for the hub network node. An example of a rule that can be extracted from the HIVToolbox2 interface is: "Methylation at K51 overlaps with the RNA binding site; thus, K51 methylation and RNA binding on the same Tat monomer are mutually exclusive."
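Rules of this kind could, in principle, be enumerated by testing every pair of annotated sites for shared residues. The sketch below assumes each site is stored as a set of residue positions; the function name `overlapping_pairs`, the residue range given for the RNA binding site (49-57), and "hypothetical site X" are illustrative assumptions rather than values from the HIVToolbox2 database.

```python
from itertools import combinations


def overlapping_pairs(sites):
    """Return (site_a, site_b, shared_positions) for every pair of sites
    that share at least one residue position.

    sites -- dict mapping a site name to a set of residue positions
    """
    pairs = []
    for name_a, name_b in combinations(sorted(sites), 2):
        shared = sites[name_a] & sites[name_b]
        if shared:
            pairs.append((name_a, name_b, sorted(shared)))
    return pairs


# Illustrative annotations for Tat (the ranges are assumptions for this sketch).
tat_sites = {
    "K51 methylation": {51},
    "RNA binding site": set(range(49, 58)),
    "hypothetical site X": {20, 21, 22},
}

# Each overlapping pair is a candidate "mutually exclusive on one monomer" rule.
candidate_rules = overlapping_pairs(tat_sites)
print(candidate_rules)
```

Overlap in sequence is only a starting point; whether two sites actually compete would still need to be judged in the structural context and tested experimentally.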

HIV protease resistance landscape
A new feature in HIVToolbox2 is the ability to view DRMs mapped onto the surface of protein structures. Fig. 7 shows a comparison of DRMs for various FDA-approved HIV protease inhibitors. This analysis, when combined with an extended DRM classification scheme, reveals an anatomy of resistance in protease. Each type of DRM is localized to a specific region of protease. Furthermore, drug resistance mutations have not yet been observed near the dimerization or nitrosylation sites. Such a global pattern is not easily recognized without the visual mining enabled by HIVToolbox2. We note that the region covered by 4 protease immune epitopes includes the regions that have primary and primary set mutations. This resistance anatomy may prove useful for pharmaceutical companies in designing future ARVs that are less susceptible to drug resistance.

DRM prevalence in HIV-1 subtypes
The original HIVToolbox had a function to look at sequence from blood samples for different HIV subtypes. By including DRMs in HIVToolbox2, we could now examine how different DRMs were distributed among different HIV subtypes. These observations must be considered with caution, as the sequence data were not collected as a single epidemiological study, but rather are a compendium of many different studies and samples. Nevertheless, there were some interesting observations (Workflow 21, Table 5). The V82A DRM, which is beneficial for Darunavir and part of a beneficial set for Atazanavir, was in 19-25% of subtype B and F1 samples [37,38,49].

Conclusions
HIVToolbox2 updates the original HIVToolbox with new data, new functions and improved ease of use. Data integration and the new functions enable many new types of workflows that have resulted in several new global observations: (1) most HIV proteins form higher order homomultimers; (2) many multimerization interfaces have posttranslational modifications or protein-protein interactions that may compete with or enhance multimerization; (3) HIV protease has a global resistance anatomy; (4) protein structure can be used to help examine network hub proteins such as Tat; and (5) some DRMs are more prevalent in specific Class M subtypes.

Software engineering
HIVToolbox2 was built as a standard, three-tier J2EE web application consisting of 1) an underlying relational MySQL database, 2) a set of standard Java data access objects that pull data from the database, and 3) a set of dynamic interactive web pages. Several classes were translated from Java to JavaScript so that the structure interaction interface is generated on the client side, instead of the server side. This is better suited to cross-browser and cross-platform compatibility.
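The three-tier separation can be sketched compactly. The example below is only an analogy to the architecture described above: it uses Python with an in-memory SQLite database as a stand-in for the J2EE and MySQL stack, and the table `drm`, its columns, and the class `DrmDao` are hypothetical names, not the actual HIVToolbox2 schema. The two rows reflect DRMs mentioned in the text (K20I and V82A, both beneficial for Darunavir).

```python
import json
import sqlite3

# Tier 1: relational database (hypothetical schema, SQLite stand-in for MySQL).
SCHEMA = """
CREATE TABLE drm (
    protein   TEXT,
    position  INTEGER,
    wild_type TEXT,
    mutant    TEXT,
    drug      TEXT,
    drm_class TEXT
)
"""


class DrmDao:
    """Tier 2: data access object that hides SQL from the presentation layer."""

    COLUMNS = ("position", "wild_type", "mutant", "drm_class")

    def __init__(self, connection):
        self.connection = connection

    def find(self, protein, drug):
        cursor = self.connection.execute(
            "SELECT position, wild_type, mutant, drm_class FROM drm "
            "WHERE protein = ? AND drug = ? ORDER BY position",
            (protein, drug),
        )
        return [dict(zip(self.COLUMNS, row)) for row in cursor]


def render_json(rows):
    """Tier 3: serialize rows for a client-side (JavaScript) viewer."""
    return json.dumps(rows)


connection = sqlite3.connect(":memory:")
connection.execute(SCHEMA)
connection.executemany(
    "INSERT INTO drm VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("protease", 82, "V", "A", "Darunavir", "beneficial"),
        ("protease", 20, "K", "I", "Darunavir", "beneficial"),
    ],
)
rows = DrmDao(connection).find("protease", "Darunavir")
payload = render_json(rows)
print(payload)
```

Returning plain serializable rows from the data access tier is one way to let the rendering logic move to the client, as the text describes for the structure interaction interface.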

Data sources
HIV-1 data from external sources such as the Protein Data Bank, NCBI, and the Los Alamos HIV sequence database were collected, curated, and stored in the HIVToolbox2 database. The HIVToolbox2 database has ~502,000 total sequences for HIV blood samples from 126 different countries [22]. These sequences were derived from nucleotide sequences from the Los Alamos HIV sequence database, which were converted into amino acid sequences using BioJava 3.03 (http://www.biojava.org).
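The conversion step can be illustrated with a short stand-in for the BioJava translation. This sketch uses the standard genetic code and assumes the input is already in the correct reading frame; the function name `translate`, the use of "X" for codons containing ambiguous bases, and the example nucleotide string are choices made for illustration only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate an in-frame nucleotide sequence into one-letter amino acids.

    Stop codons become '*', codons with ambiguous bases become 'X', and any
    trailing partial codon is ignored.
    """
    sequence = nucleotides.upper().replace("U", "T")
    usable_length = len(sequence) - len(sequence) % 3
    return "".join(
        CODON_TABLE.get(sequence[i:i + 3], "X")
        for i in range(0, usable_length, 3)
    )


# Short illustrative fragment (21 nucleotides, 7 codons).
peptide = translate("CCTCAGATCACTCTTTGGCAA")
print(peptide)
```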

Distance and frequency calculations
In order to identify amino acids that contact atoms in the drug we used BioJava. Distance thresholds were set from 2.5-4.0 Å in 0.25 Å increments. The pre-calculated distance data is stored in MySQL tables and returned upon client requests. The residue frequencies were calculated from multiple sequence alignments as previously done using ClustalV for clade specific alignments in the HIVToolbox database [50]. The pre-processed data for the frequency of amino acids for DRMs are stored in a MySQL table.
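The pre-calculation of contact residues at each threshold can be sketched as follows. This is a plain-Python illustration rather than the BioJava code used for HIVToolbox2: the function names, the input layout (residue keys mapped to lists of atom coordinates), and the toy coordinates are all assumptions.

```python
import math

# Thresholds from 2.5 to 4.0 Å in 0.25 Å steps, as stated in the text.
THRESHOLDS = [2.5 + 0.25 * i for i in range(7)]


def min_distance(residue_atoms, drug_atoms):
    """Shortest distance between any residue atom and any drug atom."""
    return min(math.dist(a, b) for a in residue_atoms for b in drug_atoms)


def contact_table(residues, drug_atoms, thresholds=THRESHOLDS):
    """Return {threshold: [(residue, distance), ...]} for residues whose
    closest atom lies within the threshold of any drug atom.

    residues   -- dict mapping (chain, position) to a list of (x, y, z) atoms
    drug_atoms -- list of (x, y, z) coordinates for the bound drug
    """
    distances = {
        residue: min_distance(atoms, drug_atoms)
        for residue, atoms in residues.items()
    }
    return {
        threshold: sorted(
            (residue, d) for residue, d in distances.items() if d <= threshold
        )
        for threshold in thresholds
    }


# Toy coordinates (hypothetical, not taken from a PDB entry).
drug = [(0.0, 0.0, 0.0)]
residues = {
    ("A", 25): [(3.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    ("A", 82): [(0.0, 3.8, 0.0)],
    ("A", 10): [(9.0, 9.0, 9.0)],
}
table = contact_table(residues, drug)
print([residue for residue, _ in table[4.0]])
```

Computing the minimum distance once per residue and then filtering by threshold means the table for every threshold can be stored in advance and returned without recalculation.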