Interactive Data Visualization for HIV Cohorts: Leveraging Data Exchange Standards to Share and Reuse Research Tools

Objective: To develop and disseminate tools for interactive visualization of HIV cohort data.
Design and Methods: If a picture is worth a thousand words, then an interactive video, composed of a long string of pictures, can produce an even richer presentation of HIV population dynamics. We developed an HIV cohort data visualization tool using open-source software (R statistical language). The tool requires that the data structure conform to the HIV Cohort Data Exchange Protocol (HICDEP), and our implementation utilized Caribbean, Central and South America network (CCASAnet) data.
Results: This tool currently presents patient-level data in three classes of plots: (1) longitudinal plots showing changes in measurements viewed alongside event probability curves, allowing for simultaneous inspection of outcomes by relevant patient classes; (2) bubble plots showing changes in indicators over time, allowing for observation of group-level dynamics; and (3) heat maps of levels of indicators changing over time, allowing for observation of spatial-temporal dynamics. Examples of each class of plot are given using CCASAnet data investigating trends in CD4 count and AIDS at antiretroviral therapy (ART) initiation, CD4 trajectories after ART initiation, and mortality.
Conclusions: We invite researchers interested in this data visualization effort to use these tools and to suggest new classes of data visualization. We aim to contribute additional shareable tools in the spirit of open scientific collaboration and hope that these tools further the participation in open data standards like HICDEP by the HIV research community.

Data Availability Statement: The authors confirm that, for approved reasons, some access restrictions apply to the data underlying the findings. Complete data for this study cannot be posted in a supplemental file or a public repository because of legal and ethical restrictions. The Principles of Collaboration under which the CCASAnet multinational collaboration was founded and the regulatory requirements of the different countries' IRBs require the submission and approval of a project concept sheet by the CCASAnet executive committee and the principal investigators at participating sites. All

Introduction
In the practice of epidemiology, data visualization has historically been of great importance [1], whether for exploring data structures in preparation for analysis [2], for interpreting patterns of events in populations over space and time [3], or for more clearly communicating inferences drawn from completed analyses [4]. Data visualization has also been important in our understanding of the HIV epidemic [5,6]. Data animations can improve figures by allowing the display of a temporal dimension [3]. In many static plots, the requisite data dimensions consume all the display space, leaving no room to add the temporal dimension without compromising the clarity and effectiveness of the information conveyed. Static snapshots of plots taken at regular time intervals can be strung together to form frames in a video animation, the direction and speed of which can be altered by the user. For example, recent work elucidated the CD4 and viral load response to antiretroviral therapy using a dynamic visual display [7].
While various data visualization techniques have been proposed and analyzed in related domains, including geographic information systems [8,9], social networks [10], and bioinformatics [11], they mostly require loading the data into tool-specific stores and formatting that data according to ad hoc syntax. A recent systematic review of data visualization tools for infectious diseases suggested that future developers focus on the broader contexts of available data, team collaboration, and interdisciplinary needs [12]. Existing tools attract users when they are free, interactive, transparent, and have a limited learning curve. We maintain that an open standard unifying syntactic and semantic definitions, coupled with an open set of data analytic and visualization tools, would provide sufficient incentive for the community to incrementally build and enhance such tools [12-16].
We describe a tool built with open access software and data exchange standards to promote visualization of HIV cohort data. We identified classes of regularly used plots for which an additional temporal dimension, displayed through interactive animation, can increase their appeal and explanatory power. We demonstrate the tool using HIV cohort data from the Caribbean, Central and South America network for HIV research (CCASAnet).

Cohort Description
CCASAnet is a shared repository of HIV cohort data from sites in Argentina, Brazil, Chile, Haiti, Honduras, Mexico and Peru. The collaboration was established in 2006 as part of the International Epidemiologic Databases to Evaluate AIDS (IeDEA; www.iedea.org) with the purpose of collecting retrospective clinical HIV data to describe the unique characteristics of the epidemic in the region [17]. The Vanderbilt University Medical Center Institutional Review Board approved this project. Local centers de-identified all data before transmitting it to the CCASAnet Data Coordinating Center at Vanderbilt University, so no informed consent was required.

Data Exchange Standard
Cross-cohort collaborations have long been hindered by the use of different protocols for data exchange. In an effort to reduce the workload of data extraction and shorten the time to analysis, an HIV Cohort Data Exchange Protocol (HICDEP, available at http://www.hicdep.org/) was developed and widely disseminated in 2004 [18]. In 2010, CCASAnet adopted a data transfer protocol based on HICDEP to support and streamline data harmonization between the multiple sites. The CCASAnet Data Coordinating Center has leveraged this open standard to build a suite of data visualization tools that can be shared with the HIV cohort community and beyond as open source tools. While the results in this paper use actual CCASAnet data, example datasets have been made available to readers in order to practice using the tools highlighted in this paper (http://biostat.mc.vanderbilt.edu/ArchivedAnalyses).
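Because the tools assume HICDEP-structured input, a quick structural check before plotting can save time. The following is a minimal sketch in Python (chosen only for illustration; it is not part of the R tool set described here). The table names and column sets are assumptions modeled loosely on HICDEP naming and should be checked against the HICDEP specification; the example rows are invented.

```python
import csv
import io

# Assumed minimal column sets, loosely following HICDEP table naming.
# Consult the HICDEP specification for the authoritative field lists.
REQUIRED_COLUMNS = {
    "tblBAS": {"PATIENT", "ENROL_D"},
    "tblLAB_CD4": {"PATIENT", "CD4_D", "CD4_V"},
    "tblLTFU": {"PATIENT", "DEATH_Y", "DEATH_D"},
}


def missing_columns(table_name, csv_text):
    """Return the sorted list of required columns absent from a CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    present = {name.strip().upper() for name in header}
    return sorted(REQUIRED_COLUMNS[table_name] - present)


# Invented example data, not CCASAnet records.
example_cd4 = "PATIENT,CD4_D,CD4_V\nA001,2008-03-14,212\n"
example_ltfu = "PATIENT,DEATH_Y\nA001,0\n"

print(missing_columns("tblLAB_CD4", example_cd4))  # nothing missing
print(missing_columns("tblLTFU", example_ltfu))    # DEATH_D is missing
```

A check of this kind only confirms that expected columns are present; it does not validate coding, units, or date formats.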

Data Visualization
Currently, there are three classes of plots requiring patient-level or country-level data (described below). The graphics are implemented using the R statistical language and encoded using MEncoder. The R code may be downloaded from our GitHub repository (https://github.com/CCASANET/dataviz), applied to HICDEP-compliant data, and customized as indicated in the written instructions or with the aid of an instructional video (http://biostat.mc.vanderbilt.edu/ccasanet/dataviz/instructions.htm). Current classes of plots are the following:
1. Longitudinal plots / event probability curves. This panel of graphics was motivated by common figures used to describe HIV therapy outcomes, including spaghetti plots, density curves, and Kaplan-Meier plots [19-21]. Longitudinal measures (e.g., CD4 count) with smoothed curves are viewed alongside event probability curves, allowing for simultaneous inspection of outcomes (e.g., mortality) stratified by patient classes (e.g., AIDS status at ART initiation). The smoothed LOESS curves are fit once over the whole time span using locally-weighted polynomial regression [22]. Density curves shown in the margins demonstrate how the distribution of the longitudinal measure changes as its trajectory grows in the frame; these changes in the density cannot be effectively visualized in a single static frame. Inputs per subject include: a longitudinal continuous measure with its measurement dates (e.g. CD4 count), an event indicator and corresponding date (e.g. death), a start date (e.g. combination antiretroviral therapy [cART] initiation), and a classifier (e.g. AIDS).
2. Bubble plots. Inspired by Hans Rosling's popular TED talks on world population statistics and his Gapminder project [23], this graphic shows changes in indicators over time allowing for observation of group level dynamics. Bubble plots show three dimensions of data including the placement on each of two axes and the size of the bubble, and a fourth dimension is added by showing the change over time in video presentation. This graphic takes as input: two indicators (e.g. CD4+ count < 200 cells/μL or AIDS diagnosis), one date (e.g. enrollment into HIV care), and one classifier (e.g. study site).
3. Heat maps. World maps are commonly used to show the burden of global HIV disease [24,25]. By displaying these maps over time, we can simultaneously view the spatial element of cohort data along with the population trends. A heat map shows borders of countries filled in with darker colors for high proportions and lighter colors for low proportions. This graphic takes as input a dataset with one indicator record for each country and year. There is a sample R script (cd4_base_country.R) that demonstrates how a user might generate this country-level dataset using patient-level data.
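The bubble plots and heat maps both consume an aggregated indicator per group and time period. As a rough illustration of that aggregation step, here is a minimal Python sketch (not the authors' cd4_base_country.R script, whose contents are not reproduced here). The field names, the choice to drop patients with a missing indicator from the denominator, and the example patients are all assumptions.

```python
from collections import defaultdict
from datetime import date


def proportion_by_group_year(records, group_key, date_key, flag_key):
    """Aggregate patient-level rows into one row per (group, year).

    Rows whose flag is None (indicator not measured) are excluded from
    the denominator; this is an assumed convention.
    """
    counts = defaultdict(lambda: [0, 0])  # (group, year) -> [flagged, total]
    for row in records:
        flag = row[flag_key]
        if flag is None:
            continue
        key = (row[group_key], row[date_key].year)
        counts[key][0] += int(bool(flag))
        counts[key][1] += 1
    return [
        {"group": g, "year": y, "n": total, "proportion": flagged / total}
        for (g, y), (flagged, total) in sorted(counts.items())
    ]


# Invented patients; field names are hypothetical.
patients = [
    {"country": "PER", "enrol_d": date(2005, 2, 1), "cd4_lt_200": True},
    {"country": "PER", "enrol_d": date(2005, 9, 12), "cd4_lt_200": False},
    {"country": "PER", "enrol_d": date(2006, 1, 20), "cd4_lt_200": True},
    {"country": "HTI", "enrol_d": date(2005, 6, 3), "cd4_lt_200": None},
    {"country": "HTI", "enrol_d": date(2005, 7, 8), "cd4_lt_200": True},
]

rows = proportion_by_group_year(patients, "country", "enrol_d", "cd4_lt_200")
for row in rows:
    print(row)
```

Passing a site identifier instead of a country as `group_key` would give the per-site, per-year values that a bubble plot needs; the count `n` could then drive the bubble size.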
The three classes of plot were tested independently of the developer (MB) by two CCASAnet members (YCV and MJG). The step-by-step instructions for users include:

Results
Examples of the output of three classes of plots based on CCASAnet data are summarized below; animations are best viewed at our website (http://biostat.mc.vanderbilt.edu/ccasanet/dataviz/examples.htm) and frames from the example animations are provided in Fig 1. All plots come with example user specifications as outlined in Table 1; users may directly edit the specifications in a CSV document to change the various inputs and parameters for the plots as detailed below.
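To illustrate the idea of driving a plot from an editable specification file, the following Python sketch reads a two-column CSV and merges it over defaults. The layout (`parameter,value`), the parameter names, and the default values are hypothetical; the actual specification keys are those given in Table 1 and in the tool's instructions.

```python
import csv
import io

# Hypothetical defaults; the real specification keys are not reproduced here.
DEFAULTS = {"measure": "CD4_V", "classifier": "AIDS", "max_days": "365"}


def read_spec(csv_text, defaults=DEFAULTS):
    """Merge 'parameter,value' rows over defaults; reject unknown names."""
    spec = dict(defaults)
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["parameter"].strip()
        if name not in spec:
            raise KeyError(f"unknown parameter: {name}")
        spec[name] = row["value"].strip()
    spec["max_days"] = int(spec["max_days"])  # the one numeric setting here
    return spec


# A user overrides two settings and keeps the default measure.
user_csv = "parameter,value\nclassifier,SEX\nmax_days,730\n"
spec = read_spec(user_csv)
print(spec)
```

Rejecting unknown parameter names is a design choice made for this sketch: it surfaces typos in the specification file early rather than silently ignoring them.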
Panel 1: Immunologic recovery and mortality following cART initiation, stratified by clinical stage at ART initiation
Fig 1A shows the final frame of these longitudinal plots / event probability curves applied to 17,517 patients. Each frame corresponds to a 1-day increment from the date of cART initiation. The top panel is a scatterplot of CD4 count against days on cART; observed values are marked with a semi-transparent dot at the day of observation and an X at the day of death (the last observed CD4 count is used as long as it was recorded within 12 months of death). Density curves show the most recent distribution of CD4. The bottom panel shows Kaplan-Meier estimates of the probability of death. In the CCASAnet cohort, patients initiating cART immediately separate into a higher-CD4 group and a lower-CD4 group according to AIDS status at initiation, and patients initiating cART with AIDS have an increased probability of death. It is also interesting to note that, conditional on survival past one year, the CD4+ counts for these groups become similar. This data visualization may also be useful for examining sex or age-group differences in HIV care and treatment outcomes.
In our example (Fig 1B and 1C), each frame corresponds to a calendar year of enrollment into HIV care. In Fig 1B,
In our example (Fig 1D and 1E), the proportion plotted corresponds to patients diagnosed with CD4+ cell count < 200 cells/μL. The first map (Fig 1D) shows the entire world to give context to the countries in the cohort, and a second map (Fig 1E) highlights only those countries with data in any of the time periods. Both maps are created from the same set of input specifications and R code. It is necessary to input a dataset that has one record per country and time period; the creation of this dataset may be aided by the example code in cd4_base_country.R.
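For readers unfamiliar with the Kaplan-Meier estimates drawn in the bottom panel of Panel 1, the calculation can be sketched as follows. This is an illustrative Python implementation of the standard product-limit estimator, not the authors' R code; the function names and the small follow-up dataset are invented, and the quadratic-time loop is kept for readability rather than speed.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up in days from the start date (e.g. cART initiation)
    events : 1 if the event (e.g. death) occurred at that time, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored subjects leave the risk set
    return curve


def event_probability(curve, day):
    """Cumulative event probability 1 - S(day), as drawn in a given frame."""
    survival = 1.0
    for t, s in curve:
        if t > day:
            break
        survival = s
    return 1.0 - survival


# Invented follow-up data for two strata (e.g. AIDS status at initiation).
strata = {
    "AIDS": ([30, 90, 200, 365, 365], [1, 1, 0, 1, 0]),
    "no AIDS": ([120, 365, 365, 365, 365], [1, 0, 0, 0, 0]),
}

for label, (times, events) in strata.items():
    curve = kaplan_meier(times, events)
    # One animation frame per day would re-evaluate this at day = 1, 2, ...
    print(label, round(event_probability(curve, 100), 3))
```

In an animation with one frame per day, each frame would draw the curve only up to that frame's day, which is what `event_probability` evaluates for a single day.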

Discussion
Our goal is to enable HIV researchers to create interactive visualizations of large HIV cohort databases which inspire insight into and even awe at the dynamics of HIV outcomes. Longitudinal plots / survival curves can be used to view changes in CD4+ cell count, HIV viral load, hemoglobin, and other continuously varying measures and the probability of AIDS-defining events, loss to follow-up, death, and other endpoints. Bubble plots can be used to visualize movement in key indicators across relevant groupings over various time periods. Heat maps can be used to provide spatial and temporal context to HIV cohort data. Commonly used graphics in the field of HIV cohort research are made interactive and accessible using open source tools and data exchange standards.
Variations of these visualizations have been incorporated as supplemental figures in CCASAnet manuscripts [27,28]. Future directions include enhancing the suite of tools with additional classes of data visualization, such as the recently released dynamic visual display of treatment response [7]. While R has an arguably steep learning curve, we have mitigated this through written and video instructions, a hands-on dataset, and graphics that users can modify through simple text file inputs rather than by editing R scripts. A logical next step in flattening the learning curve of R would be the design of a graphical user interface (GUI) that would allow the user to input the specifications interactively as opposed to editing the specifications in a CSV document. A GUI might also display descriptive statistics to supplement the visualization, or optionally allow for more comprehensive displays of uncertainty such as confidence intervals. Depending on the technical platform, further user interaction with these animations could include the ability to control the direction and speed of the time-lapse animation, the ability to highlight and track elements of the plot, and the ability to control parameters that mask or compare alternate scenarios. Researchers interested in this data visualization effort are encouraged to contact the authors and are invited to contribute ideas for additional interactive visualizations that may be openly implemented as part of this research tool set. Building on open standards like HICDEP, we aim to contribute additional shareable tools in the spirit of open scientific collaboration.