A Human "eFP" Browser for Generating Gene Expression Anatograms

Transcriptomic studies help to further our understanding of gene function. Human transcriptomic studies tend to focus on a particular subset of tissue types or a particular disease state; however, it is possible to collate into a compendium multiple studies that have been profiled using the same expression analysis platform to provide an overview of gene expression levels in many different tissues or under different conditions. In order to increase the knowledge and understanding we gain from such studies, intuitive visualization of gene expression data in such a compendium can be useful. The Human eFP (“electronic Fluorescent Pictograph”) Browser presented here is a tool for intuitive visualization of large human gene expression data sets on pictographic representations of the human body as gene expression “anatograms”. Pictographic representations for new data sets may be generated easily. The Human eFP Browser can also serve as a portal to other gene-specific information through link-outs to various online resources.


Introduction
Global gene expression profiling studies offer an unparalleled opportunity to further our understanding of gene function. In particular, the ability to decipher when a given gene is expressed, and to what level in certain tissues and developmental stages, can prove useful for human biomedical studies. It has been estimated that the human genome contains ~21,000 protein-coding genes [1], with more recent estimates putting this number even lower at 19,000 [2]. Experimental protein-level evidence for at least 30% of the ~21,000 genes is lacking [3], leaving a sizeable void in our understanding of gene function. Gene expression profiling can help bridge this gap by generating experimental evidence that a given gene is at least transcribed.
Expression levels of human genes vary across a multitude of tissue types, developmental stages and disease states. Typically, studies have focused on a particular subset of these conditions, but "atlas"-type resources that encompass a wide variety of tissue types and disease states, such as the Genomics Institute of the Novartis Research Foundation (GNF) Gene Expression Atlas (Su et al., 2004), have also been generated. Integration of a number of independent microarray studies covering a wide variety of biological conditions is challenging but possible as long as they have been sampled using the same platform [4]. We have integrated several such studies found in both the Gene Expression Omnibus (GEO, [5]) and ArrayExpress [6]. These include samples from the GNF Gene Expression Atlas as well as the following series: GSE475, GSE2361 [7], GSE3526 [8], GSE8961 [9], GSE4567 [10], GSE7307 [11], GSE19650 [12], E-MTAB-47 [13], E-GEOD-6257 [14], and E-MEXP-2219 [15]. In total, 774 samples from 11 different data sets have been collated. In addition, the RNA-Seq Illumina Human BodyMap 2.0 data set ([16]; Ensembl Release 70) containing 16 different samples has been added to the Human eFP Browser, showing the flexibility of this tool to enable viewing of data from different platforms. Expression levels for a given gene and tissue combination are not directly comparable if generated by different platforms; a message at the top of the Illumina Body Map 2 view alerts users to this fact.
Ultimately, in order to maximize the potential that gene expression studies offer, the ability to rapidly and easily interrogate these data sets is necessary. The interpretation of the gene expression level values should also occur in a coherent and user-friendly manner. Many online resources exist that enable a user to visualize gene expression levels in a data set for a given gene. Such tools include BioGPS [17], EBI Expression Atlas [18], GeneCards [19], Human Protein Atlas [20], GEO Profiles [5], TiGER [21], and Genevestigator [22]. However, these tools do not provide biological context: outputs are bar graph or heatmap visualizations, with the name of the sample being the only, often somewhat cryptic, indication as to what kind of tissue or cell type that sample was generated from. A more informative way to visualize such data would be to show the level of expression in an anatomical sense, thus lending some context to the data. The Expression Atlas tool at the EBI [23] does provide a representation of the human body for the Illumina Human Body Map 2.0 data set [16], in which the corresponding body part is highlighted when a user moves the mouse pointer over the gene expression value of interest. However, eye saccades and top-down processes [24] are still required to determine to which part of the body a given expression value belongs. This user interface also fails to provide anatomical context for smaller structures within tissues.
Here, we present a tool that enables the user to visualize large-scale human gene expression data sets directly on representations of the human body: the Human eFP Browser at http://bar.utoronto.ca/efp_human/, which is based on an open source framework developed by Winter et al. (2007). Current data sets in the Human eFP Browser were sampled on the HG-U133A and HG-U133 Plus 2 arrays (Affymetrix Inc., Santa Clara, USA), and by RNA-seq in the case of the Illumina Body Map 2 view. The user is shown diagrammatic anatomical representations that correspond to those areas of the body that were used to generate the RNA samples described above (currently categorized into five different views). The normalized gene expression data are stored on the Bio-Analytic Resource (BAR) server [25]. The user enters an Entrez gene identifier, a gene symbol, or a probe set identifier, and then chooses the mode of interpretation (absolute, relative, or compare). After clicking "Go", the representations of human samples are coloured based on the expression level of the gene of interest, generating expression "anatograms" for rapidly determining where a given gene is most strongly expressed. A yellow-red scale is used in the "Absolute" mode to depict expression levels, with yellow denoting no expression in a given depiction of a tissue and red denoting maximal expression. "Relative" mode displays the ratio of the expression level of a given gene relative to a control level (the median expression level for that gene across all samples in a particular view). The colour scale used in this instance is yellow-red for values above the control level, and yellow-blue for values below the control level. In "Compare" mode the expression level of the primary gene is compared to that of the secondary gene, and the colour scheme is the same as in the "Relative" mode.
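The three colouring modes described above can be illustrated with a minimal Python sketch. It is not the Human eFP Browser's actual code: the function names, the linear RGB interpolation, the use of a log2 ratio in the relative and compare modes, and the `max_abs_log2` scaling parameter are all assumptions made for illustration.

```python
import math

# RGB end points of the colour scales named in the text.
YELLOW = (255, 255, 0)
RED = (255, 0, 0)
BLUE = (0, 0, 255)


def _blend(start, end, fraction):
    """Linearly interpolate between two RGB colours; fraction is clamped to [0, 1]."""
    fraction = max(0.0, min(1.0, fraction))
    return tuple(round(s + (e - s) * fraction) for s, e in zip(start, end))


def absolute_colour(value, max_value):
    """Absolute mode: yellow for no expression, red for the maximum in the view."""
    if max_value <= 0:
        return YELLOW
    return _blend(YELLOW, RED, value / max_value)


def relative_colour(value, control, max_abs_log2):
    """Relative mode: yellow at the control level, red above it, blue below it.

    Assumption: the ratio is shown on a log2 scale and saturates at
    +/- max_abs_log2 (a hypothetical parameter).
    """
    if value <= 0 or control <= 0 or max_abs_log2 <= 0:
        return YELLOW
    log_ratio = math.log2(value / control)
    target = RED if log_ratio >= 0 else BLUE
    return _blend(YELLOW, target, abs(log_ratio) / max_abs_log2)


def compare_colour(primary, secondary, max_abs_log2):
    """Compare mode: same colour scheme, with the secondary gene as the reference."""
    return relative_colour(primary, secondary, max_abs_log2)
```

In this sketch, a tissue at the control level is drawn in yellow in relative mode, and a tissue at the view maximum is drawn in pure red in absolute mode.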
Information regarding the view with the highest level of gene expression is given near the top of the view, and information regarding probe set/gene identifiers as well as functional annotation attributed to the query gene is given at the bottom. Since gene expression data are given anatomical context, further interpretation is allowed and data become more accessible to users who may not be completely familiar with all parts of human anatomy. The Human eFP Browser is intended as a rapid and easy means for visualizing gene expression data sets to identify gene expression patterns of interest and facilitate hypothesis generation. Gene-specific link-outs are also provided to corresponding gene records in BioGPS [17], the Gene database at NCBI [26], UniProt [27], EBI, and GeneMANIA [28]. Thus the Human eFP Browser can also serve as a portal to gene-specific information. We have also worked with the curators at NCBI such that link-outs to the Human eFP Browser are available from the human Gene pages at NCBI.

Results
In order to demonstrate the utility of the Human eFP Browser, we present examples of genes whose expression patterns have been published. The first example output, shown in Fig 1, is for the insulin (INS) gene, which is expressed most highly in the pancreatic β islet cells [29]. Here, the gene symbol ("INS") was entered, and the "Absolute" mode and the "Skeletal Immune Digestive" data source were selected. The output for this gene shows expression exclusively in the pancreas/islet cells. Any functional annotation attributed to the gene is also given (not shown). Direct links to the records for the INS gene in BioGPS, NCBI, UniProt, and EBI are provided at the top of the output.
A second example output is shown in Fig 2 and is for the SIX homeobox 3 (SIX3) gene, which is associated with developmental abnormalities in the forebrain [30]. The highest levels of gene expression are found in the putamen and nucleus accumbens. Again, additional information related to this gene as well as link-outs to other resources are provided.
The calcium/calmodulin-dependent protein kinase II beta (CAMK2B) gene is the final output example and its expression patterns are shown in Fig 3. It is involved in neuronal plasticity and synapse formation [31]. In the RNA-Seq Human eFP Browser view, the highest expression levels are found in the brain and, to a lesser extent, in skeletal muscle. In this view, it is also possible to view related information and link-outs to other resources.

Discussion
When considering global microarray or RNA-seq gene expression profiling studies, gene expression levels are a useful guide to that gene's biology. The Human eFP Browser provides users with the ability to easily visualize and rapidly interpret the results of gene expression studies in humans. While many human gene expression studies focus on a particular area of the human body, this tool enables the user to interpret gene expression levels across multiple tissue types. Moreover, for users who are less familiar with human anatomy, such expression data sets will become more accessible as the data are given anatomical context, as opposed to being shown as a bar graph.
In order to provide examples of the utility of the Human eFP Browser, we chose three genes that are expected to show high levels of gene expression in specific tissues. INS shows highest expression in the islet cells (Fig 1), while SIX3 shows highest expression in the putamen and nucleus accumbens (Fig 2), and CAMK2B shows highest expression in the brain (Fig 3). These examples show the utility of this tool for visualizing gene expression data sets (both microarray- and RNA-seq-based).
At present, link-outs are provided to several common repositories for gene information in order to provide further details at the click of a mouse. Users can also access the relevant experiment records in GEO by clicking on individual tissues on the image. Additionally, on mouseover the tissue name and expression value (absolute, or relative with fold-change or standard deviation) are displayed. Underneath the main image, a link is provided to a table listing all sample names, expression values, fold-changes, and standard deviations, as well as a chart showing the same information. Gene-specific link-outs to entries in other databases can be found above the main image. In the future, as more human gene expression experiments are conducted, we envisage adding further data sets and views to this tool, including those that have been profiled on other platforms. Current and future activities involve adding further developmental data sets, as well as disease data sets (e.g. cancer gene expression studies), to the Human eFP Browser. In this way, the Human eFP Browser can become a comprehensive resource for visualization and interpretation of human gene expression data and an aggregator of link-outs to various other resources. We encourage any researcher to contact us with ideas for specific views.
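As an illustration of the kind of per-tissue table described above, the following Python sketch computes a mean, a standard deviation, and a fold-change for each tissue from replicate values. The function name, the row layout, and the choice of the median of tissue means as the control level are assumptions for this example and are not taken from the Human eFP Browser source.

```python
from statistics import mean, median, stdev


def summarise(replicates):
    """Build one summary row per tissue from {tissue: [replicate values]}.

    Assumption: the fold-change is the tissue mean divided by the median of
    all tissue means in the view.
    """
    means = {tissue: mean(values) for tissue, values in replicates.items()}
    control = median(means.values())
    rows = []
    for tissue, values in replicates.items():
        # A sample standard deviation needs at least two replicates.
        sd = stdev(values) if len(values) > 1 else 0.0
        fold = means[tissue] / control if control > 0 else float("nan")
        rows.append(
            {"tissue": tissue, "mean": means[tissue], "sd": sd, "fold_change": fold}
        )
    return rows


# Hypothetical replicate values, for illustration only.
example_rows = summarise(
    {
        "pancreas": [790.0, 810.0],
        "liver": [100.0, 100.0],
        "lung": [40.0, 60.0],
    }
)
```

Each row in `example_rows` corresponds to one line of such a table and could equally be used to draw the accompanying chart.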

Materials and Methods
A number of human microarray data sets are represented within the Human eFP Browser. From GEO, the following data sets are represented: GSE1133, GSE475, GSE2361, GSE3526, GSE8961, GSE4567, GSE7307, and GSE19650. Other data sets are from ArrayExpress: E-MTAB-47, E-GEOD-6257, and E-MEXP-2219. All microarray data sets were normalized in R/Bioconductor using the MAS 5 method with a target value of 100 with the following commands:
#Load affy package
> library(affy)
#Set working directory to directory containing the data you wish to normalize
#Write the data to a csv file
> write.exprs(GSE35261Norm, file = "GSE35261Norm_tgt100.csv")
The RNA-Seq FPKM data set was processed by Eric Minikel of cureFFI.org (http://www.cureffi.org/2013/07/11/tissue-specific-gene-expression-data-based-on-humanbodymap-2-0/). The processing prior to our download was as follows: Ensembl BAM files were downloaded, Cufflinks was used to summarize expression levels as FPKM values, and only known transcripts were called.
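The two expression scales mentioned above can be illustrated with a short Python sketch. The first function shows only the final target-scaling idea of a MAS 5 style normalization, in which all signals on an array are multiplied by one factor so that their trimmed mean equals the target value (100 here). It does not reproduce the full MAS 5 signal calculation, and the 2% trim fraction is an assumption. The second function is the standard FPKM definition. Cufflinks estimates FPKM with a more elaborate statistical model, so this is a simplification. All function names are hypothetical.

```python
def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def scale_to_target(signals, target=100.0, trim=0.02):
    """Rescale one array's signals so that their trimmed mean equals `target`."""
    factor = target / trimmed_mean(signals, trim)
    return [signal * factor for signal in signals]


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical values, for illustration only.
scaled = scale_to_target([50.0, 150.0, 250.0, 350.0])
example_fpkm = fpkm(200, 2000, 10_000_000)
```

Scaling every array to the same target makes signal values from different arrays of the same platform comparable in magnitude. It does not make microarray signals comparable with FPKM values.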
The Human eFP Browser is implemented in Python, and inputs include a Targa-based image, an XML control file, gene identifier to microarray probe set lookup and annotation databases, and a gene expression database for the given samples. These components work together to produce an output image, as described in Winter et al. (2007). The eFP Browser open source code is available at http://sourceforge.net/projects/efpbrowser/ and original expression data may be downloaded from GEO or ArrayExpress using the accession numbers listed above. Processed data are at https://github.com/asherpasha/eFP_Human_Databases under the DOI of 10.5281/zenodo.45940.
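One way in which an image and an XML control file could work together to produce a coloured output is sketched below in Python. This is an illustration under stated assumptions rather than the eFP Browser's implementation: it assumes that each tissue region in the template image is filled with a unique key colour and that the control file maps each key colour to a tissue name. The XML element and attribute names, the tiny pixel grid standing in for a Targa image, and the function names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical control file: one key colour per tissue region (not the real schema).
CONTROL_XML = """
<view name="Example">
  <tissue name="Pancreas" colorKey="#FF0001"/>
  <tissue name="Liver" colorKey="#FF0002"/>
</view>
"""


def hex_to_rgb(hex_colour):
    """Convert '#RRGGBB' to an (R, G, B) tuple of integers."""
    digits = hex_colour.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def load_colour_keys(xml_text):
    """Map each key colour in the control file to its tissue name."""
    root = ET.fromstring(xml_text.strip())
    return {
        hex_to_rgb(tissue.get("colorKey")): tissue.get("name")
        for tissue in root.iter("tissue")
    }


def recolour(pixels, key_to_tissue, tissue_colours):
    """Replace every key-coloured pixel with its tissue's expression colour.

    Pixels that match no key, or whose tissue has no colour, are left unchanged.
    """
    return [
        [tissue_colours.get(key_to_tissue.get(pixel), pixel) for pixel in row]
        for row in pixels
    ]


# A 2 x 2 stand-in for the template image: white background plus two regions.
template = [
    [(255, 255, 255), (255, 0, 1)],
    [(255, 0, 2), (255, 0, 1)],
]
keys = load_colour_keys(CONTROL_XML)
output = recolour(template, keys, {"Pancreas": (255, 0, 0), "Liver": (255, 255, 0)})
```

With such a design, a new view needs only a new template image and control file, while the colouring logic stays the same.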
Illumina Human BodyMap 2.0, under the Creative Commons CC-BY-SA 4.0 license (http://creativecommons.org/licenses/by-sa/4.0/). Finally, we thank Asher Pasha for making the Human eFP data sets available for download on GitHub.

Author Contributions
Conceived and designed the experiments: NJP RVP ETH. Performed the experiments: RVP ETH NJP. Analyzed the data: RVP. Contributed reagents/materials/analysis tools: ETH. Wrote the paper: RVP NJP ETH.