ZODET: Software for the Identification, Analysis and Visualisation of Outlier Genes in Microarray Expression Data

Summary Complex human diseases can show significant heterogeneity between patients with the same phenotypic disorder. An outlier detection strategy was developed to identify variants at the level of gene transcription that are of potential biological and phenotypic importance. Here we describe a graphical software package (z-score outlier detection (ZODET)) that enables identification and visualisation of gross abnormalities in gene expression (outliers) in individuals, using whole genome microarray data. Mean and standard deviation of expression in a healthy control cohort is used to detect both over and under-expressed probes in individual test subjects. We compared the potential of ZODET to detect outlier genes in gene expression datasets with a previously described statistical method, gene tissue index (GTI), using a simulated expression dataset and a publicly available monocyte-derived macrophage microarray dataset. Taken together, these results support ZODET as a novel approach to identify outlier genes of potential pathogenic relevance in complex human diseases. The algorithm is implemented using R packages and Java. Availability The software is freely available from http://www.ucl.ac.uk/medicine/molecular-medicine/publications/microarray-outlier-analysis.


Introduction
Many human diseases, such as inflammatory bowel disease and type 1 diabetes, are complex, multifactorial syndromes with genetic and environmental determinants. Substantial heterogeneity with regards to causation and disease progression exists between individual patients with the same phenotypic disorder. It has been postulated that rare or low frequency variants, structural rearrangements such as deletions, insertions, translocations, and epigenetic variation could be important in the pathogenesis of these complex disorders and account for the observed heterogeneity [1]. All of these are incompletely assessed by current genome wide association studies (GWAS). Many of these genetic changes would be expected to be associated with alterations in gene expression, possibly of large biological effect, ultimately giving rise to phenotypic abnormalities. A recent paper combined population-scale human genomic sequence data with transcriptomic data and identified an enrichment of rare variants associated with outlier gene expression [2]. They concluded that across multiple tissues and developmental stages, an individual would be expected to have hundreds of rare variants with large effects on gene expression. The examination of significantly overexpressed genes in individual patients (or subgroups of patients) has successfully been employed in the field of cancer genomics [3].
There are a number of methods for outlier detection currently in the literature, such as the gene tissue index (GTI), cancer outlier profile analysis (COPA) and outlier robust test (ORT), each of which use different algorithms in order to identify probes that are abnormally expressed in subgroups of patients [4][5][6]. Here we describe a software package based on z-score outlier detection (ZODET) that enables identification of potentially biologically relevant abnormalities in gene expression (outliers) in individuals with complex disorders compared with a comparison population, using whole genome microarray data. By concentrating on individual outliers, we provide a valuable addition to commonly used microarray analysis tools, such as SAM [7].

Software Implementation
Implementation of this software has used two programming technologies: The R statistical programming environment (http:// www.R-project.org), utilising the Biobase package from the Bioconductor platform; and the Java programming language [8]. The analysis can be run via a configurable Graphical User Interface (GUI) or on the command line ( Figure 1). Installation and configuration instructions are provided in the technical documentation supplied with the software (http://www.ucl.ac.uk/ medicine/molecular-medicine/publications/microarray-outlier-analysis). The software is freely available and can be run on either Windows or Mac OSX operating systems.
The purpose of this software is the identification and visual analysis of outlier probes (or genes) from microarray gene expression data. To identify the potential outliers a control group of samples and the experimental samples are defined prior to analysis ( Figure 2A). Currently, the software supports five alternative methods for defining which sub-set of individual samples the potential outliers may occur in and the group of samples (control group) which each is compared to. These five methods of comparison are: For each of the samples selected for outlier analysis, an iterative procedure is carried out on the expression data. Two adjustable thresholds (which both have to be met) are used to identify the probe outliers: (i) the significance level of the standardised deviation of the expression levels from the experimental sample, when compared to the average (mean) expression levels of the control group; and (ii) the (log 2 ) fold-change between the expression level from the experimental sample and the average of the control group. This method assumes that the expression values are normally distributed. Therefore, caution should be taken to ensure the normality of the data, especially when using small sample sizes (e.g., less than 30). Figure 1. The Graphical User Interface (GUI) allows the user to set the analysis parameters, the required fold change, statistical test (p-value, q-value or Bonferroni corrected p-value) and statistical threshold. The analysis can be conducted using the whole experimental group or a specific individual can be chosen. doi:10.1371/journal.pone.0081123.g001 To assess the significance of the standardised deviation of the experimental sample expression values from the control group average, a Z-score is calculated for each probe (or feature) on the microarray: Where: E sample is the expression level of the microarray gene probe for the experimental sample; and m and s are the mean and standard deviation of the expression levels of the gene probe for samples in the control group. From the Z-scores, p-values are then calculated with reference to a standard normal distribution where the p-value is the probability that the probe z-score of the sample being tested falls within the distribution of the control group samples. These p-values -or those corrected for multiple hypothesis testing, on a per gene basis, using either the Bonferroni or q-value method -can be used to assess the significance [9]. The default p-value threshold is 0.01, but this can be easily modified before each analysis run. The fold-change between the expression level of the experimental sample, and the control group, is calculated for each probe on the array using the following method.
This provides an additional measure of the difference in the expression level of the experimental and the control groups and also determines whether the potential outlier-probe is under or Figure 2. Overview of experimental design and output from the ZODET analysis. A. Microarray experimental design will contain two groups, a control group which will provide the data for the normal distribution of each probe and an experimental group (C1-3). The software runs each test group member independently against the control group and identify probes that are expressed at levels significantly outside the normal distribution. B. For each experimental subject the software generates a scatter plot of all of the expressed probes from subject C1 against the mean value from the control group. Probes that are classified as up-or down-regulated are highlighted in red and green, respectively. A Volcano plot is generated to visualise p-value and fold change information for all probes, with over and under expressed outliers highlighted in red and green, respectively. The vertical and horizontal lines represent the fold-change and p-value thresholds used, respectively. A combined dendrogram and heatmap shows the expression level of probes identified as down-regulated in C1, compared to the control group and two additional experimental subjects (C2 and C3). C. The software also generates outlier analysis results for all experimental subjects (C1-3). The total up-regulated and downregulated probes are tabulated for all test subjects along with the results for the three available statistical tests. Combined dendrogram and heatmaps are generated for all of the up-regulated and down-regulated probes identified and hierarchical clustering on both probes and subjects performed. Over and under expressed probes are highlighted in red and green, respectively. doi:10.1371/journal.pone.0081123.g002 over expressed, relative to the control group. In equation 2: foldC is the fold-change; E sample is the expression level of the test sample; and m is the mean expression level of the control group. To correctly estimate the fold-change between sample classes all gene expression levels used as input to this software should be in normalised log 2 format. The software is designed to not be microarray platform specific and can be used to analyse any log 2 transformed microarray expression data (e.g., from Affymetrix, Illumina or other arrays). The default fold-change threshold is 1, which can be easily modified before each analysis run. User defined annotation files can also be used to associate the microarray probes with the relevant gene descriptions.
To aid the interpretation and visualisation of the results a number of clearly formatted graphs and tables are automatically generated by the software during the analysis, including: For each individual ( Figure

Patient Recruitment
These studies were approved by the Joint UCL/UCLH Committee for the Ethics of Human Research (project numbers 02/0324). Forty volunteers were recruited from University College London Hospital (UCLH)/University College London (UCL). Written consent was obtained from all volunteers.

Monocyte derived macrophage isolation and culture
Peripheral venous blood was collected from subjects into syringes containing 5 U/ml heparin. Mononuclear cells were isolated by differential centrifugation (800 g, 30 min, 20uC) over Lymphoprep and washed twice with sterile phosphate-buffered saline (PBS; GIBCO, Paisley, UK) at 500 g (5 min, 20uC). Cells were resuspended in 10 ml RPMI-1640 medium (Invitrogen) supplemented with 100 U/ml of penicillin (GIBCO), 100 mg/ml streptomycin (GIBCO) and 20 mM HEPES pH 7.4 (Sigma-Aldrich), and plated at a density of approximately 5610 6 cells/ml in 8 cm 2 Nunclon TM Surface tissue culture dishes (Nunc, Roskilde, Denmark) at 37uC, 5% CO 2 . After 2 h, non-adherent cells were discarded and 10 ml of fresh RPMI supplemented with 10% foetal bovine serum (FBS; Sigma) added to each tissue culture dish. Cells were then cultured for 5 days at 37uC, 5% CO2, with the addition of a further 10 ml of fresh 10% FBS/RPMI after 24 h. Cells were then washed twice in PBS, scraped and spun down at 500 g (5 min, 20uC). Cells were resuspended into X-vivo-15 medium (Cambrex, MD, USA) and plated at a density of 10 6 cells per 8 cm 2 Nunclon TM dish for a further 25 h at 37uC, 5% CO2.

RNA purification
Total RNA was prepared from monocyte-derived macrophages, using the RNeasy Mini Kit with RNase-free DNase treatment (Qiagen GmbH, Hilden, Germany). Optical density readings were determined for OD 260 /OD 280 and OD 260 /OD 230 using a NanoDrop ND-1000 spectrophotometer (Fisher Scientific, Loughborough, UK) to assess protein and solvent contamination respectively.

Whole genome microarray analysis
For each sample, 500 ng of total RNA was amplified and purified using the Illumina TotalPrep-96 RNA Amplification kit (Ambion, UK), according to the manufacturer's instructions.
Biotin-Labelled cRNA was then normalised to a concentration of 150 ng/ml and 750 ng was hybridised to Illumina Human-WG6 v3.0 Expression BeadChips (Illumina CA, USA) for 16 h at 58uC. Following hybridisation, beadarrays were washed and stained with streptavidin-Cy3 (GE Healthcare, UK). Beadarrays were scanned using the Beadarray reader and image data was then processed using Genome Studio software (Illumina, CA, USA). The cubic spline normalised data and subject information can be found at http://www.ucl.ac.uk/medicine/molecular-medicine/publications/ microarray-outlier-analysis. The microarray data has also been deposited in the Gene Expression Omnibus (GEO) under accession GSE51256.
Due to the relatively low sample sizes, the dataset was tested for normality using the lillie.test function from the R-package nortest (http://cran.r-projects.org/web/packages/nortest/) [10]. All 20,019 probes on the array were assessed for normality using a p-value of 0.05, followed by correction for multiple testing. This resulted in a very small number of probes (240) found to be not normally distributed indicating that the vast majority of probes are normally distributed and can therefore be used as input to the ZODET method.

Genomic DNA extraction, PCR verification and sequencing
Peripheral blood samples were collected in an ethylenediaminetetraacetic acid disodium salt vacutainer (BD Bioscience, UK) and gDNA extracted using the QIAamp DNA blood Mini Kit (Qiagen GmbH), in accordance with the manufacturer's instructions. PCR verification of the chromosome translocation involving G-protein receptor-128 (GPR128) and TRK-fused gene (TFG) was determined using TFG forward primer CCACAGCCTACCTGT-GAGTG, GPR128 reverse primer TGGGTTGTTTGTGGAAAT. Verification of the intact GPR128 gene was performed using the reverse primer listed for the translocation in combination with forward primer GCAGGCTTTCTTTCTTGAGG. A/G single-nucleotide polymorphism at the exon 7 splice-acceptor site (rs10774671) within 29-59-oligoadenylate synthetase 1 (OAS1) was determined following the method describes previously [11].

Generation of Simulated Expression Dataset
A series of simulated expression datasets were generated to contain a total of 20,000 ''genes'' (or probes) and 100 samples. The samples were divided into equally sized groups of 50 control samples and 50 experimental samples, with the gene expression values for each group generated from a standard normal distribution (i.e., with a mean value of 0 and standard deviation of 1). In order to simulate example cases of gene outliers a range of genes were randomly selected to contain an equal number (20, 50, 100, 250, 500, and 1000) of up-and down-regulated patient samples relative to the control group. For each of these simulated outlier genes, we randomly assigned a number of samples from the experimental test group (between 5 and 10) to correspond to expression levels significantly different to the standard normal distributions. These expression levels were randomly assigned so as to correspond to p-values of being part of the standardised normal sample distributions of between 0.05 and 1610 26 .

Comparison to Existing Gene Outlier Detection Methods
Recently, the GTI outlier detection algorithm was described and compared to other existing gene expression outlier detection methods [4]. It was shown that the GTI method performs comparably to other outlier detection methods (including: cancer outlier profile analysis (COPA); outlier sums (OS); and the outlier robust t-statistic (ORT)) when using a simulated expression dataset [5,6,12]. In order to benchmark our z-score based outlier detection method ZODET against these existing methods we performed a similar comparison to the GTI method, using both simulated expression datasets and also an expression dataset compiled from a cohort of human monocyte-derived macrophage samples split into two equal sized groups.
Both the ZODET and the GTI methods were run on the simulated expression datasets of 100 samples and 20,000 probes and the resulting outputs compared. ZODET was run with the default p-value detection method and threshold, of 0.01, and a fold-change threshold of 0.5. The GTI method was also run with default settings (except that an automatic log transformation of the expression values was not carried out).
To compare the ability of the two methods to correctly identify probes that correspond to the simulated outlier expression values we used Receiver Operating Characteristic (ROC) curves on the ranked outputs ( Figure 3). The GTI output was ranked in descending ''GTI score'' order and ZODET was ranked using the number of samples that were detected to be associated with an outlier probe. These methods of ranking highlight an important difference between outlier detection using ZODET and the GTI method; that is, the GTI method only detects whether a particular gene (or microarray probe) contains outlier samples, whereas ZODET detects the likelihood that a gene (or probe) is an outlier for each of the individual test samples. Therefore, ZODET has the advantage of being able to identify both the genes and the specific test experimental samples that are significantly associated with outlier gene expression values.
Interestingly, both methods consistently performed well when identifying the up-regulated true positive outliers in the simulated expression dataset, resulting in an area under the ROC curve (AUC) of 1.0 (Figure 3). When identifying the down-regulated outliers, ZODET, clearly out-performed the GTI method, with average AUC values of 0.998 and 0.476, respectively (Figure 3). This was expected because the GTI method is specifically designed to identify outliers that are over-expressed relative to the control group, whereas ZODET detects both under-and over-expressed outliers equally well.
In summary, the two methods performed equivalently when identifying the up-regulated outlier genes from the simulated dataset, but importantly ZODET is also able to detect downregulated outliers to a high level of accuracy.

Identification of outlier genes in a monocyte-derived macrophage microarray dataset
An example microarray dataset is available on our webpage (http://www.ucl.ac.uk/medicine/molecular-medicine/publications/ microarray-outlier-analysis). This dataset consists of two data files: (i) an ''EXPRESSION DATA'' file -containing the microarray expression data from monocyte-derived macrophages collected from 40 volunteers split into two equal sized groups labeled control (A1 to 20) and experimental (B1 to 20); and (ii) an ''EXPERIMENT INFORMATION'' file -containing associated experimental parameters. The technical specifications of these input files are described in the provided software documentation.
A ZODET analysis was performed with the thresholds set at a log2 fold-change .1.75 and unadjusted p-value ,0.0001. Combining the ZODET analysis results from subjects B1 to B20 identified 19 up-and 14 down-regulated probes in this population ( Table 1). Three probes corresponding to two genes were identified as significantly up-regulated in three or more individuals within the experimental cohort. Three individuals were found to over-express either G-protein coupled receptor 128 (GPR128) (ILMN_2125395 and ILMN_1808078) or chemokine (C-C motif) ligand 3-like 1 (CCL3L1)(ILMN_1773245). Two probes corresponding to two genes were found to be down-regulated in three or more individuals in the experimental population. 29-59oligoadenylate synthetase 1 (OAS1) (ILMN_1658247) and ribosomal protein S23 (RPS23) (ILMN_1772459) were significantly attenuated in six and three of the twenty subjects tested, respectively. However, the probe for RPS23 (ILNM_1772459) contains a recognised single nucleotide polymorphism (SNP) (rs3738) which accounts for the three down-regulated outliers identified in the experimental cohort. The SNP results in poor probe hybridisation and not a genuine reduction in gene expression. PCR verification should be conducted on all identified probes in order to discount false positives resulting from poor probe hybridisation.
A GTI analysis was also performed on the same data set and the results compared to the output from ZODET (Table 1). In accordance with the simulated data analysis, the GTI software was only capable of identifying the up-regulated probes and gave the highest ranking to both of the GPR128 probes which were also identified using ZODET.
In order to determine if the abnormal gene expression results from genetic mutations, both GPR128 and OAS1 genes were sequenced in the subjects identified as outliers. No mutations were found in any of the individuals who over-expressed GPR128 (data not shown). However, a previous report identified a chromosome translocation involving GPR128 and TRK-fused gene (TFG) which results in the generation of an in-frame TFG-GPR128 fusion transcript [13]. This transcript was found to be constitutively expressed in all tissues and could be the possible cause of GPR128 over expression identified in these individuals. PCR analysis and sequencing was performed on the three outlier subjects ( Figure 4A). The results confirmed the presence of the TFG-GPR128 translocation in the three patients identified as outliers for the GPR128 gene. Further PCR analysis was carried out on the remaining 17 experimental subjects and 20 controls. One additional experimental subject was found to carry the translocation and from the expression data the increased levels of GPR128 can be seen but the thresholds applied to the ZODET analysis meant that this individual was not identified as an outlier ( Figure 4B, green arrow). None of the control subjects carried the translocation. Therefore the ZODET analysis identified three out of the four experimental subjects that were carriers of the TFG-GPR128 translocation. The most common under-expressed outlier gene was OAS1 which was found in six experimental individuals. The gene was sequenced in all six subjects and was found to be homozygous for a known A/G single-nucleotide polymorphism at the exon 7 splice-acceptor site (rs10774671) in the OAS1 gene ( Figure 4C) [11]. Further sequencing of the remaining experimental subjects identified an additional individual who was homozygous for the rs10774671 polymorphism ( Figure 4D, green arrow). Therefore, the ZODET analysis identified six of the seven individuals within the experimental cohort that expressed an OAS1 splice variant. In summary, the ZODET analysis identified a number of outlier genes within the experimental population two of which we were able to show resulted from a genetic abnormality. GTI identified the translocation event but missed the splice site polymorphism which resulted in attenuated gene expression and was present in 40% of the experimental population. As these abnormalities were only present in a subset of the patients in the experimental group, standard group-based statistics (e.g. a t-test) would fail to identify a significant difference. These findings demonstrate a potential advantage of the ZODET analysis over the currently available software.

Discussion
We have described software designed for the identification, analysis and visualisation of outlier genes and individual outlier samples in microarray expression data via a user-friendly GUI. This method implements a z-score based measure of the level of deviation of individual microarray samples, for a particular gene, from an assumed normal distribution of control comparison samples. This differs in approach from the GTI method, which calculates a score for each gene related to the proportion of samples in a specific group that have gene expression levels above a defined statistical outlier threshold. An important advantage of the ZODET software is the user-friendly implementation of a zscore based outlier detection method, which has been specifically designed to be used by individuals with limited bioinformatic experience through a GUI.
Since the GTI method was previously shown to perform comparably to other outlier detection methods we benchmarked the performance of ZODET to this method [4]. We have shown that ZODET performs comparably to the GTI method when identifying up-regulated outliers in a large simulated expression dataset containing a defined numbers of outliers at varying levels of statistical significance from the standardised population mean. ZODET was also shown to perform with a very high accuracy and very low false positive rate when identifying genes and samples simulated to represent down-regulated outliers.
In addition to outlier analysis, other methods that use gene expression variability between groups can be used to identify genes that may have biological relevance [14][15][16][17]. However, in general for these approaches a more robust and dynamic change in expression is required, whereas outlier analysis methods, such as ZODET, can identify an individual or small sub-groups within the experimental population.
We have also demonstrated the power of the method for detecting outlier genes of biological interest in a dataset containing gene expression profiles from a cohort of primary monocyte derived macrophages isolated from human subjects. In particular, the most common under-and over-expressed genes identified by ZODET were shown to result from germline mutations. A functional consequence resulting from the TFG-GPR128 translocation has so far not been reported. Whereas, a previous report has shown that the OAS1 splice site mutation results in the loss of the commonly expressed OAS1 enzyme (p46) and the generation  The GTI ranking is based on the 22,375 probes analysed. ZODET thresholds were set at p,0.0001 and fold change .1.75. All identified probes are shown along with, the gene symbol, the number of individuals who were classified as outliers, the GTI score and GTI rank. The ''count'' shows the number of individuals from the experimental group that were classified as outliers with respect to the control group for each probe. The 19 up-regulated probes and 14 down-regulated probes are shown. doi:10.1371/journal.pone.0081123.t001 of a dual-function antiviral/proapoptotic isoform (p48) and a novel isoform (p52) with no known role. (11) The fact that these genetic alterations can be identified provides evidence that this form of analysis could be of benefit in the study of complex heterogeneous diseases. Proof that this form of analysis can provide diseaserelevant findings comes from the identification of the fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer through the use of the COPA software [3]. Of particular note was the ability of ZODET to identify outlier genes that were either significantly up-or down-regulated when compared to the controls. ZODET is also capable of identifying an individual gene expression abnormality in one experimental subject, which may allow the identification of rare genetic events that have extremely high disease penetrance. These properties provide an important addition to other commonly used outlier detection methods, such as COPA and GTI, which concentrate solely on the identification of strongly up-regulated or amplified genes in subgroups of patients. Each of these outlier detection methods has advantages and disadvantages and they could well be most effectively utilised in combination with each other, although we have not attempted to do so in this study.