Conceived and designed the experiments: MT LB. Performed the experiments: MT. Analyzed the data: MT JT AW JL LB TT. Contributed reagents/materials/analysis tools: TT. Wrote the paper: MT JT LB. Other: Confirmed statistical analysis: JL AW.
The authors have declared that no competing interests exist.
Treating hepatitis C with interferon/ribavirin results in a varied response in terms of decrease in viral titer and ultimate outcome. Marked responders have a sharp decline in viral titer within a few days of treatment initiation, whereas in other patients there is no effect on the virus (poor responders). Previous studies have shown that combination therapy modifies expression of hundreds of genes in vitro and in vivo. However, identifying which, if any, of these genes have a role in viral clearance remains challenging.
The goal of this paper is to link viral levels with gene expression and thereby identify genes that may be responsible for early decrease in viral titer.
Microarrays were performed on RNA isolated from PBMC of patients undergoing interferon/ribavirin therapy. Samples were collected at pre-treatment (day 0), and 1, 2, 7, 14 and 28 days after initiating treatment. A novel method was applied to identify genes that are linked to a decrease in viral titer during interferon/ribavirin treatment. The method uses the relationship between inter-patient gene expression based proximities and inter-patient viral titer based proximities to define the association between microarray gene expression measurements of each gene and viral-titer measurements.
We detected 36 unique genes whose expression provides a clustering of patients that resembles the viral-titer-based clustering of patients. These genes include IRF7, MX1, OASL, OAS2, viperin and many ISGs of unknown function.
The genes identified by this method appear to play a major role in the reduction of hepatitis C virus during the early phase of treatment. The method has broad utility and can be used to analyze response to any group of factors influencing biological outcome such as antiviral drugs or anti-cancer agents where microarray data are available.
Treating patients who have chronic hepatitis C virus (HCV) infection with peginterferon/ribavirin combination therapy results in a varied response in terms of outcome and decrease in viral titer
In this paper we report a novel mathematical method to explore the association between decrease in viral titer and changes in gene expression in hepatitis C patients following combination treatment with pegylated interferon and ribavirin. The viral clearance time course profile will not necessarily directly correlate with the gene expression time course profile even if the gene is an active participant of the interferon treatment response because the decrease of the viral levels depends on the interplay of many genes and gene products. Therefore, an indirect approach was used in which the relationship between gene expression across days and viral decrease was examined using inter-patient distances (proximity) according to both characteristics.
Using this approach we selected thirty-seven gene probes that were linked with the anti-HCV response during the first 28 days of treatment. The association of the detected genes with the viral decrease is shown visually by a comparison of patient clusterings. The inter-patient proximities according to the pattern of decrease in virus titer provide an unsupervised clustering of patients based on changes in viral levels. Similarly, the inter-patient proximities according to expression of the selected genes across time provide another unsupervised clustering of patients. Visual inspection of the viral-titer-based and the gene-expression-based clusterings of patients indicates their close relationship. Since the unsupervised clustering of patients according to the pattern of viral clearance is in good correspondence with an a priori biological categorization of patients into marked, slow and poor response at day 28
The Virahep-C study included a cohort of 401 participants who provided written consent at 8 U.S. clinical centers between September 9, 2002 and January 7, 2004. Per protocol, all participants were treated for up to 48 weeks with pegylated interferon alfa-2a (PEGASYS, Roche Inc., Nutley, NJ) at 180 mcg weekly by self-administered subcutaneous injection and ribavirin (COPEGUS, Roche Inc., Nutley, NJ) by mouth at 1000 mg/day for those who weighed less than 75 kg or 1200 mg/day for those who weighed at least 75 kg. Treatment was discontinued at week 24 in participants who had detectable serum HCV RNA in duplicate qualitative assays using the Roche Cobas Amplicor HCV Test, v2.0 (sensitivity 50 IU/ml). The primary endpoint of the study was the sustained virologic response, defined as an undetectable serum HCV RNA at week 72 (at least 24 weeks after completion of treatment). The clinical trial registration number is NCT00038974.
Fifty-two patients for whom gene expression and viral level data were available at all time points were selected for this analysis. Based on the log-decline in serum HCV RNA on day 28 of treatment relative to day 0, as measured by the quantitative Roche Cobas Monitor HCV Test, v2.0 (sensitivity 600 IU/ml), these patients were divided into three groups
This study received all necessary approvals from the Institutional Review Boards of each institution participating in the Virahep-C consortium (Beth Israel Deaconess Medical Center, New York Presbyterian Medical Center, Rush University, University of California at San Francisco, University of Maryland, University of Miami, University of Michigan, and University of North Carolina).
Peripheral blood mononuclear cells (PBMC) were collected in sodium heparin-CPT tubes at days 0, 1, 2, 7, 14 and 28. Samples were shipped overnight from each clinical center to a central repository by express courier at 4°C. Whole blood was diluted with an equal volume (8 ml) of phosphate buffered saline (PBS), carefully layered over a 10 ml Ficoll-Hypaque gradient (Amersham/Pharmacia) and centrifuged at 800 rpm for 20 minutes at room temperature. The buffy coat layer was transferred to a 15 ml RNAse-free tube and further diluted with PBS. Tubes were centrifuged at 100 × g for 15 minutes at room temperature. The supernatants were discarded and the PBMC were retained.
The isolation of RNA, quality control, and the labeling and hybridization onto the microarrays have been previously described
The microarrays were scanned using a dedicated Model 3000 scanner controlled by Affymetrix Microarray Suite 5 software (MAS5). The average intensity on each array was normalized by global scaling to a target intensity of 1000. Data were exported from MAS5 into a custom-designed database (MicroArray Data Portal) in the Center for Medical Genomics (IUPUI, Indianapolis). The data from the microarrays have been deposited with NCBI GEO, accession number GSE7123
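The global scaling step described above can be illustrated with a minimal sketch. It assumes that "average intensity" means the plain arithmetic mean of the signals on one array; the exact averaging rule applied by the software is not stated in the text, and the function name and the signal values are hypothetical.

```python
from statistics import mean

def global_scale(intensities, target=1000.0):
    """Rescale one array so that its mean intensity equals `target`.

    Assumption: a plain arithmetic mean is used as the array average.
    """
    factor = target / mean(intensities)
    return [x * factor for x in intensities]

# Hypothetical probe-set signals from a single array (mean = 500).
raw = [100.0, 200.0, 300.0, 1400.0]
scaled = global_scale(raw)
```

Because every signal on the array is multiplied by the same factor, the ratios between probe sets within an array are unchanged; only the overall brightness is made comparable across arrays.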
Visualizing the patient data was simplified through Principal Component Analysis (PCA)
Both PCA and clustering of patients according to either viral titer or gene expression data were performed to visualize the correspondence between the predefined grouping of patients (marked, intermediate, and poor, highlighted by color) and the grouping of patients based on either gene expression or viral titer.
Detection of genes that provided clustering similar to viral titer based clustering of patients was performed by a mathematical method similar to the mirror tree method for inferring protein interactions from phylogenetic distance matrices
The distance between patients' viral titer changes from day 0 was used to estimate the inter-patient proximity with respect to viral change. Similarly, the gene-expression-based inter-patient proximity was measured for each gene. The correlation between each gene-based proximity matrix and the viral-titer-based proximity matrix was estimated, and genes with the highest correlations were selected.
As metrics of inter-patient proximity two measures were used: Euclidean distance and the coefficient of covariation. The larger the coefficient of covariation, the closer the patients are whereas the smaller the Euclidean distance the closer the patients are. Thus, the inter-patient distances based on viral titer time-course measurements created a grouping of patients that reflected the response of patients to antiviral treatment. On the other hand, the inter-patient proximities based on gene expression measurements across days (gene expression profile) for a specific gene reflected the variability of patients either according to response of this gene to treatment, or according to genetic heterogeneity of patients, or both.
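The two proximity measures and the matrix-to-matrix correlation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the "coefficient of covariation" is taken here to be the sample covariance between two patients' time profiles, the function names are hypothetical, and the four-patient data set is invented.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two time profiles (smaller = closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def covariance(a, b):
    """Sample covariance between two time profiles (larger = closer).
    Assumption: this is what the text calls the coefficient of covariation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def proximity_vector(profiles, metric):
    """Upper triangle of the inter-patient matrix as a flat vector
    with k(k-1)/2 entries for k patients."""
    pairs = combinations(range(len(profiles)), 2)
    return [metric(profiles[i], profiles[j]) for i, j in pairs]

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    den = math.sqrt(sum((x - mu) ** 2 for x in u) * sum((y - mv) ** 2 for y in v))
    return num / den

# Invented data: 4 patients, ln-changes from day 0 at days 1, 2, 7, 14, 28.
viral = [[-1.0, -2.0, -3.0, -4.0, -5.0], [-0.9, -2.1, -3.2, -4.1, -5.2],
         [-0.1, -0.2, -0.1, -0.3, -0.2], [0.0, -0.1, -0.2, -0.2, -0.3]]
gene = [[2.0, 2.5, 2.2, 1.8, 1.5], [2.1, 2.4, 2.3, 1.9, 1.4],
        [0.2, 0.3, 0.1, 0.2, 0.1], [0.1, 0.2, 0.2, 0.1, 0.0]]

vt = proximity_vector(viral, euclidean)   # viral-titer-based proximities
g = proximity_vector(gene, euclidean)     # proximities for one gene
r = pearson(vt, g)                        # high r suggests a VT-linked gene
```

In this toy example the gene separates the same two pairs of patients as the viral titer does, so the two proximity vectors are strongly correlated even though the gene and viral time courses themselves have different shapes.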
The Euclidean distance between the natural logarithms (ln) of changes in viral levels from baseline (day 0) for patient
Similarly, the inter-patient Euclidean proximities with respect to the natural logarithm (ln) of the gene expression for the
The inter-patient proximities according to expression of
We will refer to the vectors
Similarly to Euclidean distance based vectors
The associations between vectors G, GD and vectors MV and VT, for both the Euclidean and the Covariance metrics, were estimated by Pearson correlation coefficients. We assume that (i) the log-signal of gene expression in a patient is normally distributed, (ii) expression values of a gene across patients are independent, and (iii) for all genes there is no dependence between the between-patient distances according to gene expression and the between-patient distances according to viral titer values. The method detects linkage of genes with viral titer as statistically significant deviations from assumption (iii), i.e., correlation coefficients that differ from 0. The statistical significance of correlation coefficients may be calculated using Fisher's Z transformation
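A test of a correlation coefficient against 0 using Fisher's Z transformation can be sketched as below. The function name is hypothetical, and the choice of the effective number of observations n is an assumption: the entries of an inter-patient proximity vector are not mutually independent, so n would need to be adjusted, and the adjustment used in the study is not reproduced here.

```python
import math

def fisher_z_pvalue(r, n):
    """Two-sided p-value for the null hypothesis rho = 0.

    r: sample Pearson correlation; n: effective number of independent
    observations (an assumption to be chosen by the analyst).
    """
    z = math.atanh(r) * math.sqrt(n - 3)       # approx. standard normal under H0
    return math.erfc(abs(z) / math.sqrt(2))    # equals 2 * (1 - Phi(|z|))

p_small_n = fisher_z_pvalue(0.24, 52)      # n set to the number of patients
p_large_n = fisher_z_pvalue(0.24, 1326)    # n set to the 52*51/2 patient pairs
```

The same correlation can be far from or far beyond conventional significance depending on n, which is why the number of degrees of freedom has to be treated carefully when the observations are pairwise distances.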
In order to take into account both Euclidean and Covariance measures of proximity in the selection of VT-linked genes, the analysis was done as follows. A multidimensional scaling
The deviation of some GD-vectors from the core of their distribution in the multi-scaling space in the direction of the VT vectors (i.e. in the PC2 direction) indicates a link of these gene-days (and genes) with viral titers. As the initial criterion for gene-day detection, a threshold on PC2 (PC2 > threshold) was used. The statistical significance of such gene detection was checked by Fisher's Z-test and via the beta distribution for correlation coefficients of the G (GD) vectors with the VTcov vector. Both tests used an adjusted number of degrees of freedom to compensate for a weak inter-dependence of vector coordinates.
A more accurate check of the significance of gene detection and the estimation of the False Discovery Rate (FDR) were performed through permutations. In the first step, the distribution of “random” gene-days in the multiscaling space was prepared by permuting gene-day log-signals over the 52 patients of the study. The correlations of the permuted gene-day based inter-patient matrices with the VT and MV vectors in the k(k−1)/2 space were then calculated, and the previously found transformation of the initial 8-dimensional space into the multiscaling space was applied to these correlations.
The FDR for the detected set of real genes was estimated as follows. The subspace occupied by the gene set in the multiscaling space was defined as a sphere around the central position of this set. The ratio of the number of permuted gene-days inside the sphere (normalized to the size of the real gene set) to the number of real gene-days inside the sphere is the False Discovery Rate estimate for a sphere of the given radius.
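The sphere-based FDR estimate can be sketched as follows. This is one reading of the procedure, under the assumption that "normalized to the size of the real gene set" means dividing the permuted count by the number of permutations, since the permuted pool is that many times larger than the real one. The function name and all coordinates are hypothetical.

```python
import math

def fdr_in_sphere(real_points, permuted_points, center, radius, n_permutations):
    """Estimate FDR as (permuted gene-days in sphere, per permutation)
    divided by (real gene-days in sphere)."""
    def inside(point):
        return math.dist(point, center) <= radius

    real_inside = sum(inside(p) for p in real_points)
    perm_inside = sum(inside(p) for p in permuted_points)
    if real_inside == 0:
        return float("nan")  # undefined when the sphere holds no real gene-days
    return (perm_inside / n_permutations) / real_inside

# Invented 2-D coordinates standing in for the multiscaling space.
real = [(0.0, 5.0), (0.5, 5.0), (3.0, 0.0)]
permuted = [(0.0, 4.5)] + [(0.1 * i, 0.0) for i in range(29)]  # pooled from 10 permutations
fdr = fdr_in_sphere(real, permuted, center=(0.0, 5.0), radius=1.0, n_permutations=10)
```

Here two real gene-days and one permuted gene-day fall inside the sphere, so the estimate is (1/10)/2 = 0.05. Shrinking the radius lowers the expected number of chance hits but also excludes real gene-days, which is the trade-off behind choosing a radius for a target FDR.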
As a first step, the correspondence between the patient classification based on decrease in virus titer by day 28 and the unsupervised viral-titer-based clustering of patients was tested. Natural log-transformed viral titers of the 52 patients were normalized using the baseline viral level (day 0), i.e. ln(vi) − ln(v0), where vi is the viral titer value at day i and v0 is the value at day 0. The clustering of patients was done by the hierarchical UPGMA method
All viral titer values of each patient are normalized by the day 0 viral titer measurement of the same patient. The hierarchical (UPGMA) clustering based on the Euclidean measure indicates a good separation of the early response Marked (pink) and Poor response (green) patients. The Slow (yellow) patients are mostly concentrated in the intermediate branch of the clustering tree.
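The baseline normalization and average-linkage (UPGMA) clustering can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance between normalized profiles; the function names and the titer values are hypothetical, and a production analysis would more likely use an established clustering library.

```python
import math
from itertools import combinations

def baseline_normalize(titers):
    """Return ln(v_i) - ln(v_0) for each day after baseline; titers[0] is day 0."""
    base = math.log(titers[0])
    return [math.log(v) - base for v in titers[1:]]

def upgma(profiles):
    """Average-linkage agglomeration. Returns the merge history as
    (members_a, members_b, average_distance) tuples, closest pair first."""
    clusters = {i: [i] for i in range(len(profiles))}

    def avg_dist(a, b):
        total = sum(math.dist(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    merges, next_id = [], len(profiles)
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: avg_dist(clusters[pair[0]], clusters[pair[1]]))
        d = avg_dist(clusters[a], clusters[b])
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

# Invented titers (IU/ml) at days 0, 1, 2, 7, 14, 28 for four patients.
titers = [[1e6, 1e5, 5e4, 1e4, 1e3, 1e3],   # sharp decline
          [2e6, 2e5, 8e4, 3e4, 3e3, 2e3],   # sharp decline
          [1e6, 9e5, 8e5, 9e5, 7e5, 8e5],   # almost no decline
          [3e6, 3e6, 2e6, 3e6, 2e6, 2e6]]   # almost no decline
profiles = [baseline_normalize(t) for t in titers]
merges = upgma(profiles)
```

Because each patient is normalized to their own day 0 value, patients with very different starting titers but a similar relative decline end up close together, so the two declining patients and the two non-declining patients each form a branch before the branches are joined.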
Another compact visualization of patient grouping according to the same baseline-normalized viral titer data was performed using PCA. The first two principal components covered 93% of total data variability. The distribution of patients using the classification given above demonstrated clear separation of poor response patients from marked response patients (
The first two principal components cover 93% of the data variability and thus give a good representation of the patient distribution in the 5-dimensional space. There is a clear divergence of Marked (dark blue points) and Poor (green points) response patients. The separation line (dotted blue line) indicates the virtual border between these two populations. Slow patients (pink points) are mostly concentrated along the borderline.
In the second step, the viral titer linked genes were determined. We defined VT-linked genes as those that produced a clustering of patients similar to the viral-titer-based clustering of patients. The procedure for their detection was as follows. Gene expression for each day was normalized with regard to day 0 expression, as was done for viral titer. The determination of genes was based on the hypothesis that the between-patient proximities according to the gene and gene-day expression pattern of specific genes are correlated with the viral-titer-based inter-patient proximities. Thus we looked for genes (gene-days) that were associated with the viral titer. The selection was done through a multidimensional scaling PCA representation of the space of correlations between inter-patient distance matrices (
Any matrix of inter-patient distances for k patients can be represented as a point (vector) in a k(k−1)/2-dimensional space. There are two main vectors in this space: the inter-patient matrix according to viral titer (blue dotted vector) and the inter-patient matrix according to expression of all genes (black vector). The expression of every gene is represented as a point (vector) in the same space: the inter-patient matrix according to the expression values of this gene. Some genes may be close to the viral-titer vector VT and/or to the all-gene expression vector MV. Inter-patient matrices according to expression of individual genes (G) at specific days (GD) are the dots of the figure. Points of high correlation with the VT vector (pink dots) are VT-linked genes. MV-linked genes (green dots) define the individual variability of patients according to their gene expression. The genes not linked with either of the two main vectors are in red.
One may interpret this figure as the view from above on
The correlation coefficients of all detected genes with VTcov (as representing the VT group of inter-patient matrices) are greater than 0.24 (
The permutation analysis (see Materials and Methods) was based on 1000 permutations of gene-day signals across all 22,000 genes at 5 days. The minimal radius of the sphere that covers 80% of the gene-days of the selected 37 genes in the four-dimensional multi-scaling space is 2.6. This radius corresponds to an FDR of 1%. The four-dimensional multi-scaling space was chosen because this number of PCA components covers more than 80% of the gene expression data variability.
Most of detected genes (
Gene name | Unigene ID | Genbank ID | Gene description |
BLZF1 | Hs.494326 | U79751 | Basic leucine zipper nuclear factor. |
DDX58 | Hs.438386 | NM_014314.1 | RIG-I (helicase) |
DNAPTP6 | Hs.230767 | AK002064.1 | DNA polymerase activated protein |
EIF3S6IP | Hs.446852 | AA862804 | Eukaryotic translation initiation factor 3, subunit 6 interacting protein. |
EPHB2 | Hs.523329 | AI038197 | Ephrin B/tyrosine kinase receptor family. |
FLJ20035 | Hs.481141 | AI093428 | Helicase |
FLJ38348 | Hs.546523 | AV755522 | Coiled-coil domain containing 75 (CCDC75): RNA binding protein |
G1P2 | Hs.458485 | NM_005101.1 | ISG15 ubiquitin-like modifier. |
G1P3 | Hs.287721 | NM_022873.1 | IFI6: inhibitor of apoptosis |
HERC5 | Hs.26663 | AA905126 | Ubiquitin ligase 5 |
HERC6 | Hs.435365 | NM_017912.1 | Ubiquitin ligase |
IFI27 | Hs.532634 | AA991433 | Unknown function |
IFI44 | Hs.82316 | BE049439 | hepatitis C associated microtubule protein. |
IFI44L | Hs.389724 | NM_006820.1 | histocompatibility 28 |
IFIH1 | Hs.389539 | NM_022168.1 | Helicase domain 1. |
IFIT1 | Hs.20315 | AA975472 | IFI56, Induced protein with tetratricopeptide repeats-1. |
IFIT3 | Hs.47338 | AA991285 | Interferon-induced protein with tetratricopeptide repeats 3, ISG60 |
IFIT5 | Hs.252839 | N47725 | IFI58, induced proteins with tetratricopeptide repeats. 5 |
IFRG28 | Hs.43388 | AA970212 | Receptor transporter 4 |
IRF7 | Hs.166120 | AA991566 | Interferon regulatory factor 7 |
LAMP3 | Hs.518448 | BX116004 | lysosomal associated membrane 3. |
MX1 | Hs.436836 | NM_002462.1 | GTP-binding protein |
MX2 | Hs.926 | AI015252 | Dynamin and GTPase family |
OAS2 | Hs.414332 | NM_016817.1 | 2′5′Oligo A synthetase 2, 69/71kD. |
OASL | Hs.118633 | CB125965 | 2′-5′ oligo A synthetase-like. |
PABPC4 | Hs.169900 | T05603 | Inducible Poly A binding protein |
PCTK3 | Hs.445402 | BC000281.1 | PCTAIRE protein kinase 3 |
PLSCR1 | Hs.130759 | AI825926 | Phospholipid scramblase: enhances anti-viral gene response. |
RPL22 | Hs.515329 | BE250348 | 60S ribosomal protein: EB virus binding protein |
RSAD2 | Hs.17518 | AI337069 | viperin, cig 2. |
SAMD4 | Hs.98259 | AB028976.1 | translation regulator. |
SN | Hs.31869 | N53555 | Sialoadhesin |
SNF7DC2 | Hs.415534 | NM_015961.1 | chromatin modifying protein 5 (CHMP5) |
TBX3 | Hs.129895 | NM_006187.1 | T-box transcription factor 3. |
TRIM5 | Hs.350517 | AF220028.1 | tripartite motif-containing 5 |
USP18 | Hs.38260 | AA976038 | Ubiquitin specific peptidase |
Visualization of the sources of variation among patients according to expression of the classifier genes was simplified through PCA, which reduced the dimensionality of the data to a relatively small number of components. PCA presentation is illustrated in
The first two principal components cover 63% of the data variability. The virtual border line between the distributions of Marked and Poor response patients gives three misclassified Marked patients and four misclassified Poor response patients.
Marked (pink) and Poor response (green) patients are mostly separated in two branches of the tree. There are three Marked and six Poor response patients misclassified.
The goal of this paper is to link heterogeneous sets of observations (gene expression and viral levels) without an a priori hypothesis. We developed a mathematical model that can be applied to any situation using gene expression and viral titer or any other attributes. Application of this approach to patients treated with interferon/ribavirin is based on the assumption that distances between sets of patient attributes reflect a biological demarcation of patients. As patient attributes we used the following measurements: (i) the viral titer profile during the first four weeks of treatment (viral levels on days 0, 1, 2, 7, 14, and 28), and (ii) the expression of all 22,000 genes of the Affymetrix array across the time course of the treatment.
The difference among patients according to all 22,000 genes on the array reflects the overall gene expression heterogeneity of hepatitis C patients undergoing interferon/ribavirin treatment. This could be due to differences in response, to genetic heterogeneity, or to differences in arrays and handling of RNA. The differences between patients according to any single gene reflect the gene-specific variability of the hepatitis C patients. The gene-specific variability that correlates with the viral titer variability is what is analyzed in this paper, and it may be independent of the above. No differences in gene expression were found in an analysis of RNA isolated from PBMC before treatment, when patients were divided into responders and non-responders. Thus the differences are a reflection of treatment rather than of the course of hepatitis C infection.
The gene-specific divergence of patients is checked against the overall patient divergence according to all genes. It appears that the process of virus clearance by interferon/ribavirin is not the major part of the overall gene expression pattern. Indeed, VT-linked genes (genes identified as viral clearance related) make a rather small input into the MV vector, which represents the pattern of overall gene expression variability of patients (
We examined the variability of patients according to changes in viral titer with time. This analysis demonstrates that the clustering of patients [
A very large number of genes are modified by the treatment
Among the genes we identified as important is IRF-7. This gene is required for the induction of type I interferons
Two genes that are down regulated correlate with the viral response: RPL22 (ribosomal protein L22), a component of the 60S ribosome, and eukaryotic translation initiation factor 3, subunit 6 interacting protein. Whether this decrease is involved in virus inhibition through modifying IRES-dependent translation of the HCV genome is speculative.
It is of interest that the interferon response appears to involve not only inducible genes but also down-regulation (repression) of a translation factor and a ribosomal protein.
Since this was an unsupervised analysis and did not take into account A/P (absent/present) filtering, some of the genes are possibly not involved in the anti-viral response, since they were not present in specific classes of patients as analysed with MAS5 software. These include BLZF1, EPHB2, PCTK3 and SNF7DC2
In summary we have identified key genes in the response to interferon/ribavirin in hepatitis C patients using a novel method of analysis. This is based on correlation with decrease in virus titer. This method has broad utility and can be used to analyze response to any group of factors influencing biological outcome.
Contains 166 gene-days of 37 viral titer linked genes (probe sets). At least 3 gene-days of each gene have PC2 component values greater than 5 (these gene-days are pink dots of the
(0.23 MB XLS)
We thank Mary Ferris for excellent record-keeping and for entering information into the portal at the Center for Medical Genetics. We thank Ron Jerome and Chunxiao Zhu for expert assistance with the microarray studies, which were carried out using the facilities of the Center for Medical Genomics at Indiana University School of Medicine. The Center for Medical Genomics is supported in part by grants from the Indiana 21st Century Research and Technology Fund and the Indiana Genomics Initiative (supported in part by the Lilly Endowment, Inc.). We also wish to thank Song Zhang from the data coordinating center, Pittsburgh, for statistical support and Jay H. Hoofnagle for help in editing the manuscript.