Characterization of Serum Proteins Associated with IL28B Genotype among Patients with Chronic Hepatitis C

Introduction
Polymorphisms near the IL28B gene (e.g. rs12979860), which encodes interferon λ3, have recently been associated with both spontaneous clearance and treatment response to pegIFN/RBV in chronic hepatitis C (CHC) patients. The molecular consequences of this genetic variation are unknown. To gain further insight into IL28B function, we assessed the association of rs12979860 with expression of protein quantitative traits (pQTL analysis) generated using open-platform proteomics in serum from patients.

Methods
Forty-one patients with genotype 1 chronic hepatitis C infection from the Duke Liver Clinic were genotyped for rs12979860. Proteomic profiles were generated by LC-MS/MS analysis following immunodepletion of serum with MARS14 columns and trypsin digestion. Next, a latent factor model was used to classify peptides into metaproteins based on co-expression, using only those peptides with protein identifications. Metaproteins were then analyzed for association with IL28B genotype using one-way analysis of variance.

Results
There were a total of 4,186 peptides in the data set with positive identifications. These were matched with 253 proteins, of which 110 had two or more associated, identified peptides. The IL28B treatment response genotype (rs12979860_CC) was significantly associated with lower serum levels of corticosteroid binding globulin (CBG; p = 9.2×10⁻⁶), a major transport protein for glucocorticoids and progestins. Moreover, the CBG metaprotein was associated with treatment response (p = 0.0148), but this association was attenuated when both IL28B genotype and CBG were included in the model, suggesting that the CBG association may be independent of treatment response.

Conclusions
In this cohort of chronic hepatitis C patients, IL28B polymorphism was associated with serum levels of corticosteroid binding globulin, a major transporter of cortisol; however, CBG does not appear to mediate the association of IL28B with treatment response. Further investigation of this pathway is warranted to determine if it plays a role in other comorbidities of HCV infection.


LC-MS Operation
Each sample was analyzed by injecting approximately 1 µg of total digested protein onto a 75 µm × 250 mm BEH C18 column (Waters) and separating it using a gradient of 5 to 40% acetonitrile with 0.1% formic acid, at a flow rate of 0.3 µL/min, over 120 minutes on a nanoAcquity liquid chromatograph (Waters). Electrospray ionization was used to introduce the sample in real time to a Q-Tof Premier mass spectrometer (Waters), collecting data in MS^E mode with 0.9 second alternating scans between low CE (6 V) and a high CE ramp (15 V to 40 V). Data collection in this fashion supplies sufficient sampling across the chromatographic elution of a peptide for accurate quantitation, while also allowing acquisition of data used for the qualitative identifications. Technical reproducibility was assessed by running a subset of the samples in triplicate (n = 6) and also by analyzing a pooled sample at predefined intervals. In addition, a number of data-dependent LC-MS/MS analyses were performed using the same LC gradient and injection volumes; these runs provided column conditioning prior to quantitative analysis and, in some cases, complementary peptide identifications.

Preparation of data for analysis
To accomplish data alignment and feature quantitation across all biological samples, and thus form the matrix discussed in the statistical methods section below, we used the Rosetta Elucidator™ v3.3 software package (Rosetta Biosoftware) to import and align all MS^E and data-dependent acquisition (DDA) raw data files [1][2][3][4][5]. [...] or queued directly from within Elucidator (Mascot) to allow identification of many of the quantified features in the proteomic dataset. All database searches were performed with high mass accuracy on precursor and product ions (typically 20 ppm precursor and 0.04 Da product ion tolerance), with fixed carbamidomethylation (Cys), variable oxidation (Met) and variable deamidation (Asn and Gln). Annotation of the peptides is accomplished at an estimated 1% FDR using the Elucidator implementation of the PeptideProphet algorithm [6]. Visual scripting within Elucidator is used to extract feature intensities for those features which have quantitative values above 1000 counts (approximately the 10th percentile) in 50% of the samples.

The final file for statistical analysis is made up of a matrix of intensities, with the rows corresponding to isotope groups and the columns to technical observations (LC-MS analyses). An isotope group is defined as all of the peaks associated with a single peptide at a specific charge state and retention time. This level of quantitation combines peaks from the same peptide that differ in the number of carbon-13 atoms incorporated, but does not combine the same peptide measured at different charge states. The intensity of an isotope group for a given sample is the total volume under the feature peaks associated with that isotope group. This is monotonically related to the concentration of that isotope group in the original sample, and it is these intensities that we work with.
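The intensity filter described above can be sketched in a few lines. This is only an illustration and not the Elucidator visual script: the function name `filter_features`, the dictionary layout, and the reading of "in 50% of the samples" as "in at least 50% of the samples" are assumptions made here.

```python
def filter_features(intensities, min_count=1000.0, min_fraction=0.5):
    """Keep features whose intensity exceeds min_count in enough samples.

    intensities: dict mapping a feature name to a list of per-sample
    intensities (hypothetical layout; None marks a missing value).
    """
    kept = {}
    for name, row in intensities.items():
        # Count the samples in which this feature is above the threshold.
        n_above = sum(1 for v in row if v is not None and v > min_count)
        # Assumption: "in 50% of the samples" means at least that fraction.
        if n_above >= min_fraction * len(row):
            kept[name] = row
    return kept


example = {
    "feature_a": [1500.0, 2000.0, 10.0, 5.0],        # above threshold in 2 of 4
    "feature_b": [10.0, 20.0, 1500.0, 30.0],         # above threshold in 1 of 4
    "feature_c": [1001.0, 1001.0, 1001.0, 1001.0],   # above threshold in 4 of 4
}
print(sorted(filter_features(example)))
```

With the toy values shown, only the first and third features pass the filter.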

Metaprotein Statistical Model
To estimate metaprotein abundance, we build our model from pre-processed data (described in the previous section), with intensity estimates aggregated at the isotope group level.
We introduce the term metaprotein here to differentiate this approach from those in which peptide identifications lead to a fixed assignment of a particular peptide to a protein. In our modeling approach, we allow the possibility that an isotope group will be incorrectly identified, or be correctly identified, but have a pattern of expression that is distinct from the bulk of peptides from the corresponding protein. In practice, this new grouping approach often leads to metaproteins which may be dominated by isotope groups from a particular protein, but which contain isotope groups from other proteins as well.
Let $X$ be a $p \times n$-dimensional matrix consisting of measurements on $p$ isotope groups across $n$ samples.

$$X = \mu \mathbf{1}^{T} + A\Lambda + \varepsilon \qquad (1)$$

The $p$-dimensional vector $\mu$ has elements $\mu_i$ representing the mean expression of isotope group $i$, and $\mathbf{1}$ is an $n$-dimensional column vector of ones. The $k \times n$-dimensional matrix $\Lambda$ represents latent factors which will be learned from the data, and $A$ is a $p \times k$-dimensional matrix of factor loadings with elements $a_{i,j}$. The random variable $\varepsilon$ is a $p \times n$ matrix of idiosyncratic noise.
Our goal is to estimate relative protein concentration from this model using the latent factors in $\Lambda$. Recall that we have identifications for some subset of the isotope groups. With this in mind, suppose we identify each column of $A$ and the corresponding row of $\Lambda$ with one identified protein. If we set $a_{i,j} = 1$ when isotope group $i$ is from a peptide identified as coming from protein $j$ and $a_{i,j} = 0$ otherwise, then our model describes the expression pattern of each isotope group as a noisy approximation of the expression pattern of the protein, where the protein is known.
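The fixed 0/1 assignment of isotope groups to proteins can be illustrated with a short sketch. The function name `build_indicator_loadings`, the use of `None` for an unidentified isotope group, and the choice to give such groups an all-zero row are assumptions made here for illustration; they are not taken from the authors' implementation.

```python
def build_indicator_loadings(identifications):
    """Build a 0/1 loadings matrix from per-isotope-group protein identifications.

    identifications: list with one entry per isotope group, either a protein
    name or None when the group has no identification (assumed convention).
    Returns (loadings, proteins): loadings[i][j] is 1 when isotope group i is
    identified as proteins[j] and 0 otherwise.
    """
    proteins = sorted({prot for prot in identifications if prot is not None})
    column = {prot: j for j, prot in enumerate(proteins)}
    loadings = []
    for prot in identifications:
        row = [0] * len(proteins)
        if prot is not None:
            row[column[prot]] = 1  # exactly one non-zero entry per identified group
        loadings.append(row)
    return loadings, proteins


loadings, proteins = build_indicator_loadings(["CBG", "ALB", None, "CBG"])
print(proteins)
print(loadings)
```

Each identified isotope group contributes a row with a single one, so multiplying this matrix by the protein-level factors copies the protein's expression pattern to every isotope group assigned to it.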
Retaining, for the time being, the idea of fixing $A$ in this way, we wish to handle the possibility of changing sensitivity and changing protein concentration from sample to sample.
To account for this, we introduce an additional set of latent factors into equation 1.

$$X = \mu \mathbf{1}^{T} + A\Lambda + BH + \varepsilon \qquad (2)$$

We now introduce latent factors $H$ and loadings $B$, where $H$ is $l \times n$-dimensional and $B$ is $p \times l$-dimensional, which we use to account for systematic structure in the data that is sample specific. Because these features will span almost all peptides, we use a generic Gaussian prior for the elements of $B$.
This distribution represents our belief that these effects span all isotope groups, but with varying effect sizes. This prior also minimizes identifiability issues between $B$, which is not sparse, and $A$, which is very sparse with somewhat informative priors.
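Written element by element, the model with both sets of latent factors may be easier to read. The following is a sketch under assumed notation: the symbol names $x_{i,s}$, $\mu_i$, $a_{i,j}$, $\lambda_{j,s}$, $b_{i,m}$, $\eta_{m,s}$, $\varepsilon_{i,s}$ and the factor counts $k$ and $l$ are labels chosen here for illustration and may differ from the original typeset equations.

```latex
% Intensity of isotope group i in sample s (assumed notation):
x_{i,s} = \mu_i
        + \sum_{j=1}^{k} a_{i,j}\,\lambda_{j,s}   % protein-level factors, sparse loadings
        + \sum_{m=1}^{l} b_{i,m}\,\eta_{m,s}      % sample-specific factors, dense loadings
        + \varepsilon_{i,s}

% Under the fixed 0/1 assignment, with isotope group i assigned to protein j(i):
x_{i,s} = \mu_i + \lambda_{j(i),s} + \sum_{m=1}^{l} b_{i,m}\,\eta_{m,s} + \varepsilon_{i,s}
```

In the second form, each isotope group follows the profile of its assigned protein, shifted by its own mean and by the sample-wide effects that the dense loadings distribute across all isotope groups.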
We want to modify our prior on $A$ to allow for possible post-translational modifications and for misidentifications. With this in mind, we want to relax our strict assignment of zeros and ones in the loadings matrix $A$. Instead, our prior distribution for $A$ will reflect our level of certainty that we know which factor should represent the expression of this peptide. When we have an identification for peptide $i$ and have mapped that peptide to protein $j$, our prior distribution will reflect an increased certainty that $a_{i,j} \neq 0$.
We introduce a $p$-dimensional vector of latent variables $z$ ($z_i \in \{1, \dots, k\}$) which identifies the non-zero column of $A$ for each isotope group. When we have an identification that suggests that isotope group $i$ comes from protein $j$, our prior distribution for $z_i$ gives column $j$ a weight $\alpha$ and every other column a weight $\beta$, where $\alpha$ is substantially larger than $\beta$ to reflect our prior belief that $z_i = j$. We default to $\alpha = 500 \cdot \beta$, but have tried values from 100 through 1000, and these lead to only minor shifts in metaprotein membership. As the weight of this prior decreases ($\beta$ decreases while the ratio of $\alpha$ to $\beta$ stays the same), we are decreasing the importance of identification information and placing progressively more importance on correlation structure. We have found, for the hepatitis application below (omitted from this document; available upon request), that the association of metaproteins with outcome does not substantially change until we increase the weight of the identification data to very high levels. We find that using $\beta = 1$ leads to interpretable metaproteins without loss of association with the outcomes.

[...] the isotope group level, and as such they have between 20 and 40 thousand measurements per sample. While our sampling scheme is able to fit this data in just a few hours on a desktop, we expect that some sort of parallel processing will be desirable for data that is aggregated at the feature level. We have tested our model on multiple simulated data sets of various sizes (both sample size and number of isotope groups) to verify the accuracy of the parameter recovery even in the presence of intentionally mislabeled isotope groups.
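A simulated data set with intentionally mislabeled isotope groups, of the kind mentioned above, might be generated along the following lines. This is a minimal sketch and not the authors' simulation code: the function name `simulate_mislabeled`, all parameter names and default values, and the Gaussian choices for means, factors, and noise are assumptions made for illustration. The sample-specific factors are left out to keep the example short.

```python
import random


def simulate_mislabeled(n_proteins=5, groups_per_protein=8, n_samples=20,
                        mislabel_fraction=0.1, noise_sd=0.2, seed=0):
    """Simulate isotope-group intensities and corrupt some protein labels.

    Returns (X, true_labels, given_labels), where X[i][s] is the intensity of
    isotope group i in sample s. Requires n_proteins >= 2.
    """
    rng = random.Random(seed)
    # One latent expression profile per protein across the samples.
    factors = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
               for _ in range(n_proteins)]
    true_labels = [j for j in range(n_proteins) for _ in range(groups_per_protein)]
    X = []
    for j in true_labels:
        mean_i = rng.gauss(10.0, 1.0)  # isotope-group-specific mean expression
        X.append([mean_i + factors[j][s] + rng.gauss(0.0, noise_sd)
                  for s in range(n_samples)])
    # Reassign a fraction of the isotope groups to a different, wrong protein.
    n_wrong = round(mislabel_fraction * len(true_labels))
    given_labels = list(true_labels)
    for i in rng.sample(range(len(true_labels)), n_wrong):
        others = [j for j in range(n_proteins) if j != true_labels[i]]
        given_labels[i] = rng.choice(others)
    return X, true_labels, given_labels


X, true_labels, given_labels = simulate_mislabeled()
n_mislabeled = sum(1 for t, g in zip(true_labels, given_labels) if t != g)
print(len(X), len(X[0]), n_mislabeled)
```

A model fit would then be given `X` and `given_labels` only, and its recovered group membership compared against `true_labels`.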