NK, AWY, FMW, and DAL conceived and designed the experiments. NK performed the experiments and analyzed the data. NK, AWY, FMW, and DAL wrote the paper.

The authors have declared that no competing interests exist.

Cellular behavior in response to stimulatory cues is governed by information encoded within a complex intracellular signaling network. An understanding of how phenotype is determined requires the distributed characterization of signaling processes (e.g

Cells in the human body interpret extracellular information to “decide” on the execution of particular behaviors such as migration, proliferation, and differentiation. Many diseases, such as cancer, occur when these decision-making processes are compromised. The transfer of extracellular information to the intracellular space is often accomplished through receptor proteins whose chemical properties are altered as extracellular conditions change. These receptors transfer information in the intracellular space through the transfer of phosphate groups from one molecule to another. In particular, the transfer of phosphate groups to tyrosine sites is critical for cellular signaling. How the cell decides to execute a particular behavior on the basis of many changing phosphorylation events, however, is not understood. Here, we apply a computational approach to understand and predict how cells make the decision to migrate and proliferate as extracellular information changes. In particular, we wanted to understand the basis of decision-making processes in cells overexpressing a receptor protein called human epidermal growth factor receptor 2 (HER2). This receptor is overexpressed in ∼30% of breast cancer patients and correlates with poor prognosis. Taking advantage of a recently published dataset that quantified tyrosine phosphorylation events in HER2-overexpressing cells, we created models to understand and to predict HER2-mediated changes in migration and proliferation. The model identified small subsets of measured phosphorylation events that are predictive of changes in behavior with HER2 overexpression. Analysis of the phosphorylated subset of proteins implicated certain cellular processes as being crucial for cellular decision making, and suggested potential biomarkers and targets for therapeutic use in HER2-overexpressing cancers. 
Further application of our technique should aid in the understanding of cellular decision processes from large sets of cell signal and behavior data.

Recent advances in mass spectrometry have enabled the extensive characterization of intracellular signaling networks [

HER2, a member of the ErbB family of receptors, is overexpressed in ∼30% of breast cancer patients and correlates with poor prognosis and high invasiveness [

In this study, we significantly extend our previous analysis of HER2-mediated signaling and cell function by using a PLSR model to identify a reduced set of phosphorylation measurements and computational rules that together can be used to predict a priori cell migration and proliferation. We also characterize ligand-specific changes in cell signaling that govern migration and proliferation through the novel application of inner product analysis. Specifically, we derive lists of the most important phenotypically relevant proteins characterizing each of 30 possible transitions between our six cell conditions (EGF, HRG, and serum-free treatments in both low and high HER2-expressing cells). Inspection of the lists reveals both regulatory signaling cascades consistent with known HER2 biology and novel hypotheses. Using a conceptually similar procedure, we also derived lists of proteins that uniquely correlated with either migration or proliferation, postulating that these proteins serve as migration- or proliferation-specific signals in our system. Finally, we analyzed the PLSR model to derive a subset of phosphorylation sites most informative for the quantitative prediction of migration and proliferation. We identified nine phosphosites (signals) on six proteins from the original 62 phosphosites (signals), and showed that a model based on only those nine sites had a goodness of fit to experimental data similar to the full model. We identify the nine signals as a “network gauge,” a subset of molecules in the vast network of signaling molecules that together serve as a sensitive readout for cellular response. The nonobvious nature of the nine selected signals highlights the complexity of the network and the usefulness of the modeling approach. Analysis of the network gauge suggests that two elements of network architecture, endocytosis and phosphoinositide 3-kinase (PI3K)-related signaling, are highly informative loci for the control of proliferation and migration. 
Importantly, models constructed from both the full and network gauge signaling data that were trained only on data from a low HER2-expressing cell line predicted levels of migration and proliferation in a HER2-overexpressing cell line for both EGF and HRG treatments. This suggests that both cell types process information in the signaling network according to the same set of multilinear rules.

As previously described, we developed and employed a mass spectrometry approach to measure the effect of HER2 overexpression in 184A1 human mammary epithelial cells (HMEC) [

The title at the top of each entry indicates the phosphosite measured. Median normalized phosphorylation (see

Cellular migration and proliferation were measured under the same conditions described above [

Parental (black) and 24H (red) data shown for: (A) migration, as measured by a wound healing assay, and (B) proliferation, as measured by [3H] thymidine incorporation. Migration error bars represent the 95% confidence intervals for the fit of the slope using linear regression. Proliferation error bars represent the standard deviation from four different biological replicates for each condition. The data were obtained from [

A model based on PLSR was created to linearly regress signaling metrics onto cellular migration and proliferation metrics ([^{2}) of 0.89 (

Decomposition via PLSR of the signal (

(A) A scores plot identifies separation in signaling strategies associated with receptor overexpression or differential ligand treatment along two signaling axes.

(B) HRG and EGF treatment give rise to different sets of signals, and the difference is exaggerated in 24H cells.

(C) The linear superposition of the difference vector between 24H and parental serum-free conditions and each ligand's vector explains 24H + HRG signaling but not 24H + EGF signaling.

Although the scores plots allow us to visualize signaling changes, it is often of interest to relate observed differences back to original measured signaling metrics. We accomplish this by taking the inner product of any vector in the scores plot with the principal component axes, and thereby derive lists of proteins that most strongly correlate with the transition associated with the vector (see

Each arrow represents the difference vector between two cell states. The changes in the signaling network associated with the transition between the two states are calculated by taking the inner product of the difference vector with the weights vector (see

HER2 overexpression in the presence of EGF, as discussed above, produced interesting signal network changes and increased cell migration but did not affect cell proliferation (see

Signaling Metrics Most Positively Correlated with Changes in Cellular Response due to Increased HER2 Expression under EGF Stimulation

A list of proteins most negatively correlated with phenotype includes all measurements of EGFR phosphorylated at tyrosine 1173 as well as Src, which has been shown to phosphorylate EGFR tyrosine 1173 (

Signaling Metrics Most Negatively Correlated with Changes in Cellular Response due to Increased HER2 Expression under EGF Stimulation

Next, we sought to understand the effect of changing ligand under given receptor expression levels (

Signaling Metrics Most Positively Correlated with Changes in Cellular Response due to Varying Ligand Exposure in Cells with High HER2 Expression

Signaling Metrics Most Negatively Correlated with Changes in Cellular Response due to Varying Ligand Exposure in Cells with High HER2 Expression

As mentioned above, the signaling changes associated with increased HER2 expression in the presence of HRG are very similar to the same change under serum-free conditions. In both cases, HER2 overexpression leads to an increase in migration but not proliferation. Signaling metrics that positively correlate with this transition include p130Cas and FAK, indicating that increased migration may be mediated through this migration-associated pathway (

Signaling Metrics Most Positively Correlated with Changes in Cellular Response due to Increased HER2 Expression under HRG Stimulation

Signaling Metrics Most Negatively Correlated with Changes in Cellular Response due to Increased HER2 Expression under HRG Stimulation

A reduced model based on a fraction of the 62 originally measured phosphorylation sites would be useful for the future study of HER2 effects when full network measurement is not possible. Analysis of the model revealed nine phosphorylation sites on six proteins that recapitulated the performance of the full model. We refer to this subset of signals as a “network gauge”: a small number of phosphorylation sites that together can be interrogated to predict levels of proliferation and migration. To rank phosphorylation events according to their importance in the full model, we used a weighted sum of squares technique (see ^{2} = 0.95). A model based on the bottom-30 ranked metrics had a goodness of fit less than 0.30 to experimental data, suggesting that our ranking appropriately isolated highly informative sets of phosphorylation metrics.

Results from PLSR modeling show that a computational model based on nine phosphorylation sites on six proteins generates predicted values of cellular migration and proliferation that correlate strongly with those predicted by the full model and experimentally measured values. Experimental values of migration (A) or proliferation (B) are graphed along the ordinate, and full-model predictions (black) and reduced-model predictions (red) of migration (A) or proliferation (B) are given along the abscissa. R^{2} values indicate the data's goodness of fit to the line y = x. Experimental error bars denote 95% confidence intervals for migration (see

The somewhat surprising makeup of the network gauge prompted further investigation into why these six proteins were so informative. Shc's tyrosine site at 239/240 regulates c-myc activation; its tyrosine site at 317 regulates mitogen-activated protein kinase activation; and its phosphotyrosine-binding domain is known to associate with phosphatidylinositol-3,4,5-trisphosphate (PIP3), although it is not known how Shc's binding affinity for PIP3 changes with tyrosine phosphorylation [

Two interesting themes, then, emerge from the network gauge findings: (a) the inclusion of a group of molecules linked to endocytosis, namely TfR, ACK, and SHIP-2; and (b) the high proportion of molecules known to interact with PI3K or PIP3, namely Shc, SHIP-2, TfR, and SCF38. We will elaborate further on these themes in the Discussion section.

We investigated whether the full model and the network gauge trained only on parental cell data could predict migration and proliferation levels in response to HER2 overexpression. We trained models on parental serum-free, EGF, and HRG data, performed PLSR to calculate regression coefficients, and then used the measured 24H signal values in the regression equation to predict proliferation and migration. We found that both the full 62-signal model and the network gauge were able to predict proliferation and migration in 24H cells (R ≥ 0.99,

A PLSR model of nine signals constructed from parental cell data only was used to predict (A) proliferation and (B) migration levels in 24H cells and then compared with measured experimental values. Experimental error is denoted by 95% confidence intervals for cell migration (see

Identification of molecules highly, but uniquely, associated with either proliferation or migration is of value when considering strategies to inhibit one behavior without affecting the other. We previously reported the top 20 signals associated with migration and proliferation through an analysis of reduced-dimension PLSR plots [

Analysis Results of PLSR X-Y Loadings Plot Reveals the 20 Signaling Metrics Most Uniquely Correlated with Migration

Analysis Results of PLSR X-Y Loadings Plot Reveals the 20 Signaling Metrics Most Uniquely Correlated with Proliferation

Analysis Results of PLSR X-Y Loadings Plot Reveals the 20 Signaling Metrics Correlated Most Strongly with Both Migration and Proliferation

Analysis Results of PLSR X-Y Loadings Plot Reveals the 20 Least Correlated Signaling Metrics for Both Migration and Proliferation

We have demonstrated the use of PLSR to characterize the relative importance of tyrosine phosphorylation events for cell migration and proliferation in two human mammary epithelial cell lines with varying HER2 expression levels under both EGF and HRG treatment. In addition, we have identified an important subset of molecules from our original large signaling dataset to serve as a network gauge for the prediction of migration and proliferation (

A nine-signal PLSR model trained on parental data predicts migration and proliferation in 24H cells. The line thickness emerging from each protein indicates the relative average importance of each protein in the model for migration (black) or proliferation (gray). Prediction of migration or proliferation is a function of model importance and the amount of phosphorylation measured in either parental cells or 24H cells. The proteins constituting the model are transferrin receptor (TfR), annexin A2 (An A2), solute carrier protein 38 (SCF38), SH2-containing protein (Shc), SH2-containing inositol polyphosphate 5-phosphatase (SHIP-2), and activated cdc42-associated kinase 2 (ACK2). Previously documented associations with endocytosis (red circle), PIP3/PI3K signaling (blue circle), or both (yellow circle) are shown. The absence of association with either endocytosis or PIP3/PI3K signaling is denoted by a white circle (An A2).

Scores plot analysis (

The reduction of the mass spectrometry dataset to nine highly informative phosphorylation sites on six proteins suggests elements of network architecture that likely control migration and proliferation, namely endocytosis and signaling through PIP3- and PI3K-mediated pathways. Three of the six highly informative proteins, TfR, SHIP-2, and ACK, are all linked to endocytosis [

The PLSR model's ability to predict levels of proliferation and migration in 24H cells given only data from parental cells indicates that, although signals drastically change as we move from parental to 24H cells, the cell decides upon levels of migration and proliferation according to the same “rules.” These rules are nonintuitive but amount to the calculation of behavior according to the regression equation given by the PLSR model. Identification of conserved algorithms used to control behavior across cell types highlights the potential to predict a priori how changes in signaling will affect cell behavior and gives insight into conserved themes for cellular decision-making processes. Thus, the linear mapping of phospho-proteomic data onto cellular phenotype identified a key set of signals descriptive and predictive of phenotype in breast epithelial cells. It also identified subsets of signals that govern phenotype under either ligand or receptor perturbation, and in that process revealed new hypotheses about HER2-mediated signaling events. Of course, these hypotheses need to be tested through further focused molecular and biochemical work. Nevertheless, the modeling approach we introduce here is a powerful first step toward understanding signaling networks and the behaviors they control.

Samples were analyzed using mass spectrometry as previously described [

Migration was assayed as previously described [

Proliferation was assayed as previously described [

The PLSR model was generated using the SIMCA-P (10.0) software package as described elsewhere [

PLSR was used to solve the regression problem:
t_{i} is called the scores vector and represents one dimension in the orthogonal basis set for the column space, and p_{i} is called the loadings vector and represents one dimension in the orthogonal basis set for the row space. Application of NIPALS in this way is analogous to the singular value decomposition of the matrix such that:
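The displayed equations for this passage did not survive in the present text. As a hedged reminder of the standard textbook forms (not necessarily the authors' exact notation), the regression problem, the NIPALS decomposition into scores and loadings, and the singular value decomposition it is compared with can be written as follows. The symbols B, E, F, A, U, Σ, and V are assumptions introduced here for illustration.

```latex
% Regression of the response matrix Y on the signaling matrix X
% (B: regression coefficients, F: response residual; both assumed symbols)
Y = XB + F

% NIPALS decomposition of X into A components
% (t_i: scores vector, p_i: loadings vector, E: residual of X)
X = \sum_{i=1}^{A} t_i \, p_i^{T} + E

% Singular value decomposition of X, for comparison
X = U \Sigma V^{T} = \sum_{i} \sigma_i \, u_i \, v_i^{T}
```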

The NIPALS algorithm was implemented as described elsewhere [
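For readers who want to see the mechanics, the following is a minimal pure-Python sketch of a textbook NIPALS iteration for PLS with several responses. It is not the SIMCA-P implementation used in this work, and it assumes that X and Y have already been column-centered (and scaled, if desired). The function names `nipals_pls` and `pls_predict` are hypothetical.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _col(M, j):
    return [row[j] for row in M]

def nipals_pls(X, Y, n_components, tol=1e-12, max_iter=1000):
    """Fit PLS by NIPALS on centered X (n x m) and Y (n x q).

    Returns per-component lists of weights (W), X-loadings (P),
    Y-loadings (C), and scores (T).
    """
    E = [row[:] for row in X]  # X residual, deflated after each component
    F = [row[:] for row in Y]  # Y residual, deflated after each component
    n, m, q = len(E), len(E[0]), len(F[0])
    W, P, C, T = [], [], [], []
    for _component in range(n_components):
        u = _col(F, 0)  # initialise u with the first response column
        t_old = None
        for _iteration in range(max_iter):
            w = [_dot(_col(E, j), u) for j in range(m)]        # w ~ E^T u
            w_norm = math.sqrt(_dot(w, w))
            w = [v / w_norm for v in w]                        # unit length
            t = [_dot(row, w) for row in E]                    # t = E w
            tt = _dot(t, t)
            c = [_dot(_col(F, k), t) / tt for k in range(q)]   # c = F^T t / t^T t
            cc = _dot(c, c)
            u = [_dot(row, c) / cc for row in F]               # u = F c / c^T c
            if t_old is not None and max(
                abs(a - b) for a, b in zip(t, t_old)
            ) < tol:
                break
            t_old = t
        p = [_dot(_col(E, j), t) / tt for j in range(m)]       # p = E^T t / t^T t
        for i in range(n):                                     # remove this component
            E[i] = [E[i][j] - t[i] * p[j] for j in range(m)]
            F[i] = [F[i][k] - t[i] * c[k] for k in range(q)]
        W.append(w)
        P.append(p)
        C.append(c)
        T.append(t)
    return W, P, C, T

def pls_predict(x, W, P, C):
    """Predict centered responses for one centered sample x."""
    resid = list(x)
    y_hat = [0.0] * len(C[0])
    for w, p, c in zip(W, P, C):
        t = _dot(resid, w)                                  # score on this component
        resid = [r - t * pj for r, pj in zip(resid, p)]     # deflate the sample
        y_hat = [yh + t * ck for yh, ck in zip(y_hat, c)]   # add its contribution
    return y_hat
```

In this sketch a model fitted on one set of conditions can be applied to measurements from another by calling `pls_predict` on each new (identically centered) signal vector, which mirrors the train-on-parental, predict-on-24H test described in the Results.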

The PLS regression vector b' is defined as:

The vectors t, u, w, and c are associated with the maximum eigenvalues for various covariance matrices, and once defined, their contribution is removed from the data matrices before the next component is calculated. The weights w^{*} are calculated from w to relate to the original X-matrix (and not the residual as calculated above) as:
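The equations for the transformed weights and the regression vector are missing from the present text. In the standard NIPALS formulation they take the form below. This is a hedged reconstruction from the general PLS literature, with W, P, and C denoting the matrices whose columns are the per-component vectors w, p, and c, and B corresponding to the regression vector b' mentioned above. The authors' original notation may differ.

```latex
% Weights expressed relative to the original (undeflated) X-matrix
W^{*} = W \left( P^{T} W \right)^{-1}

% Scores computed directly from the original X-matrix
T = X W^{*}

% PLS regression coefficients, so that \hat{Y} = X B
B = W^{*} C^{T} = W \left( P^{T} W \right)^{-1} C^{T}
```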

Each model was tested for goodness of prediction (Q^{2}) using a leave-one-out cross-validation approach [

Q^{2} is then calculated as:
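The formula itself is not preserved here. A commonly used definition is Q^{2} = 1 − PRESS/SS, where PRESS is the sum of squared leave-one-out prediction errors and SS is the total sum of squares of the response about its mean. The sketch below assumes that definition and a single response variable. The helper names are hypothetical, and the straight-line model is only a stand-in so that the example runs without a PLSR implementation.

```python
def q2_leave_one_out(X, y, fit, predict):
    """Return Q^2 = 1 - PRESS / SS using leave-one-out cross-validation.

    fit(X_train, y_train) -> model; predict(model, x) -> predicted value.
    """
    n = len(y)
    press = 0.0
    for i in range(n):
        # Train on every sample except sample i, then predict sample i.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        press += (y[i] - predict(model, X[i])) ** 2
    y_mean = sum(y) / n
    ss_total = sum((v - y_mean) ** 2 for v in y)
    return 1.0 - press / ss_total

# Stand-in model (NOT PLSR): least-squares line through a single predictor.
def fit_line(X, y):
    xs = [row[0] for row in X]
    n = len(xs)
    mx, my = sum(xs) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, y))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x[0] + intercept
```

Because each prediction is made for a sample the model has not seen, Q^{2} can be much lower than the fitted R^{2}, and it can be negative when the model predicts worse than the mean.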

To evaluate the scores plot transitions (_{j,1:A}_{k,1:A}^{*}_{m,1:A}
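As an illustration of the inner product analysis, the sketch below ranks signaling metrics by the inner product of a scores-space difference vector (the transition between two cell conditions) with each metric's weight vector over the retained components. The function name, the metric labels, and all numbers are hypothetical placeholders, not values from the dataset, and the exact expression used by the authors is not recoverable from the present text.

```python
def rank_metrics_for_transition(t_from, t_to, weights):
    """Rank metrics by association with the transition t_from -> t_to.

    t_from, t_to : scores of the two conditions on the retained components
    weights      : dict mapping metric name -> weights on the same components
    Returns (name, inner product) pairs, most positive first.
    """
    diff = [b - a for a, b in zip(t_from, t_to)]  # difference vector
    projections = {
        name: sum(d * w for d, w in zip(diff, w_vec))
        for name, w_vec in weights.items()
    }
    return sorted(projections.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative two-component example with made-up numbers.
example_weights = {
    "metric_a": [0.5, 0.5],
    "metric_b": [-1.0, 0.0],
    "metric_c": [0.0, 0.1],
}
ranking = rank_metrics_for_transition([0.0, 0.0], [1.0, 2.0], example_weights)
```

Metrics at the top of the returned list are those most positively associated with the transition, and those at the bottom are most negatively associated, matching the two kinds of tables reported in the Results.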

To identify the signaling metrics most important for the overall model, a weighted sum of squares (also known as the variable importance for projection [VIP]) value for each variable (k) was calculated according to the following formula [_{T} is the total number of variables and the rest of the variables are as described above. Signals having multiple metrics that ranked in the top 30 highest VIP scores were chosen for the reduced model.
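The VIP formula is not preserved in the present text. The sketch below assumes the widely used definition in which the squared, length-normalized weight of variable k on each component is weighted by the response variance that component explains, summed over components, scaled by the total number of variables, and square-rooted. The function name and the input layout (one list per component) are assumptions made for illustration.

```python
import math

def vip_scores(W, T, C):
    """Variable importance for projection under the assumed standard formula.

    W[a][k] : weight of variable k on component a
    T[a]    : scores vector of component a
    C[a]    : Y-loadings of component a
    """
    n_vars = len(W[0])
    # Response sum of squares explained by each component: (t.t) * (c.c)
    ssy = [
        sum(t * t for t in t_a) * sum(c * c for c in c_a)
        for t_a, c_a in zip(T, C)
    ]
    total = sum(ssy)
    vips = []
    for k in range(n_vars):
        acc = 0.0
        for w_a, s in zip(W, ssy):
            # Normalise each weight vector to unit length before squaring.
            acc += s * w_a[k] ** 2 / sum(w * w for w in w_a)
        vips.append(math.sqrt(n_vars * acc / total))
    return vips
```

Under this definition the squared VIP values average to one across variables, so values above one are often read as above-average importance. Sorting the metrics by VIP and keeping the highest-ranked ones corresponds to the selection step described above.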

To evaluate the importance of a given signal for an output, the inner product of each metric with a given output was evaluated by:

The listed metrics are ranked from most positively correlated to negatively correlated for the indicated transition. A table of abbreviations including accession number, residue, and sequence is also provided in

(63 KB XLS)

Results from PLSR modeling show that a computational model based on experimental data generates predicted values of cellular migration and proliferation that correlate strongly with experimentally measured values. Experimental values of migration (A) or proliferation (B) are graphed along the ordinate, and model predictions of migration (A) or proliferation (B) are given along the abscissa. R^{2} values indicate the data's goodness of fit to the line y = x. Experimental error bars denote 95% confidence intervals for migration (see

(217 KB PDF)

A PLSR model constructed from parental cell data only was used to predict (A) proliferation and (B) migration levels in 24H cells and then compared with measured experimental values. Experimental error is denoted by 95% confidence intervals for cell migration (see

(205 KB PDF)

Analyzed results from PLSR-generated scores plots reveal the 20 signaling metrics most positively correlated with cell behavior as HER2 levels are increased.

(38 KB DOC)

(63 KB XLS)

Accession numbers and further protein information are listed in

The authors thank Kevin Janes and Gilbert Strang for excellent technical discussions, Arthur Goldsipe and Matt Lazzara for critical reviews of the manuscript, and Kelly Sherman for assistance with figure preparation.

activated cdc42-associated kinase

desmocollin-3

epidermal growth factor

epidermal growth factor receptor

ephrin A2 receptor

extracellular regulated kinase 1

focal adhesion kinase

glucocorticoid receptor DNA binding factor 1

heregulin

human epidermal growth factor receptor

human mammary epithelial cells

mitogen-activated protein extracellular kinase

nonlinear iterative partial least squares

phosphoinositide 3-kinase

partial least squares regression

phosphatidylinositol-3,4,5-trisphosphate

serine/threonine protein kinase PRP4 homolog

protein tyrosine phosphatase receptor type A

solute carrier protein 38

SH2-containing protein

SH2-containing inositol polyphosphate 5-phosphatase

human transferrin receptor

transforming growth factor alpha

variable importance for projection