Computational Screening and Molecular Dynamic Simulation of Breast Cancer Associated Deleterious Non-Synonymous Single Nucleotide Polymorphisms in TP53 Gene

Breast cancer is one of the most common cancers among the women around the world. Several genes are known to be responsible for conferring the susceptibility to breast cancer. Among them, TP53 is one of the major genetic risk factor which is known to be mutated in many of the breast tumor types. TP53 mutations in breast cancer are known to be related to a poor prognosis and chemo resistance. This renders them as a promising molecular target for the treatment of breast cancer. In this study, we present a computational based screening and molecular dynamic simulation of breast cancer associated deleterious non-synonymous single nucleotide polymorphisms in TP53. We have predicted three deleterious coding non-synonymous single nucleotide polymorphisms rs11540654 (R110P), rs17849781 (P278A) and rs28934874 (P151T) in TP53 with a phenotype in breast tumors using computational tools SIFT, Polyphen-2 and MutDB. We have performed molecular dynamics simulations to study the structural and dynamic effects of these TP53 mutations in comparison to the wild-type protein. Results from our simulations revealed a detailed consequence of the mutations on the p53 DNA-binding core domain that may provide insight for therapeutic approaches in breast cancer.


Introduction
One of the common malignancies and leading causes of cancer death faced by women around the world is breast cancer. Globally, the death rate of breast cancer has been rising around 2.5 lakhs in 1980 to 4.25 lakhs in 2010 [1]. Even in countries like China and India, its incidence is increased around 30% over the last decade whereas in Japan, Korea and Singapore it was doubled or even tripled [2]. According to National Cancer Institute (USA) statistics, estimated new cases of breast cancer in United States for the year 2013 is 232,340 in female and 2,240 in male whereas estimated breast cancer deaths are 39,620 in female and 410 in male. Some of the common risk factors for breast cancer can be broadly categorized into two types i.e., genetic and non genetic. Among these two risk factors, genetic risk factors constitute 5-10% of the breast cancer cases. Studies showed that fifty one variants in 40 genes are significantly associated with breast cancer risk and among them variants in six genes i.e., BRCA1, BRCA2, TP53, PTEN, STK11 and CDH1 show strong association whereas variants in four genes i.e., ATM, CHEK2, BRIP1, PALB2 show moderate association and approximately 20 variants in other genes show weak association [3,4].
Among the genes conferring high breast cancer risk, TP53 is known to be mutated in 30% of the breast cancers cases with a higher frequency in some tumor subtypes [5]. TP53 encodes p53, which is one of the most important tumor suppressor proteins in human cancers. p53 is a multi domain protein with 393 residues containing i) an acidic N-terminal transcription activation domain ; ii) a proline-rich regulatory domain (62-94); iii) a central sequence-specific well conserved DNA-binding domain (110-292); iv) an oligomerization domain (325-363) and v) a C-terminal domain containing multiple regulatory signals (363-393) [6]. It functions as a tetramer. It is reported that 75% of all the mutations in TP53 are missense, resulting in the substitution of a single amino acid with another and these mutations are predominantly distributed in the exons 4-9, encoding the DNA-binding domain of the protein [7].
Understanding how these single nucleotide polymorphisms (SNPs) affect the function of proteins is an important area of research and an efficient identification of such SNPs would be useful for SNP selection in genetic studies to understand the molecular basis of disease and predicting the effects of in vitro and in vivo mutagenesis experiments [8]. Among the SNPs, nonsynonymous coding SNPs (nSNPs) are the one which are located in the coding regions resulting in an amino acid variation in the protein products of genes. They are believed to have a high impact on the phenotype [9]. In the present study, we have focused on the nSNPs in the coding region of TP53 gene having an impact on breast cancer phenotype. We have explored the possible relationship between genetic mutation and phenotypic variation using different computational algorithm tools SIFT, PolyPhen-2 and Mutdb for prioritizing the deleterious breast cancer associated nSNPs from dbSNP datasets.
Molecular dynamics simulation (MDS) on the other hand is an important tool for understanding the effect of mutations on the protein structure, as it provides the information about the protein at atomic level on a reasonable time scale. Previously, several studies have utilized molecular dynamics to analyze the impact of mutations on TP53 [10][11][12][13][14]. In order to check (i) whether the three mutants (R110P, P151T and P278A) have an impact on the conformation in the functionally significant regions of the p53C? (ii) Whether the mutant structures are deviating from the native p53C? (iii) Whether the mutants are changing the flexibility of the p53C? we have performed molecular dynamics simulations of WT and three mutants. Since, a significant fraction of p53 appears in apo state at physiological temperature and insufficient zinc is linked to misfolding, particularly in tumorigenic mutations, we performed both apo (Zinc-free state) and holo simulations and are presented here. Results showed that three mutants R110P, P151T  and P278A are known to confer deleterious effect in the p53 DNAbinding core domain region (p53C). Overall, the objective of our study is to predict the breast cancer associated nSNPs and to further reveal the conformational flexibility of mutated apo and holo p53C through extensive molecular dynamic simulation.

Prediction of deleterious coding nSNPs
We used both SIFT and Polyphen-2 to screen out the deleterious coding nSNPs from other SNPs for TP53. 'Sorting Tolerant From Intolerant' (SIFT) (http://sift.jcvi.org/; access date: May 15, 2013) is a multi-step algorithm that uses a sequence homology based approach [16] to predict whether an amino acid substitution in a protein will affect the protein function or not. For a given protein sequence, SIFT chooses the related proteins and obtains an alignment of them with the query and assigns scores to each residue. Scores ranging from 0-0.05 are considered to be intolerant or deleterious amino acid substitutions, where as scores ranging from 0.05-1 are considered be tolerant or neutral [17,18]. We submitted our query in the form of dbSNP id to SIFT server. PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/; access date: May 18, 2013) [19] on the other hand is a web based server that predicts the functional significance of an allele replacement from its individual features by Naïve Bayes classifier. PolyPhen-2 prediction models were tested and trained using two pairs of datasets, one is HumanDiv compiled from all damaging alleles with known effects on the molecular function causing human Mendelian diseases, present in the UniProtKB database, together with differences between human proteins and their closely related mammalian homologs, assumed to be non-damaging and the other is HumVar, consisted of all human disease-causing mutations from UniProtKB, together with common human nsSNPs (MAF.1%) without annotated involvement in disease, which were treated as non-damaging. A mutation is appraised qualitatively, as benign, possibly or probably damaging based on pairs of false positive rate (FPR) thresholds, optimized separately for each model. Query was submitted in the form of dbSNP id to WHESS.db: a quick access to precomputed set of PolyPhen-2 predictions for whole human exome sequence space (WHESS).

Prediction of phenotypic consequence for deleterious coding nSNPs
The phenotypic consequences of deleterious coding nSNPs were predicted using MutDB (http://www.mutdb.org/cgi-bin/mutdb. pl; access date: May 20, 2013), a tool that integrates publicly available databases of human genetic variation with molecular features and clinical phenotype data [20]. Gene symbol ('TP53') was used as a search term.

Modeling nSNPs locations on protein structure and Molecular dynamics
To investigate the mechanism of structural consequences of the mutations on TP53 we performed molecular dynamics. Initial coordinates were extracted from the crystal structure of p53 core domain in the absence of DNA (PDB ID: 2ocj, chain A; resolution 2.05 Au) [21]. All water molecules were removed from the crystal structure and the mutants (MTs) R110P, P151T, P278A were created by replacing the wild-type (WT) protein residue with its polymorphic residue using PyMOL [22]. Molecular dynamic analysis was performed at 37uC (physiological temperature) and neutral pH using GROMACS 4.5.3 (http://www.gromacs.org/) [23][24][25]. The p53 core domain contains Zn 2+ that is essential for activity. A LIGPLOT [26] scheme of Zn 2+ interaction in the crystal structure of p53 core domain (PDB ID: 2ocj, chain A; resolution 2.05 Au) was shown in the Fig. 1 given below. Zn 2+ remains bound to p53 core domain at temperatures below 30uC and it rapidly dissociates at physiological temperature such that a significant fraction appears in the apo state [27]. Consequently, we focused on both apo and holo simulations of wild (WT) and mutant type (MT) p53 core domain and presented here. The system was solvated by adding explicit flexible SPC water [28] embedded in a cubic box and the walls were located $10 Å from all protein atoms. Cl 2 counter ions (5, 3, 3, 5, 3, 5, 2 and 4 for holo WT, apo WT, apo P151T, holo P151T, apo P278A, holo P278A, apo R110P and holo R110P respectively) were added to neutralize the total charge of the system. The box size was set to 4.833 nm64.027 nm64.794 nm with box vectors 7.3 nm67.3 nm67.3 nm and box angles 90 0 for each side. Each solvated structure was energy minimized for 50000 steps of steepest descent minimization terminating when maximum force is found smaller than 1000 KJ/mol 21 /nm 21 . After energy minimization, the system was subject to equilibration at constant temperature (300K) and pressure (1 bar) with a time step of 2 fs and non bonded pair list updated every five steps under the conditions of position restraints for heavy atoms and LINCS constraints [29] for all bonds. The temperature was kept constant using a Berendsen thermostat [30]. Electrostatic interactions were calculated using the particle mesh Ewald summation method [31]. Finally, eight (i.e., apo WT, holo WT, apo R110P, holo R110P, apo P151T, holo P151T, apo P278A and holo P278A respectively) 10 ns Molecular dynamics simulations (MDS) were performed.

Analysis of Molecular dynamics trajectories
Comparative analysis of structural deviations in native and mutant structure such as root mean-square deviation (RMSD), root mean-square fluctuation (RMSF), solvent-accessible surface area (SASA), secondary structure calculation etc., were computed using g_rms, g_rmsf, g_sas and g_gyrate built in functions of GROMACS package. The average number of protein-solvent intermolecular hydrogen bonds was computed and analyzed using g_hbond. A cutoff radius of 0.35 nm was employed between the donor and the acceptor. Density map was plotted using g_densmap whereas average values of simulation output data was plotted using g_analyze of GROMACS respectively. Graphs were plotted using GRACE software (http://plasma-gate. weizmann.ac.il/Grace/).

Principal component analysis
To analyze and visualize the overall motions in the simulations, essential dynamics (ED) was carried out according to the protocol in GROMACS software package [32]. Covariance analysis, also   Table 2. Prediction scores found to be functionally significant by Polyphen-2 server.   Covariance matrices of both WT and MT simulations were constructed using the Ca atoms trajectory as it has been shown that they contain all the information for reasonable description of the protein large concerted motion [32]. We used gromacs utilities g_covar and g_anaeig for analyzing the trajectories.

Results and Discussion
SNP data set from dbSNP dbSNP database contain a total of 14613 SNPs for TP53 gene, out of which 637 were found to be Human (active) SNPs (i.e., Active Human RS and not including those that have been merged). Among the 637 Human (active) SNPs, 100 were coding nSNPs, 31 were coding synonymous, 52 SNPs were in the mRNA 39 UTR region, 38 were in the mRNA 59 UTR region and 451 were in the intronic regions. It can be seen from the Fig. 2 that the vast majority of SNPs occur in the intronic region (70.8%) and more SNPs are nSNPs (15.6%) compared to synonymous SNPs (4.8%), SNPs occurring in the mRNA 39 UTR (8.1%) and 59 UTR (5.9%) regions. We selected coding nSNPs for our investigation.

Deleterious nSNPs by SIFT server
Among the 100 nSNPs, from dbSNP 18 were found to be deleterious, with a tolerance index score of less than or equal to 0.05. We observed that among 18 deleterious nSNPs, 11 had a highly deleterious tolerance index score of 0.00 using orthologues and homologues in the protein alignment. Remaining 7 deleterious nSNPs had a tolerance index score of 0.01, 0.02, 0.03, 0.04 and 0.05 using orthologues and homologues in the protein alignment respectively (Table 1). Among 18 nSNPs that are predicted to be deleterious, three nSNPs showed a nucleotide change of A/T, five showed a change of A/G, one showed a change of C/G, five showed a change of C/T, one showed a change of G/T, one showed a change of C/G/T, one showed a change of A/G/T and one showed a change of A/C/T respectively. Compared to other nucleotide changes C/T and A/T change occurred the maximum number of times. Amino acid change on the other hand was majorly found to be from special amino acids to polar amino acids with uncharged R groups. Among them eight showed a change at the region of Arginine residues, three showed a change at the region of Proline residues, two showed a change at the region of Cysteine and the remaining five showed changes at the regions of Serine, Glycine, Leucine, Methionine, Glutamic Acid (Table 1).

Damaged nSNPs by PolyPhen-2 Server
To predict the functional significance of an allele replacement, 100 nSNPs analyzed by SIFT were submitted to PolyPhen-2 server. PolyPhen-2 qualitatively predicts whether a mutation is benign, possibly damaging, or probably damaging using two Bayesian probabilistic models, HumDiv and HumVar. For Mendelian disease diagnostics, the HumVar model is recommended as it should distinguish mutations with drastic effects from normal human variation whereas the HumDiv model is recommended for identifying variants where even mildly deleterious alleles are treated as damaging [19]. Among 100 nSNPs from dbSNP submitted to Polyphen-2, 37 were found to be possibly damaging, or probably damaging by both HumDiv and HumVar predictions whereas 2 were predicted as possibly damaging by only HumDiv, 36 were predicted as benign by both HumDiv and Table 2. Cont.  Table 2. Only SNPs that are predicted as possibly damaging or probably damaging by both HumDiv and HumVar predictions were considered for our study.

Breast Cancer related mutations by Mutdb database
Results from both SIFT and Polyphen-2 analysis showed that among 100 nSNPs, only 15 SNPs were predicted to be deleterious or damaging on protein function. These 15 SNPs were submitted to Mutdb to confirm that they confer a breast cancer phenotype or not. Results showed that 3 SNP mutations i.e., rs11540654 (R110P), rs17849781 (P278A) and rs28934874 (P151T) are known to have a phenotype in Breast tumors (Table 3).

Molecular dynamics simulation studies
Results from the calculations of RMSD for backbone and Ca atoms, root mean square fluctuation (RMSF) for Ca atoms, radius of gyration (Rg) for Ca atoms and protein for apo and holo WT, R110P, P151T, P278A MDS were presented in the Table 4. To analyze the impact of MTs on the p53C, we have examined the RMSD values. The calculated RMSDs of the backbone atoms in apo and holo WT, R110P, P151T, P278A with respect to the starting structure during the 10-ns MDS as a function of time were plotted in the Fig. 3a, b. During the apo simulations, backbone RMSDs of the WT and MT structures showed a sharp increase in the first 3 ns followed by equilibrium around 6 ns and a sudden decrease around 7.5 ns for P151T, 9.5 ns for P278A and R110P (Fig. 3a) whereas in the case of holo simulations a different pattern was observed. A sharp increase is shown during the first 3.5 ns      Since, RMSD of the Ca atoms is a central origin to compute the protein system [33], we have calculated the respective Ca RMSDs for both apo and holo simulations and plotted in the Fig. 4, 5. During the apo simulations, Ca-RMSD of R110P showed a sharp increase in the initial 2.5 ns followed by equilibrium around 4 ns and a sudden decrease after 9 ns (Fig. 4a). However, P151T and P278A showed a different trend of Ca-RMSD, with an equilibrium around 4 ns and a sudden decrease around 7 ns for P151T (Fig. 4b) whereas an equilibrium around 6 ns and a sudden decrease around 9 ns for P278A (Fig. 4c). During holo simulations on the other hand, Ca-RMSD of R110P showed a less variation in the initial 2 ns followed by equilibrium around 4 ns and a sudden decrease after 5 ns (Fig. 5a). However, P151T and P278A showed a different trend of Ca-RMSD, with an equilibrium around 5 ns and a sudden decrease around 9.5 ns for P151T (Fig. 5b) whereas an equilibrium around 4 ns and a sudden increase around 7 ns for P278A (Fig. 5c). A comparison of average Ca-RMSD values showed the following order of structural deviations (Table 4): apo; R110P . WT . P151T . P278A, holo; R110P . P278A . WT . P151T. These results indicate that a greatest change was observed in the R110P compared to the other mutants in both apo and holo simulations.
In order to analyze the change in secondary structure patterns in WT and MTs, we applied the software tool DSSP (Database of Secondary Structure in Proteins) by Kabsch and Sander [34], which employs H-bonding patterns and various other geometrical features to assign secondary structure labels to the residues of a protein. We have plotted the secondary structure patterns between WT and MTs of both apo and holo simulations and also superimposed their respective structures at the beginning of the simulation and for specific time steps where the conformational drifts occurred at a higher range (Fig. 4,5). Analysis of time dependent secondary structure fluctuations through DSSP analysis showed a conformational drift from b-sheets to bend form between the residues 165-175 in R110P and a-helix to bend form for the 180 th residue in P151T and P278A during the apo simulations and a conformational drift from a-helix to bend for the 180 th residue in R110P, P151T and turn to bend form for the 130 th residue in P278A during the holo simulations (Fig. 4, 5). The conformational changes support our previous results obtained from RMSD analysis that major change occurred in the R110P (Fig. 2,4,5).
In order to understand how the mutants affect the dynamic behaviour of the residues and to examine the cause of conformational drifts observed in RMSD and secondary structure patterns, Ca-root mean square fluctuation (Ca-RMSF) of WT and MT amino acid residues were calculated and plotted in the Fig. 6a-d. Except P151T, in all cases the MT holo simulations had higher average Ca-RMSFs than the WT holo simulations (Fig. 6a) ( Table 4). In the apo and holo WT, more than 50% of the residues have RMSF values .0.1 nm (Table 5) indicating a higher level of fluctuation. During the holo simulations, all the MTs showed a larger percentage of residues (i.e., Ca residues and residues in the protein core comprising secondary structural elements) with RMSF values .0.1 nm whereas during apo simulations less percentage of residues showed RMSF values .0.1 nm compared to the WT. These results indicate that compared to apo, holo simulations are associated with increase in flexibilities in MTs. Among the holo MTs, R110P have a higher percentage of residues with RMSF .0.1 nm thus indicating a higher effect on the overall flexibility of the p53C. Table 6. Regions (a-helices, b-sheets, and loops) showing an average increase or decrease of RMSF in the MTs compared to the WT; RMSF of a particular structure is taken to be increased or decreased if there is an average change in RMSF of .0.03 nm in at least $50% of its residues.     Further, comparison of the regional flexibilities of the MTs showed a characteristic increase and decrease of the flexibility in certain loops, helices and b-sheets (Table 6). Strand S10 showed a consistently low fluctuation across all the simulations whereas higher fluctuations around the Zn 2+ binding residues, C176, H179, C238 and C242 were observed in apo WT and MTs compared to the holo WT simulations (Table 7). Loop regions on the other hand showed a larger fluctuation in both WT and MTs. Changes in the loops L3, L11, L12 and H2 helix contributed to a higher value of Ca-RMSF in the MTs with a larger portion of the L3 loop, S10 strand and H2 helix shifted far from its starting position (Fig. 6a). Compared to the WT simulations, the fluctuations observed were noticeably increased around the loops L3 and L7 in the R110P (Fig. 6b) whereas in P151T noticeable increase in fluctuations were observed around the loops L4 and L7 (Fig. 6c). P278A on the other hand showed an increase in fluctuations at the loops L3, L7, L11 and H2 helix regions (Fig. 6d). Results from the analysis of regional flexibilities indicate that all the three MTs R110P, P151T and P278A will affect the overall flexibility of p53C.
SASA is a geometric measure of the extent to which an amino acid interacts with its environment (the solvent and the protein core). It is naturally proportional to the degree to which an amino acid is exposed to these environments [35]. A rise or fall in the SASA designates the change in exposed amino acid residues thereby affecting the tertiary structure of a protein. Results from the analysis of SASA for apo and holo simulations showed a variation among the WT and MTs (Fig. 7a-b) . Rg on the other hand, is a parameter to describe the equilibrium conformation of a total system particularly in analyzing the proteins it is an indicative of the level of compaction in the structure, i.e. how the polypeptide chain is folded or unfolded [36]. Rg plot for Ca atoms and protein with time over the course of 10 ns simulations during apo and holo simulations is shown in the Fig. 8a-f and Fig. 9a-f. In the Rg plot for both Ca atoms and protein, we observed a notable fluctuation in MTs compared to the WT. Among the MTs, large deviations in Rg from the WT structure were observed during the apo and holo simulations of R110P (Table 4). These results indicate that compared to other MTs, p53C might have undergone a significant structural transition due to R110P.
Further, one of the factors that accounts for maintaining the stable conformation of a protein is hydrogen bonding. In order to understand the reason for flexibility among the MTs we have performed the NH bond analysis of WT and MTs during apo and holo simulations and plotted in the Fig. 10a-f. Results showed a notable difference in protein-solvent intermolecular hydrogen bond pattern between the WT and MTs. Among the MTs, a decrease in the average number of hydrogen bonds was observed in R110P compared to the WT during both apo and holo simulations (Fig. 10a,d) indicting that the occurrence of this mutation may lead to a more flexible conformation in the presence or absence of Zn 2+ at physiological conditions. However, the other MTs, P151T and P278A showed a decrease in average number of protein-solvent intermolecular hydrogen bonds only during holo simulations (Fig. 10e,f) indicating that these two mutants are flexible only in the presence of Zn 2+ at physiological conditions. Further, we have plotted the atom density distribution to check if the MTs have caused any major changes in the orientation and   atomic distribution. Results showed that atomic distribution of all the MTs were significantly differed from the WT in apo and holo simulations (Fig. 11a-h) indicating that all the MTs have a deleterious effect on the p53C.
Moreover, to identify the correlated motions of the WT and MTs during trajectory generated by apo and holo simulations and to support our MDS result we performed ED analysis. Since sum of the eigenvalues is a measure of the total motility in the system, we have plotted the eigenvalues against the corresponding eigenvector index for the first ten modes of motion at different trajectory lengths for WT and MTs during the apo and holo simulations in the Fig. 12 a, b. Only few eigenvectors showed large eigenvalues for both WT and MTs during the apo and holo simulations indicating that most of the internal motion of the protein is confined along small dimension in the essential subspace. The spectrum of eigenvalues in the Fig. 12 a,b indicated that major fluctuations of the system were confined to first two eigenvectors. Hence, the projection of trajectories of WT and MTs during the apo and holo simulations in the phase space along the first two principal components (PC1, PC2) at 300 K was plotted in the Fig. 13 a-f. Compared to apo simulation, during holo simulation MTs covered a larger region of phase space along PC1 and PC2 plane than WT. The overall flexibility of WT and MTs was calculated by the trace of the diagonalized covariance matrix of the Ca atomic positional fluctuations. Results from the trace of the covariance matrix (Table 4) confirmed the overall flexibility between MTs and WT at 300K during both apo and holo simulations. Overall the results reported from our study has confirmed that the substitution of Arginine at 110 th residue with Proline, Proline at 151 th residue with Threonine and Proline at 278 th residue with Alanine in the p53 core domain in the presence or absence of Zn 2+ has an altered structure stability and essential hydrogen bond formation and thus these three mutants might play a significant role in initiating the susceptibility towards breast cancer. Further, our analysis indicates that compared to other MTs P151T and P278A, amino acid substitution of Arginine at 110 th residue with Proline (R110P) exhibit a highly deleterious effect on the p53C.

Conclusion
The present study, will offer an in depth insight into the genotype-phenotype association of deleterious breast cancer associated nSNPs in TP53. Our study reports three mutations R110P, P151T and P278A associated with breast cancer phenotype and further the molecular dynamics revealed their respective major consequences on native p53 DNA-binding core domain. RMSD, Rg, SASA, NH bond and number density plots revealed their high deleterious nature on the p53 DNA-binding core domain in the presence and absence of Zn 2+ ion. Overall, the present computational approach will provide a comprehensive view on destabilizing mechanisms of p53 SNPs in breast cancer and it may serve as a useful model for predicting the effect of SNPs in other proteins. The results reported in this study elucidate the role of deleterious mutations in p53 which may provide a useful information for the design of p53 mutants based therapeutic strategies against breast cancer.