A Comprehensive In Silico Analysis on the Structural and Functional Impact of SNPs in the Congenital Heart Defects Associated with NKX2-5 Gene—A Molecular Dynamic Simulation Approach

Congenital heart defects (CHD) presented as structural defects in the heart and blood vessels during birth contribute an important cause of childhood morbidity and mortality worldwide. Many Single nucletotide polymorphisms (SNPs) in different genes have been associated with various types of congenital heart defects. NKX 2–5 gene is one among them, which encodes a homeobox-containing transcription factor that plays a crucial role during the initial phases of heart formation and development. Mutations in this gene could cause different types of congenital heart defects, including Atrial septal defect (ASD), Atrial ventricular block (AVB), Tetralogy of fallot and ventricular septal defect. This highlights the importance of studying the impact of different SNPs found within this gene that might cause structural and functional modification of its encoded protein. In this study, we retrieved SNPs from the database (dbSNP), followed by identification of potentially deleterious Non-synonymous single nucleotide polymorphisms (nsSNPs) and prediction of their effect on proteins by computational screening using SIFT and Polyphen. Furthermore, we have carried out molecular dynamic simulation (MDS) in order to uncover the SNPs that would cause the most structural damage to the protein altering its biological function. The most important SNP that was found using our approach was rs137852685 R161P, which was predicted to cause the most damage to the structural features of the protein. Mapping nsSNPs in genes such as NKX 2–5 would provide valuable information about individuals carrying these polymorphisms, where such variations could be used as diagnostic markers.


Introduction
Single nucleotide polymorphisms (SNPs) are the most common genetic variations in any population; they occur when a single nucleotide in the genome (A, T, C or G) is altered [1]. Even though many SNP's have no effect on the biological functions of the cell, some can predispose people to certain diseases, influence their immunological response to drugs and can be considered as biomarkers for disease susceptibility [2]. Importantly, nsSNPs result in changes to the amino acid sequence of proteins and have been reported to be responsible for about 50% of all known genetic variations that are linked to inherited diseases [3]. On the other hand, coding synonymous (sSNPs) and those occuring seen outside gene coding or promoter regions may also influence transcription factor binding and gene expression [4,5].
Even though the influence of genetics on susceptibility to cardiovascular diseases is well documented, delineation of the complete spectrum of the risk alleles was not achieved previously until the development of Next Generation Sequencing Techniques [6]. The recent advancement in high throughput sequencing has increased the rate at which DNA sequence variations are identified and subsequently deposited in genetic databases [7]. Moreover, because of the availability of such sequencing data from many databases, researchers have turned to bioinformatics tools to exploit these data and try to annotate and extract useful clinical information hidden within.
There is a great need for an effective and efficient method to filter out pathogenic and deleterious SNPs from the readily available pool of variant data, and to further explore the impact of those selected SNPs at the molecular level. Bioinformatics prediction tools can be used in a cost efficient manner for prioritizing SNPs of likely functional importance, enabling an investigation of the structural basis of disease-causing mutations likely to contribute to an individual's disease susceptibility [8][9][10]. Also, it is important to note that the success of association studies always depends on how a research group chooses a set of SNPs to be investigated [11,12]. Without a detailed In silico analysis of SNPs to be screened, based on the functional importance, a large number of samples might be needed to identify association at an acceptable level of statistical significance [13,14]. A comprehensive analysis of the functional and structural impact of SNPs in a gene will not only be supportive but also facilitate the discrimination between true associations and false positives as reported recently [8][9][10].
In recent years, there has been a considerable interest to study the genetic determinants of Congenital heart defects (CHD) as this was reported to be an important cause of childhood morbidity and mortality worldwide [6,[15][16][17]. The American heart association estimates that about 9 children out of 1000 are born with CHD [18]. Even though significant progress has been achieved in diagnostic and therapeutic strategies, the etiology of CHD is not well understood. But recent advancements in sequencing techniques have led to increasing evidences implicating a stronger role of single gene defect associated with various kinds of CHDs (16,17). NKX 2-5 is an important gene that has been linked to CHD [19]. This gene encodes a homeobox-containing transcription factor which regulates tissue-specific gene expression involved in early heart formation and development. Mutations in this gene can cause different forms of congenital heart defects, including Atrial septal defect (ASD), Atrial ventricular block (AVB), Tetralogy of fallot (TF), Ventricular septal defect (VSD) [20]. Although several studies have been reported about associated SNPs in the NKX2-5 gene [21,22], a molecular dynamics simulation analysis has not yet been performed to gain insight in to the impact of nsSNPs on the gene's structural integrity.
The correlation between MDS analysis and experimental work is well established in various independent studies elsewhere [23,24]. On the other hand, Kumar et.al, [25][26][27], Doss et.al, [28][29] and Rajendran et.al, [30][31] have recently demonstrated that computational SNP predictions are helpful in combination with MDS studies in finding out the most significant disease causing mutation, out of a pool of SNPs having predictable correlations with the wet lab experiments. Thus, MDS analysis was presumed to generate detailed information on the structural changes including residue fluctuations and conformational changes of protein resulting from a pathogenic mutation in perfect agreement with experimental methods. In the present study, a special focus has been given on the MDS analysis of potentially pathogenic SNPs towards revealing 1) the structural impact of the following SNP's-rs72554027 (F145S), rs397516909 (S146W), rs201582515 (V150I), rs137852685 (R161P), rs104893900 (T178M), rs3729938 (S179C), rs72554028 (Q181H) rs137852686 (K183E) and rs104893906 (R190C) 2) whether these mutations are causing any structural or conformational changes in the proteins, with reference to the wild type.
In summary, our work aimed to identify pathogenic variants and to find out which associated SNPs are causing the most damage to the structural features of the NKX2-5 protein, thus negatively affecting its biological functions.

Selection of SNP data set from db SNP
The dbSNP database reported a total of 1345 SNPs for the NKX2-5 gene, out of which 252 were found to be Human (active) SNPs which included: 65 coding nsSNPs, 29 coding synonymous, 78 in the mRNA 3' UTR region, 16 in the mRNA 5'UTR region and 64 in the intronic region (Table 1). It was noted that the vast majority of SNPs of this gene fall in the coding region, and the number of nsSNPs were higher than all the other types of SNPs. W this observation was different from the usual distribution of SNPs reported in many other genes (8,31). Since non-synonymous SNPs could alter the encoded amino acid and are likely to be disease causing, we selected them for further analysis.

Analysis of mutations in nsSNPs
Among the 65 nsSNPs selected for further analysis, 24 were predicted to be deleterious by the SIFT server with a tolerant index score less than or equal to 0.03 and the detailed result has been tabulated in S1 Table. To further enhance the significance of SIFT predictions, the nsSNPs that were submitted to SIFT were also submitted to Polyphen 2.0 server. HDiv data set in Polyphen identified 52 SNPs as probably damaging with a high confident prediction score close ranges from 0.99 to 1. Whereas, HvarPred data set reported 9 SNPs as probably damaging with a prediction score between 0.99 to 1. Predicted HDiv damaging probabilities for the SNPs can be seen in S1 Each amino acid has its own unique properties, such as molecular weight, size, polarity/charge and hydrophobicity values. In view of this fact, we have studied the effect of SNPs on the protein structure, but paid more attention to its polarity and hydrophobicity, since they are the major contributors for the protein's structure and functionality. The results of the detailed analysis for effect of mutations on protein's polarity and hydrophobicity are presented in Table 2. From the analysis it was revealed that, among the nine presently available mutations, the majority of mutations have shown polarity changes with the exception of V150I. On the other hand, potentially hydrogen bond forming residues, serine and threonine, were found to be mutated with non-hydrogen bond forming residues, whereas in some cases, non-hydrogen bond forming residues have been found to be mutated with hydrogen bond forming residues serine, cysteine and methionine. The above predictions were based on the standard evaluations that Glutamine (Q); Asparagine (N); Histidine (H); Serine (S); Threonine (T); Tyrosine (Y); Cysteine (C); Methionine (M); Tryptophan (W) were all potentially hydrogen bond-forming residues. Polarity changes in the protein, due to the above-mentioned mutations, might cause severe malfunctions in the protein with even minor changes in pH conditions. Moreover, the nine mutations that were mentioned above were found to be causing hydrophobicity changes in the protein along with mutations at F145S and R161P positions. Among these mutations, F145S was found to be altered from a hydrophobic residue to hydrophilic one, whereas R161Pwas observed to be converted from hydrophilic to hydrophobic residue are shown in Table 2.

MD simulations
In order to understand the conformational changes in the protein due to these mutations comparatively, we have carried out 10 nanoseconds of MDS for each protein. Various parameters have been analyzed throughout the simulation trajectory, especially Root mean square deviation (RMSD), Root mean square fluctuations (RMSF), energy parameters, total number of intra -molecular hydrogen bonds, radius of gyration and the secondary structure elements (SSE) of the protein with the time dependent function of MDS. All the statistically significant results of the simulations have been presented in the Table 3. Chemical timescale used in this present study is of enough for the side chain rearrangements in protein's native state and to facilitate various conformations; on the other hand, recent studies demonstrated thatthe

Protein structure conformational flexibility and stability analysis
RMSD values of the wild type and mutant proteins were analyzed in order to understand the effect of the mutations on the protein structure. We calculated the RMSD for all protein backbones during the MDS with reference to its initial structure. Fig 1 shows   the wild type protein was found to be stabilized at an RMSD value of around 5 Å, whereas most of the mutant protein RMSD values were lower than the wild type, but unstable with exception of S146W, T178M, R190C and T178M which were higher than the wild type protein's RMSD at certain points. Fig 1 clearly demonstrates that the mutations have considerable destabilizing effects on protein structure. We have also monitored the RMSF fluctuations of each residue in order to determine the mutation's effect on the protein residues dynamic behavior. From Fig 2, it can be inferred that residue level fluctuations for R190C were quite high when compared with native and other mutations up to 4 Å, for residues located between 140 and 150 positions, while the next highest peak was observed approximately at the residue position at 175. Analysis of the fluctuations revealed that the greatest degree of flexibility was shown by the R190C mutant protein.
When the energy parameters for the MD simulation trajectories of the wild type along with associated mutated proteins were analyzed, it was revealed that they were maintaining the total energy of the system in a range of -2834 and -2549Kcal/mol (Fig 3) whereas, for the wild type protein, -2676 Kcal/mol of energy was noted. Among the above mentioned energies,S146W mutated protein was showing the least possible minimized total energy of -2834 Kcal/mol, whereas the highest energy of -2549 Kcal/mol was found to be consumed by Q181H mutation. Also, as shown in Table 3, it is clear that as the mutation residue position increases, the total energy of the system increases as well, suggesting that the mutations occurring at the core of the protein structure would further minimize its overall energy, which is somehow contrary to the case of mutations occurring near the 3' end of the protein.
When Vander Waal's interaction contributions were calculated, it was found to be in a range of -98 and -115 Kcal/mol. The least energy of -115 Kcal/mol was observed to be for R190C mutation, whereas the highest -98 kcal/mol of energy was observed to be for the wild type protein. When Coulomb's energy of the protein was analyzed, S146W was found to be maintaining the highest average of -1446 kcal/mol of energy and Q181H was found to be maintaining the least average of -1366 kcal/mol, whereas, the wild type protein was found to be maintaining an average of -1393 kcal/mol of energy.
Next, we have analyzed the total number of intra-molecular hydrogen bonds present in the protein along with its reported mutations contributing for their stability (Fig 4). From Table 4 it can be inferred that the maximum number of 45 intra-molecular hydrogen bonds was observed for Q181H followed by 44 for wild type protein and the F145S mutation, followed by 43 for S179C, K183E and R190C mutations. The least number of 42 intra H-bonds were found for V150I, R161P and T178M mutations. This data suggests that V150I, R161P and T178M mutations have higher flexibility compared to other mutations and wild type protein.
Finally, we have analyzed the radius of gyration (ROG) for the native wild type protein along with its associated mutations contributing to their compactness (Fig 5). From Table 3 it can be inferred that R190C mutation possessing protein has the least compactness of its structure with 12.5 Å when considering the statistical data, however R161P mutation has shown the highest compactness in the protein with 11.8 Å, whereas the wild type protein has shown to be highly compacted with 11.6 Å. These data suggests that all the mutations have caused structural destabilizing effects leading to the loss of protein compactness when compared to the data with wild type ROG.
When the secondary structure elements (SSE) contributing to the overall protein stability were analyzed, it was observed that all the proteins were maintaining an average of around 64% SSE, mostly composed of helices rather than strands and loops, with an exception of R161P mutation. When we further investigated why R161P was showing less SSE percentage comparatively, it was revealed that residues present at 20, 40 and 50 series of position (Fig 6) in the wild type were found to be converting from strands to loops leading to the loss of SSE elements and probably causing damage to its overall stability and conformational status (Fig 7).

Selection of SNPs for in silico analysis
The dbSNP was used to retrieve SNPs reported in NKX 2-5 gene. (http://www.ncbi.nlm.nih. gov/snp, access date: June 30 2014) [33]. We have selected the nsSNPs s for further analysis.

Prediction of deleterious ns SNPs by In silico tools
Many computational biology tools are available that are used to determine whether an nsSNP is expected to be neutral or pathogenic, but we have utilized the most widely accepted computational algorithms, namely Sorting intolerant from tolerant (SIFT) and Polymorphism phenotyping (Polyphen). First, pathogenic nsSNPs in the NKX2-5 gene were identified using these tools and, then a MDS analysis was performed to investigate the possible effects that amino acid variants may have on NKX2-5 protein structure. These methods rely on either sequence information or structural information, or both, to predict the functional impact of SNPs. The methods used in this study were also the same as described in recent publications using default parameters [10].

Sorting tolerant from intolerant
3.3.1 SIFT (http://sift.jcvi.org/www/SIFT_dbSNP.html, access date: June 30 2014) is a multistep algorithm that predicts whether an amino acid substitution would affect protein function or not, by filtering out the mutations based on tolerance score [34]. SNP data was retrieved from dbSNP as rs IDs, then submitted to the SIFT server for analysis. The SIFT server reports the results as prediction scores between 0 and 1. A score within a range of 0-0.05 is considered to be deleterious or pathogenic, whereas scores above 0.05 to 1 are considered to be neutral or non-pathogenic.  3.3.2 Polyphen 2.0 (http://genetics.bwh.harvard.edu/pph2/dbsearch.shtml; access date July 2 2014) is also a high quality sequence alignment pipeline that uses a combination of sequence and structure based attributes where the effect of mutation is predicted by a native Bayesian classifier [35]. This program has been trained on two pairs of existing data sets, one was the HumDiv and other was Hum var present in the UnitProtKB database. The queries were submitted to the polyphen2 server in the form of dbSNP IDs.The output levels would be appraised qualitatively as benign, possibly damaging(less confident prediction) and probably damaging (more confident prediction) based on pairs of false positive rate (FPR) threshold, which is an added advantage with regards to prediction accuracy. If the Polyphen score is greater than or equals 0.5, it can be classified as deleterious, and if the score is less than 0.5 it can be regarded as benign or tolerated.  sequences (access date July 2 2014) and the PANTHER calculates the substitution positionspecific evolutionary conservation score (subPSEC) based on an alignment of evolutionary related proteins. PANTHER subPSEC score ranges from 0 to about -10. Protein sequences having a subPSEC score -3 are considered as most likely to be deleterious.

Modeling of the mutant protein structure
The three-dimensional structure of the protein is crucial in order to study its functionality, especially when trying to understand the effect of SNP's on its overall structure and function. We used the rcsb.org to identify the protein coded by NKX 2.5 (PDB ID 3RKQ), which is a 58 amino acids in length starting from the residue at position 138 and ending at 194. Even though there were about 49 reported mutations (Table 4) in NKX 2.5 gene, we have studied nine mutation effects due to lack of the 3D structural information of this protein in comparison of the wild type protein. The modeling of the mutated protein structures have been performed using the "mutate a residue" tool in the Schrödinger maestro v9.6 visualization program [38] using the wild type available 3D structure as a reference.

MD simulations in water
"Desmond v3.6 Package" [39-40] was used to run the molecular dynamic simulations. Predefined TIP3P water model [41] was used to simulate water molecules. Orthorhombic periodic boundary conditions were set up to specify the shape and size of the repeating unit buffered at 10 Å distances. In order to neutralize the system electrically, appropriate counter NA+/Cl-ions were added to balance the system charge and were placed randomly in the solvated system. After building the solvated system, we have performed minimization and relaxation of the protein/protein-ligand complex under NPT ensemble using default protocol of Desmond as followed elsewhere [42][43]; which includes a total of 9 stages among which there are 2 minimization and 4 short simulations (equilibration phase) steps are involved before starting the actual production time.    [44][45]. The temperature and pressure were kept at 300 K and 1 atmospheric pressure respectively using Nose-Hoover temperature coupling and isotropic scaling [46], the operation was followed by running the 10ns NPT production simulation each and saving the configurations thus obtained at 5ps intervals.

Analysis of molecular dynamics trajectory
The MDS trajectory files were analyzed by using simulation quality analysis (SQA), simulation event analysis (SEA) along with simulation interaction diagram (SID) programs available with the Desmond module for calculating the Energies, root-mean-square deviation (RMSD), rootmean square fluctuation (RMSF). Total intra molecular hydrogen bonds, Radius of Gyration along with secondary structure elements (SSE) of the protein contributing to the structural stability. SQA is a useful parameter to qualitatively validate the system stability throughout the simulated length of chemical time for the given temperature, pressure and volume of the total simulation box. Whereas, SEA is used towards analyzing each frame of simulated trajectory output and SID has been employed especially towards estimating the total SSE change in the protein structure during the simulation time.

Conclusions
The present study offers an insight into the genotype-phenotype association of deleterious nSNPs associated with NKX 2-5 gene. Our study identified nine pathogenic SNPs: rs72554027 (F145S), rs397516909 (S146W), rs201582515 (V150I), rs137852685 (R161P), rs104893900 (T178M), rs3729938 (S179C), rs72554028 (Q181H) rs137852686 (K183E) and rs104893906 (R190C) in NKX2-5. Furthermore, the MDS analysis revealed their respective major consequences on the native homeobox-containing transcription factor, which is encoded by NKX 2-5 gene. RMSD, RMSF, energy parameters, intra molecular H-Bonds, Radius of Gyration and secondary structure element plots revealed their plausible malfunctioning mechanism via their structural destabilization. Compared to the wild type, all the selected mutations were altering the structural behavior of the mutated protein, however R161P (rs137852685) predicted to cause the most damage to the protein's structural features followed by rs104893906 (R190C). Overall, the present computational approach will provide a comprehensive view on destabilizing mechanisms of homeobox-containing transcription factor SNPs in NKX2-5. The knowledge thus acquired through this present study is expected to help in prioritizing the important nsSNPs to be selected for further wet lab evaluations, and is of high value especially when designing huge population based genotyping studies.