Application of Wavelet Transform for PDZ Domain Classification

PDZ domains have been identified as part of an array of signaling proteins that are often unrelated, except for the well-conserved structural PDZ domain they contain. These domains have been linked to many disease processes including common Avian influenza, as well as very rare conditions such as Fraser and Usher syndromes. Historically, based on the interactions and the nature of bonds they form, PDZ domains have most often been classified into one of three classes (class I, class II and others - class III), that is directly dependent on their binding partner. In this study, we report on three unique feature extraction approaches based on the bigram and trigram occurrence and existence rearrangements within the domain's primary amino acid sequences in assisting PDZ domain classification. Wavelet packet transform (WPT) and Shannon entropy denoted by wavelet entropy (WE) feature extraction methods were proposed. Using 115 unique human and mouse PDZ domains, the existence rearrangement approach yielded a high recognition rate (78.34%), which outperformed our occurrence rearrangements based method. The recognition rate was (81.41%) with validation technique. The method reported for PDZ domain classification from primary sequences proved to be an encouraging approach for obtaining consistent classification results. We anticipate that by increasing the database size, we can further improve feature extraction and correct classification.


Introduction
One of the most common and important protein domains that play an essential role in underlying cell signaling and organizing the post synaptic density region is represented by PDZ domain containing proteins [1][2][3]. Specifically, PDZ domain proteins have been implicated in functions such as maintainers of cell polarity, regulating the post-synaptic density by mediating protein-protein interactions, and in directing protein trafficking amongst other functions [4][5][6]. Furthermore, their function, or better yet malfunction, has been characterized in several disease states ranging from cystic fibrosis to cancer [7][8][9]. PDZ domains primary sequence is usually composed from 80 to 90 amino acids. In addition, most PDZ domains have a conserved 3D fold made of six β strands and two α helices. Almost exclusively, PDZ domains bind the C-terminal motifs of their ligand and target proteins, causing them to cluster [10]. However, occasional internal motifs binding PDZ domains have been observed [11,12]. The binding pocket of PDZ domains is formed by the conserved GLGF motif, present in most PDZ domains, which usually uses the last four C-terminal amino acids for target recognition. Historically, based on these C-terminal motifs binding the PDZ domain, classification of PDZ domains was proposed. The two most prominent PDZ domain classes are Class I and II. For Class I PDZ domains, a typical motif is S/T-X-F, where a hydrophobic amino acid (F) is at the C-terminus or position P0, followed by any amino acid (X) at P-1, and then Serine or Threonine at P-2. Alternatively, Class II PDZs would recognize ligand sequence with the F-X-Fmotif at the C-terminus.
In general, PDZ domain classification can be approached from either a structure or sequence based predictions [13][14][15][16]. Even though, structural based methods can provide an indepth understanding of PDZ binding, these methods can be plagued by low throughput and high cost [17][18][19][20]. Furthermore, they often provide a glimpse into a single or few PDZ domains, giving less focus to the role of point mutations in PDZ sequence, upstream and downstream sequences relative to the PDZ binding motif, as well as the exact PDZ domain cutoffs that affect ligand affinity [21][22][23]. More recently, several groups have started work on primary sequence based computational approaches for the determination and prediction of specificity of PDZ domains [18,24,25]. However, many of these approaches often group PDZ domains from several species into one category, or even worse, group amino acids into pseudo categories, thus giving less focus on features that are responsible for PDZ domain classifications. Most of the published works are focused on predicting various peptide interactions, with varying levels of success. Two of the most promising works in this field have stemmed from Mac-Beath lab [25,26]. This lab initially designed a multidomain selectivity model that implied large promiscuity of many PDZ domain classifications which are instead optimized for selectivity in order to reduce cross-reactivity throughout the mouse proteome. Although the initial correct prediction rate was only 48%, and correct negative prediction for those domains that do not interact at 88%, they built on this model as a precursor for their future work. However, they also focused less on multi species domain structure function relationships and feature extraction techniques. Ultimately, their approach using these [26] previously available results extended the model into a position-specific scoring matrix based on the primary sequence of both the 82 mouse PDZ domains and 93 peptides encoded in the mouse proteome [27]. Another report in the computational analysis using primary sequence data for the classification and prediction of binding interactions of PDZ domains was published by Kalyoncu et al [28]. In this work, they approached the challenge from an engineering perspective and focused more on feature vectors and extraction methods. For their study and analysis, they used both bigram and trigram frequency extracted from PDZ domain amino acid primary sequences. Next, they coupled with a random forest classifier for machine learning in order to build their model. Their results were impressive, showing correct classification accuracy at above 90% when using trigrams for classification. They obtained similar results in predicting PDZ domain interactions. To solve the issue with their imbalanced dataset, they used resampling with replacement, which proved to be optimal. However, a major drawback to their study is the grouping of amino acids into categories, diluting how specific amino acid can play a role in PDZ domain classification. To add to PDZ domain classification complexity, Tonikian et al. [29] developed a PDZ domain specificity map. They showed that PDZ domain classification can be categorized into 16 unique classes from a database of only 72 PDZ domain members. Furthermore, they were able to introduce the idea of domain mutational effects in the peptide binding prediction. Others like Shao et al. describe regression based methods for predicting PDZ domain interactions from primary sequences [30]. It is one of the few methods that try to computationally quantify the strength of peptide-PDZ binding, instead of generating a simple prediction of binding [15]. With this approach, they were able to generate area under curve (AUC) at 0.88 using 23 different PDZ domains. Likewise, Wiedemann et al. [31] sought after a quantitative model for three PDZ domains, trying to develop parameters for ligand affinity prediction within a rational design framework of high affinity binders. Similarly, Roberts et al [32] proposed a novel algorithm for predicting binding affinity from complexes of PDZ domains and small peptides.
The purpose of our work is mainly focused on PDZ domain classification aimed at understanding the critical features and sequences motifs responsible for correct classification. In this study, we propose a novel method for PDZ domain classification by using only the primary amino acid sequence information from 115 unique human and mouse PDZ domains. Our goal was to develop novel techniques and approaches based on computational methods that would allow for the best feature extraction responsible for PDZ domain classification relying on natural amino acid sequences and their properties. In specific, the basis of our feature extraction and encoding is represented in the 7 unique physiochemical characteristics of each of the 20 naturally occurring amino acids. Unlike other groups, we avoid using pseudo-amino acid designation in our classification and feature extraction [28]. We have used bigrams and trigrams of the 20 amino acids composing the PDZ domain primary sequences in combination with 7 physiochemical properties of amino acids. We focused on the key elements of occurrence and the existence, as these features are of critical importance to wavelet transformation and later Shannon entropy used in our classification. The reason for this was to determine which of these concepts could be more essential for classification, which we represented through spectrograms. Based on this, we have built our feature extraction methods.

Dataset
For PDZ domain class classification, we retrieved a total of 115 PDZ domains from human and/or mouse for classification and categorization. The beginning and end of PDZ domain designations were made in accordance with published UniProt annotations, or as they were reported in retrieved datasets [28,29,33]. The dataset consisted of 78Class I, and 38 Class II domains. The dataset was based on PDZ domains retrieved from PDZBase [33], and publications by Kalyoncu et al. [28], Tonikian et al. [29] and Stiffler et al. [26]. The full database containing the PDZ domains, their reported classification and primary sequence are provided in the supplementary information as an excel sheet (S1 Data). However, in brief, for the purpose of classification, PDZ domains are denoted as Class I and Class II in accordance with established categorizations; specifically, if the C terminal sequence of the PDZ domain interacting ligand had the [Ser/Thr-X-F-COOH] motif, it was denoted as a Class I PDZ domain and if the ligand motif consisted of [F-X-F-COOH] amino acids at the C-terminus, the PDZ domain was denoted as Class II, respectively. It is important to note that since many high-throughput PDZ screening reports have been published, often a multitude of binders have been identified for a single PDZ domain. Therefore, the basis for inclusion into a specific PDZ domain class was based on the major dominant ligands identified and the motifs they represent, following the classification as mentioned above. Consequently, even if an odd peptide binder not fitting the designed classification had been reported for a given PDZ domain, it was neglected.
C-terminal interactions [1,11]. Most empirical methods for determining the specificity and affinity of PDZ-ligand bind pairs are time consuming and expensive, making advanced computational method necessary.
The conserved structure of PDZ domains corresponds to frequency, position and type of amino acids present in the domain fold [1,11]. In this study, a novel classification of PDZ domains based on primary sequence is proposed. The suggested recognition approach is based on either frequency or existence of two consecutive amino acids (bigram) and three consecutive amino acids (trigram) in the primary sequence. The mentioned combinations can play a crucial role in the way PDZ domains assume their three dimensional fold, thus leading to their function and ligand interaction. Therefore, by identifying the main factors and amino acid combination present in a given primary PDZ domain sequence, we can imply the domain classification. By identifying the driving factors and features behind the PDZ domain structure, we can imply the classification of the individual PDZ domain as it has been previously suggested in studies that correlate differences between Class I and Class II PDZ domains [28]. Using these principles, PDZ domain primary sequences were rearranged based on the following three methods: 1) Occurrence rearrangement method (ORM): this method is based on bigram/trigram occurrence probability in the primary sequence. For instant, '' NGDLD. . .'' primary sequence will be rearranged as bigrams ''NG'', ''GD'', ''DL'', and ''LD'' or as trigrams ''NGD'', ''GDL'' and ''DLD''. The rearranged PDZ domain sequence will be prioritized by individual bigram/trigram occurrence with those bigrams/trigrams appearing the most given first priority. Next, these bigrams/trigrams will be derived with the corresponding physiochemical characteristics for each amino acid present (Table 1) multiplied by the number of the bi/trigram repetition. These characteristics will be used as the main features for each PDZ domain sequence. Fig 1 illustrates the feature extraction approach by bigram/trigram rearrangements with ORM.
2) Existence rearrangement method (ERM1) is structured such that the priority in PDZ domain sequence rearrangement is based on the existence or non-existence of an individual bigram/trigram from all possible combinations constructed with the 20 natural amino acids. Table 1. Physiochemical parameters of the twenty amino acids using the single letter codes. Therefore, we will screen a single PDZ domain primary sequence for the presence of an individual bi/trigram from all possible combinations, as shown in the Fig 2. We will utilize the same screening protocols using all possible trigram combinations for their presence in a given PDZ domain sequence. Again, as in the previous method, we substituted each amino acid in the bi/trigram with the seven physicochemical properties [34]. These physiochemical properties included isoelectric points, volumes of side chains, hydrophobicity, solvent accessible surface area, polarizability, and electron-ion interaction potential (EIIP) of amino acids that expresses the average energy states of all valence electrons, as depicted in Table 1.

Parameter/Amino Acid
3) Existence rearrangement method (ERM2) is same to ERM1; however, in this method, we perform an additional step of wavelet decomposition and entropy for each property from the seven physicochemical properties individually, but then combined the result into one vector.
In Fig 3, we illustrated spectrograms for bigram and trigram rearrangement features by existence rearrangement method, calculated for four PDZ domains from class1 and different four PDZ from class2. We can notice that features vary widely from class to class, but also have  correlation with other features in the same class. However, it is important to correlate back to the distinctive spectrograms fingerprints and feature back to the PDZ domain sequence in order to recognize the key amino acid contributing to PDZ classification.
For better representation of the extracted feature, we use WPT. Wavelet packet transform decomposes the signal into different sub signals of different bands of frequency [35,36] as: where c h is the WPT sub signal, h = 1,2,. . .H, where H is the number of wavelet packet transform sub signals dependent on the WPT level; in this study, six levels of (J) are used.
In general, entropy is a popular theory utilized in many scientific communities, primarily in the areas of signal and image processing. In the case of classical entropy-based criterion, precise representation for a given image is described by information-related properties. In these entropy-based methods, information related to the concentration of the image is retained, making it Application of Wavelet Transform for PDZ Domain Classification popular amongst image processing approaches [37]. Methods measuring entropy are enormously beneficial tools for quantifying the ordering of non-stationary processes. For dimensionality reduction into higher information concentration, Shannon entropy is calculated for each WPT sub signal, as follows: where s is the input signal, and s i are the coefficients of s in an orthonormal basis [38]. The obtained entropy results vector F is the feature vector output: Here N is the number of elements in the feature vector F which equals to the number of WPT sub signals. Therefore, in the case of bigram ERM2, we decrease the number of features obtained from 2800 to 128.

Results and Discussion
In order to find out the optimal PDZ domains classification approach, different objective methods are performed. The three suggested feature extraction methods ORM, ERM1, and ERM2 in conjunction with WE (ERWE2) are examined. For PDZ domain classification, support vector machine (SVM), probabilistic neural network (PNN), and K-nearest neighbors (KNN) are utilized. Bigram, trigram and fusion (both of bigram and trigram in the same feature vector) of each PDZ domain primary sequence are investigated. Table 2 shows the results of PDZ domains classification in terms of recognition rate (RR), which is the number of properly classified test samples over the total number of testing samples. The whole database comprising of 115 unique PDZ domains is utilized. We select 57% of PDZ domains from each class for training. Our RR results are calculated as an average of 1000 different combination training and testing sets. In addition, we perform confidence interval measurements representing recognition rates at each step and iteration as depicted in Fig 4. The role of confidence interval is to give an estimated range of possible values including an unknown population parameter, a range estimate for a given set from the sample data. A confidence interval for the recognition rates of the sets combination mean value μ and Standard deviation σ are based on samples size n, therefore, where C is the critical value for a 95% confidence interval (1.96) [39,40]. Confidence interval results were calculated and reported; specifically, the confidence interval states that 95% of the calculated recognition rate for each combination should be contained in this interval. A wider confidence interval would represent an improper dataset or a database that is unsuitable for performing feature extraction. In our study, all intervals calculated for each method are within a reasonable range. As observed, the best RR is achieved by means of ERWE2, particularly BERWE2. This is the best feature extraction method, showing the highest RR and using KNN as the PDZ domain classifier. Alternatively, bigram ERM1 (BERM1), trigram ERM1 (TERM1) and fusion ERM1 (FERM1) perform relatively well, making them promising approaches; however, they require further investigation. This leads to a very critical observation related to the feature extraction methods, and the way to make and prioritize bigram or trigram rearrangements. It is clear that the two methods based on the concept of the existence or presence of a given bigram or trigram in the primary sequence is better suited than the method based on occurrences or repetition of each bi/trigram. Furthermore, the results presented in Table 2 demonstrate that ERWE2 outperforms ERM1 in terms of RR, as well as in term of feature vector length; 2800 for BERM1 and 128 for BERWE2.
For more in-depth analysis of the results achieved in Table 2, BERWE2, TERWE2 and FERWE2 are tested and compared further. In order to compare the obtained results, we performed recognition sensitivity (RS) [40] analysis, which was defined as the following: where ρ XX is the correlation coefficient determined for two PDZ domain feature vectors belonging to the same class and ρ XY is the correlation coefficient determined for two PDZ domain feature vectors belonging to two different classes. The results representing the recognition sensitivity were determined for 30 combinations of different PDZ domains indicating that these three methods are similar and comparable. However, BERWE2 approach did provide slightly better results. Comparison between the results is illustrated in Fig 5. Based on the RR results in Table 2 and RS illustrated in Fig 5, we can conclude that the bigram method with ERWE2 is the better when compared to trigrams and fusion methods alone. In general, ERWE2 is shown to be a very promising PDZ domain feature extraction method.
Also examined was the effect of training data quantity on RR results. We use 50%, 57%, 63% and 68% of the database for training and the remaining for testing by means of BERWE2 (data not shown). Results using three classifiers were consistent and dependent on the number of training PDZ domains. KNN approach consistently gave the highest classification percentage.
For more detailed investigation of our results, we applied sensitivity, specificity and positive predictability measures for BERWE2. These measures are defined as following:  Table 3 tabulates these three measures, where RR is calculated as the mean value of the three measures. Three published feature extraction methods, amino acid composition (AAC) [41], pseudo amino acid composition (PseAAC) [42] and the auto covariance (AC) method [43], are used for comparison investigation. The results are based on the results published previously by [34].  Table 4 tabulates TPR, FPR and ACC, the results of the several feature extraction methods using primary sequence construction that based on different 20 amino acids arrangements. BERWE2 with KNN method is performed by providing 57% of the 115 PDZ domains for training. Although the FPR of the PseAAC (12.28%) is the lowest, its TPR (69.45%) is very low. Overall, the ACC rate (85.19%) obtained by of BERWE2 with KNN method is the highest. However, the TPR (67.95%) obtained by ACC method is the smallest, whereas, TPR (87.50) by BERWE2 with KNN is the best.
Cross-validation processes the recognition performance in a constant method by excluding a few instances (10% for 10-fold cross-validation) to be used as the test set during the training process [28,34]. To evaluate our algorithm through another validation technique, we conduct a 10 fold cross-validation test. The test results (with average 81.41) verify that our proposed method is stable and not sensitive to the validation technique.
The proposed method (i.e. BERWE2) was further compared to two more PDZ centric classification methods. The first method is the interaction classification of PDZ domain (ICP) reported by Kalyoncu et al. [28]. This method used the frequencies of consecutive trigram and bigram in the primary sequences as features multiplied by the corresponding physiochemical characteristics for each amino acid (Tab.5). The 20 amino acids were clustered into seven different classes according to their dipoles and volumes of the side chains which reflect their interaction specificity. Then for bigram 49, (7×7) features were obtained. The second method is the modified BIRS algorithm (MBIRS) reported by Nakariyakul et al. [34], which is a hybrid of the minimal redundancy-maximal-relevance (mRMR) algorithm, and the original best incremental ranked subset (BIRS) algorithm to select relevant and irredundant features for PDZ domain. Then, 215 features were obtained. KNN classifier was used for classification. The classifier was trained by using a 10-fold cross-validation by utilizing our database. The results tabulated in Table 5 show the recognition rates among above two methods compared with BERWE2. The results indicated that the proposed method is superior with a highest recognition rate of 81.41%. Critical feature and sequence motifs We wanted to better understand how the selected features help distinguish the two different PDZ domain classes and why they might be important to their specificity. The features extraction method that performed best was based on bigrams (BERWE2). As it was expected, we identify the bigrams that make up the conserved GLGF motif as an important feature. This site is located between βA-βB loop and αB helix and has been well established as one mode of peptide binding and selectivity for all PDZ domains. In order to identify other important motifs, such as the GLGF, we assessed the probability of each bigram appearance. We can see in Fig  6A and 6B a heat map representing the probability of each of bigram from all of the 400 possible combinations. Our feature extraction points to the high probability of bigrams features such as 'G-F' (stemming from the GLGF motif) and 'F-F' and in particular bigrams of IL, GL, GI, and GD for PDZ domains. Additionally, for Class I PDZ domains, it was also noted that the 'TH' bigram has a 48% probability of existence relative to Class II PDZ domains. This is expected as the Histidine often makes hydrogen bonding contacts with Serine or Threonine in position P-2 of the ligands of Class I PDZ domains. Additionally, the appearance of the LG, LQ and LK bigrams for Class II PDZ domains is noteworthy. We can see that these features are propagated and can be visualized with spectrograms ( Fig 6C) representative of Shannon entropy of WTP composed of the feature vector based on the seven physiochemical characteristics of amino acids. Even though, positional information is hidden in the spectrograms of Fig 6C we start notice some broad similarities for sequences belonging to Class I (6C left) versus Class II (6C right) PDZ domains. In our approach, the benefit of Shannon entropy of WTP is that it helps us perform feature selection leading to dimensional reduction, which in our case is down to 128 features. We hypothesized that if these determined critical feature motifs based on Fig 6 are indeed critical for classification, their effect would be more evident due to the reduction of feature space. We aimed to visualize these differences aided by spectrograms of Shannon entropy of WTP in frequency domain, as a result of mutating the critical feature motifs. In Fig 7A, the top panel represents two primary amino acids sequences (one for a Class I and one Class II PDZ domain) shown with bolded bigrams where manual mutations were performed (making the least pronounced changes such as G!A and I!L). As it can be observed in the bottom panel of Fig 7A, the spectrograms are significantly different when these mutations are introduced, highlighting their importance. It is almost as if there is a class switching between the two sequences when the mutations are introduced. This demonstrates that the spectrograms can be a used tool to quickly assess effects on PDZ domain classification. In Fig 7B and 7C we highlight some of the critical feature and sequence motifs and how they might play a role in PDZ domain interactions. Specifically, we looked at α1-syntrophin PDZ as an example of a Class I PDZ domain. We wanted to map where on the 3D structure the bigrams that we identified as critical motifs are located and correlate this to the published function of the PDZ domain. It was previously Comparison between two published methods [28,34] in terms of their recognition rate performed on the dataset used in our current study. doi:10.1371/journal.pone.0122873.t005 reported [44] that LG and GI bigrams corresponding to positions 12-14 in the present sequence undergo a chemical shift with peptide binding. Furthermore, these residues along with Ile15 and Leu68 comprise the binding pocket for the C-terminus of the peptide. In addition, the "TH" bigram is involved in hydrogen bonding with peptide ligands, providing additional recognition specificity. These important motifs are highlighted in red in the 3D structural rendering in the left panel of Fig 7C. Similarly, in the right panel of Fig 7C, the structure of PDZ2 of Harmonin is shown with the critical motifs highlighted in red. As was previously published [45], the GL and LG bigrams that comprise the binding pocket and the GLGF motif are key to ligand recognition, which we determined. Additionally, for Class II PDZ domains LG, LQ and LK, motifs are also critical. The last bigram LK corresponding to the position 71 and 72 in the present sequence are also involved in the stabilization of the ligand pocket and important to PDZ domain classification. However, our approach also highlights other bigrams that don't appear directly in the binding pocket. We hypothesize that these could be long-range networks

Summary
This work studied classification methods of PDZ domain by examining the primary sequences. Three feature extraction methods were proposed. For classification, three known methods were utilized. For better representation of the extracted feature, we used WPT to decompose the signal into different sub signals of different bands of frequency. Shannon entropy was calculated for each WPT sub signal. Thus, we could decrease the number of features obtained BERM1 down to 128. Three methods for feature extraction were tested, one based on the occurrence rearrangement method and the other methods based on the existence rearrangement method. Existence rearrangement was better than occurrence rearrangement in terms of Application of Wavelet Transform for PDZ Domain Classification recognition rate. We found that the classification performance of the 115 PDZ domains by ERM2 is better, though not significantly, than those of the two other feature extraction methods. The method ERWE2 based on KNN classifier successfully achieved the recognition rate of 78.34% and the ACC value of 81.92%. For comparison purposes, our proposed method outperformed the three other published methods under similar test parameters. Overall, these types of analyses of sequence space can be done relatively quickly and cheaply in comparison to structure based methods. Specifically, it is important to highlight the time and experimental cost associated with structure based methods allowing for a lesser number of perturbations that can be tested over similar time periods. Although PDZ domains show highly selective interaction pattern, our results indicates with high accuracy that our current classification approach is highly correlated to previously published reports and known classification pattern of PDZ domains. Specifically, our wavelet based approach was able to extract important sequences characteristics and features of PDZ domains. By testing several different methods, we show that the bigram based technique performs the best for classification; from there on, we focused on assessing how these feature can be used to obtain critical motifs. These critical motifs were indeed important and present in the conserved GLGF repeat. Also, we identified the 'TH' bigram and specifically Histidine as crucial, which has been shown to make hydrogen bonds with Serine/ Threonine of the ligands for Class I domains and for Class II the bigrams of LG, LQ, LK which are often located on αB and also one of the parts of binding pocket. We will concentrate more about the direct concept of position in the future work.
Supporting Information S1 Dataset. PDZ domain dataset. The first column lists PDZ domain containing proteins given as their common abbreviation. In parenthesis PDZ domain number and sequence in case multiple domains per protein are listed. Next column designates the class of PDZ domain (i.e. either Class 1 or 2). Next column represents the amino acid sequences of the PDZ construct used in our analysis. These boundaries are not necessarily the same as those defined by computational domain identification methods. The domain source (i.e. human or mouse) is designated in the final column. (XLSX)