Computational method for aromatase-related proteins using machine learning approach

Human aromatase is a microsomal cytochrome P450 enzyme that catalyzes the aromatization of androgens into estrogens during steroidogenesis. Third-generation aromatase inhibitors (AIs) have proven effective in breast cancer therapy; however, patients acquire resistance to current AIs. There is therefore a need to predict aromatase-related proteins in order to develop efficacious AIs. A machine learning method was established to identify aromatase-related proteins and was evaluated using five-fold cross validation. In this study, SVM-based models were built using amino acid composition, dipeptide composition, a hybrid of the two, and evolutionary profiles in the form of a position-specific scoring matrix (PSSM), with maximum accuracies of 87.42%, 84.05%, 85.12% and 92.02%, respectively. The developed method predicts aromatase-related proteins from the primary sequence with high accuracy. Prediction score graphs were generated from the known dataset to check the performance of the method. Based on this approach, a webserver for predicting aromatase-related proteins from primary sequence data was developed and implemented at https://bioinfo.imtech.res.in/servers/muthu/aromatase/home.html. We hope that the developed method will be useful for aromatase-related research.


Introduction
Cancer cases continue to rise globally despite advances in clinical therapy [1]. Breast cancer remains the most frequently diagnosed cancer in females, and metastasis remains the leading cause of death from this cancer [1]. Breast cancer incidence is greater in developed countries, while mortality is highest in developing countries [2]. About 30% of breast cancer patients develop recurring metastatic cancer despite recent advances in therapeutic regimens.
The biological actions of estrogen are mediated through the estrogen receptor (ER), and 70% of breast tumors express the ER and/or the progesterone receptor (PR). Thus, estrogen deprivation has been considered an important treatment for estrogen-dependent (ER+) breast cancers. In post-menopausal women, estradiol is produced in extragonadal sites; it therefore stops functioning as a circulating hormone and acts locally as a paracrine or intracrine factor [3,4]. These peripheral sites include the mesenchymal cells of adipose tissue, osteoblasts
Therefore, we have made a concerted effort to develop a method for identifying aromatase-related proteins. We developed a method for recognizing enzymes that will aid in the identification of new or unknown aromatase-related proteins, using amino acid composition (AAC), dipeptide composition (DPC), hybrid and position-specific scoring matrix (PSSM) models.

Machine learning based support vector machine (SVM)
The method was constructed with a machine learning based support vector machine (SVM), using amino acid composition (AAC), dipeptide composition (DPC), PSSM profile and hybrid approaches. SVM-based prediction is often used to handle large amounts of data, and it has been demonstrated to perform well in a number of biological data processing applications such as classification and the identification of protein function and type [29][30][31]. In this study, we used SVM classifiers and assessed their performance with five-fold cross validation [32][33][34]. The performance of the resulting models was assessed using the original and additional protein datasets. To avoid bias in the outcome, all models were run with the same number of negative sequences. Negative sequences were picked at random from the UniProt database to match the size of the aromatase dataset. The performance of the SVM models was tested using known positive and negative sequence data. A blank dataset was also used to test the generated models, which successfully recognized the data.

Generation of survival curves
Kaplan-Meier (KM) plotter is a web-based survival analysis tool that evaluates the correlation between gene expression (mRNA, miRNA, protein) and survival in more than 30,000 samples across tumor types. Its databases are sourced from GEO, EGA and TCGA, and the plotter supports meta-analysis-based discovery and validation of survival biomarkers for cancer research [35]. The KM plotter tool (http://kmplot.com/analysis/) was used to determine the prognostic value of aromatase (CYP19A1) mRNA expression in various cancers, using Pan-cancer RNA-seq data and correlating expression with overall survival (OS) and relapse-free survival (RFS) [36] over a follow-up threshold of 240 months. For mRNA expression analysis, samples were split into high and low expression groups at the median expression of aromatase. The median was chosen over the lower quartile, lower tertile, upper tertile and upper quartile because it gives almost equal sample numbers in the two groups and hence less bias. The hazard ratio (HR), 95% confidence intervals and logrank p value for each survival curve were provided by the KM plotter website, and a p value of < 0.05 was considered statistically significant.
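The median split and the survival estimate described above can be illustrated with a minimal sketch. This is not the KM plotter implementation; the function names (split_by_median, censor_at, kaplan_meier) and the toy data are hypothetical, and the code only shows the standard Kaplan-Meier product-limit estimate applied to one group.

```python
from statistics import median


def split_by_median(expression):
    """Label each sample 'high' or 'low' relative to the cohort median."""
    cutoff = median(expression)
    return ["high" if value > cutoff else "low" for value in expression]


def censor_at(times, events, limit=240):
    """Apply a follow-up threshold (months): later events become censored at the limit."""
    capped_times = [min(t, limit) for t in times]
    capped_events = [e if t <= limit else 0 for t, e in zip(times, events)]
    return capped_times, capped_events


def kaplan_meier(times, events):
    """Product-limit estimate. events: 1 = event observed, 0 = censored.

    Returns a list of (time, survival probability) at each event time.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        observed = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        if observed:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
    return curve


# Toy example (hypothetical values, not data from this study)
groups = split_by_median([1.0, 5.0, 2.0, 8.0])
curve = kaplan_meier([5, 10, 10, 20], [1, 0, 1, 1])
```

In practice each expression group gets its own curve, and the two curves are compared with a logrank test, which this sketch does not implement.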

Datasets for SVM
Aromatase data were taken from the UniProt/Swiss-Prot database [37]. The keyword search returned 9836 protein sequences, of which 257 were reviewed. We used only the reviewed sequences, retrieved on 10th May 2021, and removed all sequences annotated or labeled as "fragments," "isoforms," "potentials," "similarity," or "probables" to generate a high-quality dataset and reduce prediction error. To avoid redundancy and the incorporation of variants, this dataset was then processed with the CD-HIT tool, which removed sequences that were more than 90% identical to any other sequence in the dataset [38]. The final positive dataset contained 191 of the 257 aromatase sequences; details are provided in the S1 File. The negative dataset contained 191 non-aromatase sequences, unrelated to aromatase, that were picked at random from the results of a UniProt/Swiss-Prot keyword search for "regulatory proteins". A web server for predicting aromatase-related proteins from primary sequence data was developed and implemented at https://bioinfo.imtech.res.in/servers/muthu/aromatase/home.html.
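The annotation filter and the balanced random selection of negatives can be sketched as follows. The term list mirrors the labels quoted above, but the matching rule (case-insensitive substring search on a description string), the function names and the fixed seed are assumptions made for illustration; the redundancy step performed with CD-HIT is not reproduced here.

```python
import random

# Labels named in the text, reduced to stems for substring matching (assumption)
EXCLUDED_TERMS = ("fragment", "isoform", "potential", "similarity", "probable")


def keep_record(description):
    """Return True if the description carries none of the excluded labels."""
    text = description.lower()
    return not any(term in text for term in EXCLUDED_TERMS)


def sample_negatives(candidates, n_positive, seed=0):
    """Draw as many negatives as there are positives, without replacement."""
    return random.Random(seed).sample(candidates, n_positive)


# Hypothetical usage: 191 negatives to match 191 positives
negatives = sample_negatives([f"NEG_{i}" for i in range(1000)], 191)
```

Matching the negative set size to the positive set keeps the two classes balanced, which is why accuracy is a meaningful summary for these models.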

Amino acid and dipeptide composition
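The amino acid composition of a protein refers to the percentage of each amino acid in the protein [21,39]. SVM light requires its input to be encoded as fixed-length vectors. The percentage of each of the 20 natural amino acids was calculated using the following equation:

$$\text{Percent of amino acid } i = \frac{\text{total number of amino acid } i}{\text{total number of amino acids in the protein}} \times 100$$

In a similar manner, dipeptide composition was calculated as a vector with a constant length of 400 (20 x 20) dimensions [40]. To determine the fraction of each dipeptide, the following equation was used:

$$\text{Fraction of dipeptide } i = \frac{\text{total number of dipeptide } i}{\text{total number of all dipeptides in the protein}}$$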
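As a minimal sketch of these two encodings, the code below computes a 20-value percentage vector and a 400-value fraction vector from a sequence string. The function names, the alphabetical residue order and the decision to skip non-standard residues (such as X) are assumptions for illustration and may differ from the programs used in the study.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical by code
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def amino_acid_composition(sequence):
    """Percentage of each standard amino acid (vector of length 20)."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    return [100.0 * counts[a] / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Fraction of each overlapping dipeptide (vector of length 400)."""
    seq = sequence.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Keep only pairs made of two standard residues
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not valid:
        raise ValueError("sequence contains no valid dipeptides")
    counts = Counter(valid)
    return [counts[d] / len(valid) for d in DIPEPTIDES]


# Toy sequence: A and C each make up 50%; dipeptides are AC, CA, AC
aac = amino_acid_composition("ACAC")
dpc = dipeptide_composition("ACAC")
```

A sequence of length N yields N - 1 overlapping dipeptides, so the 400 fractions sum to 1 while the 20 percentages sum to 100.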

PSSM profile
The GPSR software was used to create the PSSM profile against the nr (non-redundant) BLAST database. We used the seq2pssm_imp, pssm_n2, pssm_comp and col2svm programs in the GPSR package to run PSI-BLAST searches against the nr database over different numbers of iterations with a cut-off e-value of 0.001, to normalize the PSSM profile, and to produce the SVM light input format (i.e. a composition vector of 400 values) [26]. Finally, SVM models were built with various parameters and optimized, and the best model was deployed in the prediction server. For normalization, the following formula was used:
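The sketch below shows one common way to turn an L x 20 PSSM into a 400-value composition vector. The normalization formula used in the study is not reproduced in this text, so the logistic function 1 / (1 + e^(-x)) used here is an assumption, as are the division by sequence length, the function names and the toy input. It is not the GPSR code.

```python
import math

# Column order used in PSI-BLAST ASCII PSSM output
PSSM_COLUMNS = "ARNDCQEGHILKMFPSTWYV"


def logistic(score):
    """Map a raw PSSM score into the range (0, 1). Assumed normalization."""
    return 1.0 / (1.0 + math.exp(-score))


def pssm_composition(sequence, pssm):
    """Collapse an L x 20 PSSM into 400 values.

    Rows are grouped by the residue type at each position, normalized scores
    are summed per group, and the sums are divided by the sequence length.
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    sums = {residue: [0.0] * 20 for residue in PSSM_COLUMNS}
    for residue, row in zip(sequence.upper(), pssm):
        if residue in sums:  # skip non-standard residues
            for j, score in enumerate(row):
                sums[residue][j] += logistic(score)
    length = len(sequence)
    return [value / length for residue in PSSM_COLUMNS for value in sums[residue]]


# Toy input: two alanines with all-zero scores (hypothetical, not real PSI-BLAST output)
vector = pssm_composition("AA", [[0] * 20, [0] * 20])
```

Whatever normalization is chosen, the result has the same fixed length of 400 for every protein, which is what allows proteins of different lengths to share one SVM input space.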

Hybrid approach
A hybrid technique was developed in order to improve prediction accuracy. A hybrid model is defined as the combination of two or more profiles. The hybrid models were developed using vectors of length 420, comprising 20 values from AAC and 400 values from DPC. The col_add function in the GPSR 1.0 package was used to merge the AAC and DPC profiles into a hybrid profile [41,42].
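Merging the two profiles amounts to concatenating the vectors. The sketch below also writes the result as one line in the sparse SVM light input format (a target label followed by index:value pairs with indices starting at 1). The function names are hypothetical and this is not the col_add program itself.

```python
def hybrid_vector(aac, dpc):
    """Concatenate a 20-value AAC vector and a 400-value DPC vector (420 values)."""
    if len(aac) != 20 or len(dpc) != 400:
        raise ValueError("expected 20 AAC values and 400 DPC values")
    return list(aac) + list(dpc)


def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...', omitting zero values."""
    pairs = " ".join(
        f"{index}:{value:.6g}"
        for index, value in enumerate(features, start=1)
        if value != 0
    )
    return f"{label:+d} {pairs}".rstrip()


# Toy usage with placeholder values (hypothetical)
hybrid = hybrid_vector([5.0] * 20, [0.0025] * 400)
line = to_svmlight_line(+1, [0.5, 0, 0.25])
```

Because AAC is expressed as a percentage and DPC as a fraction, the two blocks of a hybrid vector sit on different scales; whether any rescaling was applied before training is not stated in the text.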

Evaluation and performance
A five-fold cross validation approach was used to evaluate performance. We started with an aromatase positive dataset and a non-aromatase negative dataset. The positive and negative datasets were each randomly divided into five equal groups. To run the SVM, four sets were used for training and the remaining set for testing. This process was performed five times, so that each subset was used for testing exactly once [22,43]. This was done for all approaches, including amino acid, dipeptide, PSSM and hybrid. The average of the test scores from all five sets was used to compute the final performance. The performance of the classifiers was assessed using sensitivity, specificity, accuracy and the Matthews correlation coefficient (MCC). These measurements were calculated using the following standard formulas:

$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100$$

$$\text{Specificity} = \frac{TN}{TN + FP} \times 100$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
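The splitting and scoring steps can be sketched as follows. The fold assignment (shuffle, then deal indices round-robin) and the fixed seed are assumptions for illustration; the metrics follow the standard definitions of sensitivity, specificity, accuracy and MCC. The confusion counts in the usage lines are hypothetical and are not results from this study.

```python
import math
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds nearly equal groups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy in percent, plus MCC in [-1, 1]."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "accuracy": 100.0 * (tp + tn) / (tp + tn + fp + fn),
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Split positives and negatives separately so every fold stays balanced
positive_folds = five_fold_indices(191, seed=1)
negative_folds = five_fold_indices(191, seed=2)

# Hypothetical confusion counts for one test fold
metrics = confusion_metrics(tp=90, tn=80, fp=20, fn=10)
```

For each of the five rounds, one positive fold and one negative fold form the test set and the remaining four of each form the training set; the five sets of metrics are then averaged.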

Support vector machine (SVM)
Aromatase prediction was done with the SVM light program, a very successful machine learning approach. SVM light has been used in a variety of investigations, including plasminogen activator prediction, BacHbpred (bacterial hemoglobin prediction), Oxypred (oxygen-binding protein prediction) and VerHb (vertebrate hemoglobin protein prediction) [21,26,39-42]. The SVM can employ a range of kernel functions, including linear, polynomial and radial basis function (RBF) kernels [44]. We optimized distinct parameters for each prediction approach in the prediction studies. In the method, aromatase sequences were used as positive examples and non-aromatase sequences as negative examples. In practice, we ran SVM light with positive (+) labels for positive sequences and negative (-) labels for negative sequences.
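To make the role of the kernel concrete, the sketch below evaluates an RBF kernel and a generic kernel SVM decision value. It uses the textbook form f(x) = sum_i c_i K(s_i, x) + b, where c_i is the product of the dual coefficient and the label of support vector s_i; the function names and toy numbers are hypothetical, and sign conventions for the offset b differ between SVM packages.

```python
import math


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, coefficients, bias, gamma):
    """Weighted sum of kernel similarities to the support vectors, plus an offset."""
    total = sum(
        coefficient * rbf_kernel(vector, x, gamma)
        for vector, coefficient in zip(support_vectors, coefficients)
    )
    return total + bias


def predict_label(value):
    """+1 (aromatase) for non-negative decision values, -1 otherwise."""
    return 1 if value >= 0 else -1


# Toy model with one positive and one negative support vector (hypothetical)
value = decision_value([0.0], [[0.0], [1.0]], [1.0, -1.0], bias=0.0, gamma=1.0)
```

A larger gamma makes the kernel more local, so each support vector influences only nearby points; the trade-off parameter C separately controls how heavily training errors are penalized.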

Webserver
The aromatase-related protein prediction webserver was developed using HTML and CGI-Perl scripts. The backend runs on an Apache server under the Linux operating system. The prediction webserver can be accessed freely at https://bioinfo.imtech.res.in/servers/muthu/aromatase/home.html. It is a support vector machine (SVM) based classification method for predicting aromatase-related proteins. Users can paste their sequences in FASTA format into the text box on the submit page. The server predicts whether each input sequence is an aromatase or a non-aromatase protein, based on the selected approach: amino acid composition (AAC), dipeptide composition (DPC), PSSM or hybrid (AAC+DPC).

Effect of aromatase mRNA expression on cancer patient's survival
KM plotter Pan-cancer RNA-seq was used to analyze correlation of aromatase (CYP19A1) mRNA expression and survival in different available tumor types (

Amino acid composition analysis
The amino acid composition was computed for the aromatase sequences, and residue "L" was observed to occur at a much greater frequency than the others (above 10%) (Fig 2A). As shown in Fig 2A, "F", "P", "S" and "V" are each present at more than 6%, while "C" and "W" are each present at less than 2%. When the amino acid residue profiles of aromatase and non-aromatase sequences are compared, some residue patterns are similar, but not all (Fig 2B). The developed models can use these differences to distinguish aromatase from negative sequences.

Amino acid composition SVM modules
First, we used support vector machines (SVM) to develop models based on the amino acid composition of aromatase. The SVM was trained on a variety of datasets using the SVM light implementation. A 20-dimensional amino acid composition vector was used to train the SVM classifiers. SVM kernels and parameters were adjusted to give the best discrimination between the positive and negative protein sequence datasets. The maximum accuracy (ACC) of aromatase prediction based on amino acid composition was 87.42%, with 100% sensitivity (SN), 74.84% specificity (SP) and a Matthews correlation coefficient (MCC) of 0.87 (Table 2, Fig 3).

SVM modules using dipeptide composition
In general, SVM algorithms based on dipeptide composition are more effective than approaches based on single amino acid composition. SVM classifiers were therefore also constructed for dipeptide composition, which is represented by a 400-dimensional vector of dipeptide frequencies (20 x 20). During adjustment of the kernel parameter γ and the trade-off parameter C, the best prediction performance was found with γ = 3 and C = 375. We developed models to distinguish aromatase from non-aromatase sequences based on these parameters. The SVM-based model achieved a maximum accuracy of 84.05%, with 99.84% sensitivity, 68.26% specificity and an MCC of 0.82, as shown in Table 2 and

Hybrid (AAC + DPC) SVM modules
The aromatase prediction problem was also addressed using a hybrid prediction approach that integrated amino acid composition (AAC) and dipeptide composition (DPC). The hybrid approach yielded 85.12% accuracy, 98.68% sensitivity, 71.55% specificity and an MCC of 0.83 (Table 2, Fig 3). Relative to the DPC model, the hybrid model gained specificity (71.55% versus 68.26%) at a small cost in sensitivity (98.68% versus 99.84%), giving a slight improvement in overall accuracy; it did not exceed the accuracy of the AAC model (87.42%).

PSSM profile based SVM modules
Aromatase prediction models based on position-specific scoring matrix (PSSM) profiles were also developed to improve performance, and they achieved a maximum accuracy of 92.02% with 100% sensitivity, 84.05% specificity and an MCC of 0.92 (Table 2, Fig 3). In general, all models, including the simple AAC method, performed comparably well as measured by accuracy and MCC.

Prediction scoring graphs analysis
Prediction scoring graphs were also used to assess the performance of the SVM modules. A scoring graph plots the prediction score of each individual sequence tested and shows how the scores of sequences in the positive set are separated from those in the negative set by a threshold, which can be used to categorize predictions as positive or negative. However, not all positive or negative sequences are categorized correctly, which leads to false negative and false positive predictions. This analysis summarizes the prediction results to reflect this aspect of performance. In the AAC model, no positive sequences were predicted as negative, whereas one negative sequence was predicted as positive (Fig 4A). In the DPC model, no positive sequences were predicted as negative and no negative sequences were predicted as positive (Fig 4B). In the hybrid model, three positive sequences were predicted as negative and one negative sequence was predicted as positive (Fig 4C). In the PSSM model, one positive sequence was predicted as negative and one negative sequence was predicted as positive (Fig 4D). On the negative dataset, the false positive rate (FPR) was 0.005 for the AAC and hybrid models and 0.010 for the PSSM model.
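The counting behind a scoring graph can be sketched as follows. The threshold of 0 and the +1/-1 label encoding are assumptions for illustration, and the scores in the usage lines are hypothetical. The last line shows that a single false positive among 191 negative sequences corresponds to a rate of about 0.005.

```python
def count_errors(scores, labels, threshold=0.0):
    """Return (false_positives, false_negatives) for a score threshold.

    labels use +1 for positive and -1 for negative sequences.
    """
    false_negatives = sum(
        1 for s, y in zip(scores, labels) if y == 1 and s < threshold
    )
    false_positives = sum(
        1 for s, y in zip(scores, labels) if y == -1 and s >= threshold
    )
    return false_positives, false_negatives


def false_positive_rate(false_positives, n_negatives):
    """Fraction of negative sequences that were predicted as positive."""
    return false_positives / n_negatives


# Hypothetical scores: one positive falls below and one negative rises above the threshold
fp, fn = count_errors([1.2, -0.3, -0.8, 0.4], [1, 1, -1, -1])

# One false positive in a negative set of 191 sequences
rate = round(false_positive_rate(1, 191), 3)
```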

BLAST data analysis
The results on the BLAST dataset indicate that the developed methods perform well in identifying aromatase with all approaches. We randomly picked five sequences from our dataset (CP19A_HUMAN, CP2F1_HUMAN, CP4Z1_HUMAN, GCM1_HUMAN and CP2A7_HUMAN), ran BLAST with each against the non-redundant (nr) database, and collected 500 sequences (100 per query sequence). Overall, the proposed method accurately identified 97.4% of the BLAST dataset sequences across all approaches. Individually, the AAC, DPC, hybrid and PSSM models correctly identified 99.2%, 93.8%, 96.6% and 100% of the sequences, respectively (Table 3). Thus, the PSSM approach identified all of the BLAST sequences (Fig 5). This result shows that our method outperforms the BLAST search in identifying aromatase-related proteins.
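The overall figure of 97.4% is the mean of the four per-approach rates, which can be checked with a short calculation. The sketch assumes that each approach was evaluated on the same 500 sequences; the function name is hypothetical.

```python
def overall_accuracy(percent_correct, n_sequences):
    """Convert per-approach percentages to counts and pool them."""
    correct = {
        name: round(n_sequences * percent / 100)
        for name, percent in percent_correct.items()
    }
    total = n_sequences * len(percent_correct)
    return correct, 100 * sum(correct.values()) / total


# Per-approach rates reported for the BLAST dataset
rates = {"AAC": 99.2, "DPC": 93.8, "Hybrid": 96.6, "PSSM": 100.0}
correct, overall = overall_accuracy(rates, 500)
```

Under this assumption the rates correspond to 496, 469, 483 and 500 correctly identified sequences, or 1948 of 2000 predictions in total.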

PLOS ONE

Discussion
Computational biology has helped us understand proteins from a new perspective, as algorithms can predict protein-protein interactions [45,46] and identify novel drug targets in various pathologies [47,48]. Algorithms that systematically analyze cancer and protein databases [49,50] have enhanced the accuracy of survival predictions for cancer patients [51][52][53][54], provided an understanding of drug-induced side effects [55] and allowed the identification of novel biomarkers [56]. To our knowledge, there are no algorithms for the structural and functional characterization of aromatase or its polymorphisms. As aromatase is a critical target in breast cancer patients [57,58], we established a reliable approach for detecting novel aromatase-related proteins, which will aid in developing novel AIs with improved efficacy.
Aromatase belongs to the cytochrome P450 family, whose members are heme-containing mono-oxygenases and highly flexible enzymes that allow easy substrate access and binding, and product release [59]. Unlike most P450s, which are not highly substrate selective, aromatase is set apart by its androgenic specificity. The structure of aromatase remained unknown for decades, which hindered explanation of its biochemical mechanism. Several laboratories purified aromatase from human placenta [60,61] and from recombinant expression systems [62,63]; however, attempts to crystallize it remained unsuccessful. So far, only one crystal structure, that of natural, full-length human placental aromatase, is known [64]. Finding aromatase-related proteins using in vivo and in vitro methods is therefore difficult, and low-cost computational methods such as SVM can be a reliable approach to identify novel aromatase-related proteins. Aromatase is the only vertebrate enzyme that catalyzes the aromatization of androgens into estrogens [64,65]. It is a monomeric integral membrane protein of the endoplasmic reticulum [66,67], comprising 503 amino acids and a heme group.
Aromatase has twelve α-helices and ten β-strands [64,68], and its active site is a distal cavity of the heme-binding pocket, with the heme iron as the reaction center [68]. Aromatase in peripheral adipose tissues drives estrogen biosynthesis in postmenopausal women, thus inducing breast tumors [69]. A small amount of estrogen can stimulate breast tumor formation, and aromatase protein is seen in epithelial as well as stromal breast cancer cells [70]. AIs are currently used to treat breast cancer patients; however, resistance to AIs and their toxicity create a need to discover novel AIs [71].
Survival analysis in various types of cancer patients using KM plotter showed that higher aromatase mRNA expression was associated with poorer overall survival (OS) in head-neck squamous cell carcinoma (Fig 1A), kidney renal clear cell carcinoma (Fig 1B), kidney renal papillary cell carcinoma (Fig 1C), liver hepatocellular carcinoma (Fig 1E) and stomach adenocarcinoma (Fig 1F) patients. Human fetal liver, kidney and intestine express significant levels of aromatase [72], but hepatic aromatase expression becomes untraceable in post-natal life [73]. Estrogens have been shown to promote the development and progression not only of breast cancer but also of endometrial, prostate and colorectal cancer by increasing mitotic activity [74,75]. The current survival analysis suggests a key role for aromatase as a tumor promoter, even in extragonadal tissues including head-neck, kidney, liver and stomach [76]. These results underline the need for a method to identify aromatase-related proteins for various types of endocrine-responsive tumors.
SVM is used in a variety of studies in basic science and medicine, including clinical data analysis, laboratory testing for the detection of disease and clinical trials of medicines [77][78][79]. In this study, we developed a reliable method for predicting aromatase-related proteins based on a variety of protein patterns, namely the AAC, DPC, hybrid and PSSM approaches. The overall prediction accuracy for aromatase-related proteins was 87.42%, 84.05%, 85.12% and 92.02% for AAC, DPC, hybrid and PSSM, respectively. The results of the BLAST search data analysis and the prediction score graph analysis demonstrate that the established method is effective in identifying aromatase-related proteins. We expect that our method will find undiscovered aromatase-related proteins, which will aid researchers in cancer predictive studies and precision medicine. As this is the first webserver to detect aromatase-related proteins, we cannot compare the performance of our method with that of other methods.

Conclusion
So far, there has been no web server or algorithm to predict or detect aromatase-related proteins. We therefore developed a highly accurate method for identifying aromatase-related proteins using SVM with various amino acid based approaches (Fig 6). The method was developed using five-fold cross validation with amino acid composition (AAC), dipeptide composition (DPC), hybrid (AAC+DPC) and position-specific scoring matrix (PSSM) approaches. We tested our models on known and unknown data, and all models detected aromatase-related proteins accurately. In future studies, we would like to work on aromatase inhibitors using molecular docking, and we are also interested in using deep learning techniques [80][81][82]. We believe that this study will help researchers find new or undiscovered aromatase-related proteins.