Identification of average molecular weight (AMW) as a useful chemical descriptor to discriminate liver injury-inducing drugs

Drug-induced liver injury (DILI) is one of major causes of discontinuing drug development and withdrawing drugs from the market. In this study, we investigated chemical properties associated with DILI using in silico methods, to identify a physicochemical property useful for DILI screening at the early stages of drug development. Total of 652 drugs, including 432 DILI-positive drugs (DILI drugs) and 220 DILI-negative drugs (no-DILI drugs) were selected from Liver Toxicity Knowledge Base of US Food and Drug Administration. Decision tree models were constructed using 2,473 descriptors as explanatory variables. In the final model, the descriptor AMW, representing average molecular weight, was found to be at the first node and showed the highest importance value. With AMW alone, 276 DILI drugs (64%) and 156 no-DILI drugs (71%) were correctly classified. Discrimination with AMW was then performed using therapeutic category information. The performance of discrimination depended on the category and significantly high performance (>0.8 balanced accuracy) was obtained in some categories. Taken together, the present results suggest AMW as a novel descriptor useful for detecting drugs with DILI risk. The information presented may be valuable for the safety assessment of drug candidates at the early stage of drug development.


Introduction
Hepatotoxicity is one of major causes of drug withdrawal from the market [1][2][3]. It is very difficult to detect liver injury potential in the early stage of drug development since the incidence of drug-induced liver injury (DILI) is very low, that is 1 in 50,000 to 1 in 100,000 patients (0.001% -0.002%) [4]. In addition, molecular mechanisms of DILI are barely understood and thus no valid predictive evaluation system has been established for DILI.
Various factors are associated with the onset of DILI. One of the major causes of DILI is the formation of reactive metabolites, which bind to and denature macromolecules, such as proteins and lipids, to induce cellular dysfunction, inflammatory stress and allergic reactions, and thereby toxicological responses [4][5][6]. It is also known that there are large inter-individual variations in the clinical symptom of DILI, which mainly result from the patient-related factors [7,8]. For example, genetic polymorphisms of genes associated with drug metabolism and immune/inflammatory response have been identified as risk factors for DILI [9], and the involvement of the haplotypes of human leukocyte antigen in DILI has been reported [10]. A variety of in vitro assays have been established to assess the toxicological potency of drugs at the cell level, such as those using a high-content screening (HCS) method [11] or monocytes to detect immune/inflammatory responses [12]. In addition, high-lipophilicity (e.g. logP � 3) [13] and high daily dose (e.g. >50-100 mg per day) [14] as well as the above-mentioned reactive metabolite formation by drug-metabolizing enzymes, such as cytochrome P450s and UDP-glucuronosyltransferases [15], have been suggested as DILI-inducible factors [16,17]. Therefore, a DILI prediction by the combination of HCS and the rule-of-two (i.e. daily dose � 100 mg/day and logP � 3) was tested and it showed improved prediction accuracy and reduced the number of drugs that were subjected to further experimental assessments, compared to a prediction by HCS alone [18]. These combination approaches are practical and employed in many pharmaceutical companies for drug development. However, they may not be applied to the early drug discovery stage since it requires daily dose information and synthesis of candidate compounds.
In silico methods have been applied to the toxicity prediction of chemical substances such as pharmaceuticals [19]. Toxicity predictions based on in silico methods can be performed at lower costs and in high-throughput systems, and they do not require chemical synthesis and in vivo and in vitro experiments. In silico approaches are thus ideal for screening at the early stage of drug development. In these studies, parameters (i.e. chemical descriptors) calculated from physiochemical properties of drugs are used, which can be easily obtained as quantitative data from chemical structures [20,21].
Decision tree-based analysis is a classical classification method of machine learning [22]. While its predictive performance is usually lower than that of other related methods, such as deep learning and random forest, it has an advantage that a constructed model has readability. Moreover, the resulting information on partial chemical structures and physicochemical properties that contribute to toxicity will be valuable for the feedback to medicinal chemistry to reduce toxicity through synthetic development at the early stage of drug development.
The aim of this study was to identify a physicochemical property that is simple to understand and useful for DILI screening in the early stage of drug development. To this end, we investigated chemical properties associated with DILI, using decision tree-based methods with DILI-positive drugs (DILI drugs) and DILI-negative drugs (no-DILI drugs) selected from Liver Toxicity Knowledge Base (LTKB) [23,24]. Chemical descriptors associated with partial chemical structures, lipophilicity, surface charges and others were calculated as indices of physicochemical properties of the drugs and we then sought to identify an important factor(s) to discriminate DILI drugs from no-DILI drugs.

Dataset
The DILIrank dataset, which consists of 1,036 US Food and Drug Administration (FDA)approved drugs, was obtained from LTKB, a project at the FDA National Center for Toxicological Research (https://www.fda.gov/science-research/liver-toxicity-knowledge-base-ltkb/ drug-induced-liver-injury-rank-dilirank-dataset). In the dataset, drugs are divided into four classes, v Most-DILI-concern (192 drugs), v Less-DILI-concern (278 drugs), v No-DILI-concern (312 drugs), and Ambiguous-DILI-concern drugs (254 drugs), according to their potential for causing DILI. Among them, all the v Most-DILI-concern, v Less-DILI-concern and v No-DILIconcern drugs were used as target compounds in this study, and v Most-DILI-concern and v Less-DILI-concern drugs were defined as DILI drugs (total of 470) and v No-DILI-concern drugs were as no-DILI drugs.
The DILIrank dataset contains the lists of drug names and PubChem CIDs. Based on the CIDs, the CAS number of each drug was obtained, and their chemical structures were confirmed and defined using SciFinder (Chemical Abstracts Service, Columbus, OH) and PubChem (https://pubchem.ncbi.nlm.nih.gov). The salt compounds were desalted (e.g. hydrochloride of hydrochloride salt compounds was removed) or neutralized (e.g. sodium carboxylates were changed to carboxylic acids), and peptides, nucleic acids and metal-containing drugs were excluded in this study. The list of final target compounds (432 DILI drugs and 220 no-DILI drugs) is shown in S1 Table.

Chemical descriptors
The SMILES format data of all the compounds used in this study were obtained from Pub-Chem database. The SMILES were converted to MOL format using OpenBabelGUI [25]. The two-dimensional (2D) chemical descriptors were calculated with Dragon 7 software (Talete, Milano, Italy) with MOL format data. The descriptors that were incalculable and those that were constant among all the target compounds were excluded, and the remaining 2,473 descriptors (S2 Table) were used for analyses.

Statistical analysis and machine learning
Statistical analyses and machine learning were performed using Microsoft Excel, JMP Pro 14 (SAS Institute, Cary, NC) and R (version 3.6.2 and 4.0.1) [26].

Chemical space of target compounds
The chemical space of the target compounds (432 DILI drugs and 220 no-DILI drugs) was compared with that of 12,262 chemical compounds that were obtained from PubChem database as references, using principal component analysis (PCA). PubChem contains about 100 million small molecules with fewer than 1000 atoms and 1,000 bonds [27]. For comparison, 20,000 numbers 442,486) were randomly generated and these numbers were used as potential CIDs to search SMILES with PubChem Identifier Exchange Service (https:// pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi). Finally, the valid SMILES format data of 12,262 chemical compounds were obtained. The chemical descriptors of these compounds (12,914 in total) were calculated. Those with zero or near-zero variance were excluded with the R function "nearZeroVar()" in the caret package, and further selection was performed considering linearity among descriptors using the R function "findLinearCombos()" in the caret package. The remaining 1,539 descriptors were subjected to PCA with the R function "prcomp ()" in the stats package.

Discriminant analysis using tree models
To build and validate decision tree models, nested cross-validation was performed as follows: The target compounds were randomly divided into 3 groups keeping the proportions of DILI drugs and no-DILI drugs same with the original proportion of the 652 drugs (432:220). Three trees were constructed using each pair of two groups as a training set and the rest was used as a test set. The decision trees that discriminate between DILI drugs and no-DILI drugs were constructed with the chemical descriptors as explanatory variables using the Classification and Regression Tree (CART) algorithm with the R function "rpart()" in the rpart package [22]. To determine the depth of each tree model, 5-fold cross-validation was carried out.
The final model was independently constructed with 5-fold cross-validation using the chemical descriptors of the 652 drugs. As the indices of prediction performance, accuracy, sensitivity, specificity, balanced accuracy (BA) and Matthews correlation coefficient (MCC), where TP, TN, FP, and FN represent the number of true positives, the number of true negatives, the number of false positives, and the number of false negatives, respectively: Sensitivity MCC ¼ TP � TN À FP � FN ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi The receiver operating characteristic (ROC) curves were created using the ROCR package on R and the area under curve (AUC) was calculated based on the ROC curve.
For the final tree model, Mean Decrease in Gini [22] were calculated as importance scores for each variable. The score takes a value of 0 or more, and an explanatory variable with a large score is considered to make a substantial contribution to the construction of decision trees and discrimination of drugs.

Categories of the Anatomical Therapeutic Chemical (ATC) classification system
Information on the ATC categories of the target compounds was obtained from KEGG drug database (https://www.genome.jp/kegg/drug/drug_ja.html).

AMW of approved drugs
The information on drugs approved during the period from 2010 to 2019 in EU, USA and Japan was obtained from KEGG drug database. Peptides, nucleic acids and metal-containing drugs were excluded. Drugs marketed in multiple areas were labeled with the latest year.

Selection of data set and its chemical space
We obtained a total of 652 drugs, including 432 DILI drugs and 220 no-DILI drugs, as target compounds in this study. Using the Dragon 2D chemical descriptors calculated, the chemical space of these drugs was compared with that of 12,262 reference compounds, which were randomly selected from PubChem database, to understand whether the target compounds represent the chemical space of global chemical compounds. To this end, PCA was performed with the 1,539 descriptors of the 12,914 target and reference compounds, and scattered plots of all the combinations of principal component 1 (PC1), PC2 and PC3 are shown in Fig 1. The distribution of the target compounds well overlapped with that of the reference PubChem compounds. These results suggest that the 652 target drugs belong to the typical chemical space of known chemical compounds.

Classification tree model for DILI drugs
To investigate chemical properties associated with DILI, we constructed a classification model using the chemical descriptors as explanatory variables. The target compounds were divided into three groups, and using these subsets, three decision tree models were constructed with the CART algorithm with 5-fold cross-validation, and the accuracy, sensitivity, specificity and BA of the training data and test data were calculated. These steps were repeated 10 times and the means of the performance values are shown in Fig 2. The models showed high sensitivity (>0.8) and moderate accuracy (accuracy and BA, >0.6), but low specificity (<0.5).
The classification tree model independently constructed with 5-fold cross-validation using all the target compounds is shown in Fig 3. The confusion matrix that summarizes the results of testing the final tree model is shown as Table 1, and the importance values of the decision tree variables are shown in Table 2. The accuracy, sensitivity, specificity, BA, MCC and ROC-AUC of the model (Fig 4A) were 0.78, 0.89, 0.59, 0.74, 0.51 and 0.76, respectively. Although the specificity was relatively low as expected from the imbalanced dataset (i.e. DILI: no-DILI = 432:220), the model showed moderate discriminating performance. We found that the descriptor AMW, which represents average molecular weight, was the most important for discriminating DILI drugs from no-DILI drugs since it was used at the first node and showed the highest importance value (Fig 3, Table 2). The number of drugs whose AMW values are 7.4 or larger was 340, among which 276 (81%) were DILI drugs (Fig 3, Table 3).
MLOGP, Moriguchi octanol-water partition coefficient [28,29], was used at the second node and showed moderate importance (ranked 9th) (Fig 3, Table 2). The number of drugs whose AMW values are less than 7.4 and MLOGP values are less than -0.454 were 29, among which 28 (97%) were no-DILI drugs, indicating that MLOGP was important to detect no-DILI drugs in this model.

Evaluation of AMW as a novel discriminator between DILI drugs and no-DILI drugs
We found AMW, which is calculated by dividing molecular weight by the number of atoms, as a useful discriminator for DILI drugs from no-DILI drugs based on the decision tree model. We thus performed ROC analyses with AMW alone and AUC of 0.68 was obtained (Fig 4B), suggesting that AMW is a good descriptor for the discrimination. In addition, AMW is easy to understand and calculate, and should be available in the early stage of drug development. We therefore evaluated its usefulness hereafter. As shown in the contingency table (Table 3), the proportion of DILI drugs in the drugs having the AMW value of �7.4 was considerably high (81.2%) while that in the drugs having the AMW value of <7.4 was 50%. Moreover, among the 652 target drugs, 340 (52%; 276 DILI and 64 no-DILI drugs) have the AMW values of �7.4, and the proportion was increased to 64% (276 in 432) for DILI drugs only and that of no-DILI drugs only was decreased to 29% (64 in 220) ( Table 3). In this study, we grouped drugs with v Most-DILI-concern and v Less-DILI-concern in DILIrank as DILI drugs and we found that the ratio of the numbers of drugs having the AMW value of �7.4 to those having the AMW value of <7.4 was higher for the drugs with v Most-DILI-concern than those with v Less-DILI-concern (Table 3). Taken together, AMW with a threshold of 7.4 enabled us to identify about two-thirds of DILI drugs with 71% specificity and 66% accuracy (276 plus 156 per 652 drugs). Therapeutic category is generally available at the early stages of drug development. We thus investigated the applicability of the DILI/no-DILI discrimination using the combination of AMW and such information. To this end, the test drugs were sub-grouped by the first level category of the ATC classification system, which represents main anatomical/pharmacological groups (Table 4), and discrimination using "AMW � 7.4" as a criterion was carried out. The number of drugs belonging to each ATC first level category and the calculated accuracy, sensitivity, specificity and BA are shown in Table 4. In categories B (blood and blood forming organs), J (antiinfectives for systemic use), M (musculo-skeletal system) and S (sensory organs), high precision discrimination was observed with all the four indices being over 0.7.  On the other hand, in categories C (cardiovascular system), G (genito urinary system and sex hormones) and R (respiratory system), either sensitivity or specificity was below 50% and the accuracy was 60% or less. The lowest accuracy (accuracy, 35%; BA, 42%) was observed in category V (various). Finally, to evaluate the usefulness of AMW as a safety screening tool in drug development, we investigated the AMW values of newly approved drugs to understand the trend of the values (Fig 5). Among the drugs investigated (306 drugs), 179 (58%) have the AMW values of �7.4 (S3 Table), and the mean AMW values remained at around 7.9 during the last 10 years. With the exception of the year of 2013, the ranges of the AMW values were within that of the 652 target drugs when the outliers were excluded. It should be noted that the mean AMW value of the withdrawn drugs (53 drugs among the 432 DILI drugs) was much higher than the mean value of DILI drugs.

Discussion
To identify chemical properties useful for discrimination between DILI drugs and no-DILI drugs, we constructed a decision tree model using the chemical descriptors as explanatory variables. The first node rule of the decision tree constructed was "DILI-positive if AMW is �7.4", and we found that 64% of the DILI drugs were correctly classified by this criterion only and its AUC-ROC was 0.68. Moreover, only 29% of no-DILI drugs were found to have the AMW values of �7.4. In fact, the mean AMW values of DILI drugs and no-DILI drugs were 7.99 and 7.49, respectively, which were significantly different, and the mean AMW value of withdrawn drugs (8.50) was much higher than that of DILI drugs.
Among the descriptors used in this study, AMW (i.e. average molecular weight, which is calculated by dividing molecular weight by the number of atoms) is one of the simple descriptors, since it is easily calculated from the chemical formula of compounds without special software and knowledge on physicochemistry and computation. More importantly, AMW is available at the very early stage of drug development before chemical synthesis once target structures were determined. These facts suggest that AMW is a very simple and useful descriptor for discrimination between DILI drugs and no-DILI drugs and is applicable to medicinal chemistry to reduce hepatotoxicity potential during the drug development process. The value of AMW becomes larger as a molecule contains more heavy atoms, such as halogens, oxygen, phosphorus and sulfur atoms. In contrast, a saturated molecule with a large number of hydrogen atoms has a smaller value. While many small-molecule drugs, such as those used in this study, are usually composed of hydrocarbon structures modified with nitrogen, oxygen, phosphorus, sulfur and halogen atoms, the presence of halogens, such as chlorine and iodine, especially contributes to a significant increase in AMW values. Since the inclusion of these atoms is often associated with toxic alert structures and/or reactive metabolite formation [30], large values of AMW may reflect the reactivity of drugs with biomolecule.
Therapeutic category information is generally available at the very early stages of drug development, and thus such information might be used as different types of descriptors from conventional descriptors, such as Dragon descriptors used in this study. With ATC classification as target disease information, the ATC first level categories, in combination with AMW, were used for discrimination. We found that the accuracy of discrimination by AMW was dependent on the categories. The accuracy was high for categories B, J, M and S while it was low for the others. Since drugs with similar chemical structures are often included in a certain ATC first level category, the results suggest that the discrimination accuracy with AMW is influenced by total and/or partial chemical structures or other factors of drugs. In addition, these results also suggest that the applicability of AMW for DILI discrimination is limited to drugs with specific but not yet identified chemical structures and/or pharmacological activity. Identification of these properties in future studies will corroborate and further increase the usefulness of AMW in the discrimination of DILI drugs.
A survey of the AMW values of newly approved drugs in EU, USA and Japan over the last 10 years (2010 to 2019) has shown that the average values have remained around 7.9 from 2013 to 2019, implying that a large number of the approved drugs have DILI risk (i.e. AMW � 7.4) based on our present findings. Given that severe hepatotoxicity is reported for several drugs on the market, we believe that the criterion of "AMW � 7.4" is also valuable to evaluate the hepatotoxicity potential of drugs on the market.
In the final decision tree model, the descriptor MLOGP was used at the second node. MLOGP is a calculated logP value by Moriguchi's method and is reported to well explain experimentally determined logP values of more than a thousand compounds [28,29]. The importance of the logP-related value for the discrimination of DILI drugs agrees with the previous findings that lipophilicity is a key characteristic of DILI drugs [13,18].
As shown in Table 2, the importance values of AMW and GATS1m were very similar. Not only AMW but also GATS1m are considered to be descriptors related to DILI risk. In a very recent report using Dragon descriptors by Ancuceanu et al., GATS1m was frequently used in 17 feature selection algorithms based on the DILIrank dataset and its higher values are associated with the lack of hepatotoxicity [31]. In their study, SpPosA_B(m), H%, Mp, MLOGP and PCR were also reported as the descriptors associated with hepatotoxicity, some of which were used in our final decision tree model as well. However, AMW was not identified in their report. Although the exact reason is unknown, it may be due to a subtle difference in the set of drugs used in their and our studies.
There are several reports on in silico DILI prediction models using LTKB [31][32][33][34]. The accuracy of the models using a Decision Forest algorithm reported by Tong's group at FDA were 69.7% and 72.9% [32,33]. The ensemble voting model with various machine learning methods and several molecular fingerprints showed the accuracy of 77.25% (5-fold cross-validation) and 81.67% for the test set [33]. A Meta-model built with a variety of machine learning algorithms using Dragon descriptors showed similar accuracy (BA was 74.6%) [31]. In this study, the accuracy of the decision tree model with 5-fold cross-validation was 78%, and more importantly the discrimination with AMW only showed 66% accuracy, indicating AMW as a very useful single predictor for DILI drugs, although there are some limitations of the utility of AMW. First, AMW was identified from the limited number of drugs (652 drugs in the DILIrank dataset of LTKB) and we have not performed validation with an external dataset because no such a dataset is publicly available. Second, the discrimination accuracy (66%), sensitivity (64%) and specificity (71%) are not high enough to use AMW alone for drug screening and it may need to be used in combination with other descriptors. Finally, the applicability domain of AMW has not been identified. Based on the results obtained by the analyses using the ATC category, AMW is suggested to be applicable to only some types of drugs since the accuracy of the classification with AMW largely depended on the category.
In conclusion, using decision tree analyses with chemical descriptors, we obtained useful information of physicochemical properties of drugs for DILI discrimination. In particular, AMW has been found to be a useful chemical descriptor for DILI discrimination. Since calculation of AMW does not require chemical synthesis of candidate compounds, the consideration of AMW will be very valuable for DILI risk evaluation at the very early stage of drug discovery. Moreover, we found that the utilization of the disease area information (i.e. ATC classification information) increased the accuracy of DILI prediction using AMW. Taken together, the present findings provide useful information that is applicable to the safety assessment of drug candidates at the early stage of drug development.
Supporting information S1