Quantitative Structure-Property Relationship (QSPR) Modeling of Drug-Loaded Polymeric Micelles via Genetic Function Approximation

Self-assembled nano-micelles of amphiphilic polymers represent a novel anticancer drug delivery system. However, their full clinical utilization remains challenging because the quantitative structure-property relationship (QSPR) between the polymer structure and the efficacy of micelles as a drug carrier is poorly understood. Here, we developed a series of QSPR models to account for the drug loading capacity of polymeric micelles using the genetic function approximation (GFA) algorithm. These models were further evaluated by internal and external validation and a Y-randomization test in terms of stability and generalization, yielding an optimization model that is applicable to an expanded materials regime. As confirmed by experimental data, the relationship between microstructure and drug loading capacity can be well-simulated, suggesting that our models are readily applicable to the quantitative evaluation of the drug-loading capacity of polymeric micelles. Our work may offer a pathway to the design of formulation experiments.


Introduction
Cancer, as one of main diseases, will soon replace heart disease to become world's leading killer, seriously threatening human health [1]. Chemotherapy is an effective method for the treatment of cancer. However, the toxicity and side effects of anticancer drugs can be life threatening, and most of these drugs suffer from poor water solubility, have short half-lives, and can cause tumor cells to become drug resistant [2]. Extensive research is therefore being conducted on new formulations that might effectively improve the curative effect of anticancer drugs while minimizing or even eliminating their toxicity and side effects. One promising approach is to consider using a new drug delivery system to control the target delivery of drugs to the lesion [3,4]. Nano-micelles self-assembled from amphiphilic polymers represent a candidate system due to their many advantages, including small particle size, release behavior, improved drug solubility and the potential for both passive and active targeting ability [5][6][7][8][9]. Until now, many (PCL) 3 (PDEA-b-PPEGMA) 3 , three kinds of four-homoarm star polymers (PCL-b-PDEA-b-PPEGMA) 4 , and two kinds of six-homoarm star polymers (PCL-b-PDEA-b-PPEGMA) 6 . Fifteen kinds of designed polymers were synthesized in our laboratory. Drug-loaded micelles were prepared using two DOX/polymer ratios (10/40 and 20/40 (mg/mg)), resulting in 30 data LC datasets (LC datasets were transformed using natural logarithms (ln) to better conform to the normal distribution), as shown in S1 Table [36][37][38]. All LC data were measured in aqueous solution at pH 7.4 and room temperature. All details are shown in Fig. 1 and S1 Table. To build and validate a QSPR model of good general ability, the original dataset was divided into a training set and a test set using the Duplex method which was an effective way to select representative training and test sets [39]. Briefly, Euclidean distances were calculated between each pair of samples as follows: (1) The two samples that are furthest away from each other are selected for the training set. (2) From the remaining samples, the two samples that are furthest away from each other are included in the test set. (3) The remaining point that is furthest away from the two points in step (1) is included in training set, and the one furthest away from the two points in step (2) is included in the test set. Finally, the datasets were split into the training set (containing 22 data) and the test set (containing 8 data, but not less than five) [40]. See S1 Table for a detailed list.

Computational details regarding descriptors
Total fifty-two kinds of molecular descriptors (including physical and chemical descriptors, fragment counts, topological, spatial and energy descriptors, atomic volumes and surfaces, and atomistic descriptors which are suitable for describing the relationship between polymer molecular structure and micellar property) could been computed using the QSAR module of Materials Studio 5.0 (Accelrys Inc.) [41] and used as candidates, as shown in S2 Table. To obtain the lowest energy conformation of the polymer molecules, the geometric structures of all polymer molecules were constructed and optimized using the following three steps. First, the initial polymer structure was constructed in the Materials visualizer. Then, the polymer structures were minimized in the Discover Module by setting the optimization method to smart minimizer, the force field type to compass, the convergence level to ultra-fine, the non-bond methods to van der Waals and Coulomb interactions, and the summation method to atom-based. Finally, the structures were optimized using molecular dynamics simulation for the NVE and NVT ensembles in turn, and the temperature was set to 298K; the number of steps was set to 10,000. Then, values of 52 descriptors were calculated using the QSAR and Forcite Modules of Materials Studio 5.0 based on the polymer structure that was optimized as described above. These descriptor values were separated according to the quality ratios of the drug/polymer. Constant terms, near-constant values and pairwise-correlated descriptors (one of any two descriptors with a correlation coefficient greater than 0.99) were excluded in a pre-reduction step. The remaining 36 descriptors and their detailed values for the 22 polymers, showing in S3 and S4 Tables, were selected and calculated for the following research.

Model development
In this study, some methods were used to develop and evaluate a reliable model. GFA-MLR technique, derived from an analogy with the evolution of DNA [42], was used to generate a series of models. In GFA algorithm, an individual or model was represented as one-dimensional string of bits. It was a distinctive characteristic of GFA that it could create a population of models rather than a single model. GFA algorithm, selecting the basis functions genetically, developed better models than those made using stepwise regression methods.
And then, the models were estimated using the "lack of fit" (LOF), which was measured using a slight variation of the original Friedman formula [34,35], so that best model received the best fitness score. The LOF was determined by the following revised equation in Materials Studio 5.0: where, SSE is the sum of squares of the errors, c is the number of terms in the model (other than the constant term), and d is a scaled smoothing parameter, which is used to evaluate the Friedman LOF statistic. Larger values of the smoothness parameter cause larger models to be penalized to a greater degree. p is the total number of descriptors contained in all model terms (ignoring the constant term), n is the total number of samples in the training set, and λ is a safety factor, which was assigned the value 0.99 to ensure that the denominator of the expression would never become zero. The applicability domain for the best QSPR model should be defined [43]. In this work, we have applied the leverage approach to investigate the domain of applicability of polymers [44]. It is a great benefit of this approach that can draw the Williams plot (a plot of standardized predicted residuals versus leverage values) to observe the applicability domain of a QSPR model.

Model validation
The fitting ability, stability, reliability and predictive ability of the developed models were evaluated by some validation [43][44][45][46], including internal and external validation parameters.
Internal validation parameters R 2 : R 2 (the square of the correlation coefficient) describes the fraction of the total variation attributed to the model. The closer the value of R 2 is to 1.0, the better the regression equation explains the Y variable. R 2 is the most commonly used internal validation indicator and is expressed as follows: where, Y obs ; Y pred ; Y training are the experimental property, the predicted property and the mean experimental property of the samples in the training set, respectively. R 2 cv : R 2 cv (the cross-validated correlation coefficient) is derived from cross validation and is the cross-validated equivalent of R 2 . In general, the developed models were subjected to internal validation using the LOO-cross-validation method. R 2 cv is expressed as follows: where, Y obs and Y training have the same definitions as in Equation (2) above. Y 0 pred is the LOOpredicted property of the samples in the training set. The closer this value is to 1.0, the better the predictive ability of the model [45].
RMSE: The root mean square error (RMSE) is dispersion degree of random error, presenting a more intuitive index of the fitting ability of the model [46], and is defined as follows: where, i represents sample i and n is the total number of samples in the training set. The lower the value of RMSE is, the better the predictive ability of the model. Y-randomization test: The Y-randomization test is a statistical method that is widely used to test the reliability and robustness of a model [47]. The purpose of the Y-randomization test is to detect and quantify chance correlations between the dependent variable and the descriptor. In this method, a QSPR model is recalculated for randomly reordered responses, and the descriptor matrix remained unchanged. The obtained models should exhibit significantly lower values of R 2 or R 2 cv than the original model because the relationship between the structure and the response has been broken. Y-randomization is performed through response scrambling with maximum iterations of 500; then, the mean values of R 2 or R 2 cv are calculated. If either R 2 or R 2 cv is higher than 0.5, the original model is suspected to be relevant, and the reliability of the QSPR model is doubtful [48].

External validation parameters
Internal validation is an essential step in QSPR model development. The desired internal validation results show that the model exhibits higher stability and prediction ability. However, no real prediction ability is shown for external samples. Therefore, the external predictive ability and extrapolation of the models should be evaluated. R 2 pred : R 2 pred is termed the predictive R 2 of a development model and is an important parameter that is used to test the external predictive ability of a QSPR model [48]. R 2 pred is defined as follows: where, Y obsðtestÞ ; Y pred ðtestÞ ; Y training are the experimental property, the predicted property and the mean of the experimental property of the samples in the test set, respectively. r 2 m and Dr 2 m : r 2 m and Dr 2 m are developed to obtain a true predictive QSPR model based on the r 2 m and r 0 2 m parameters proposed by Roy et al [49]. r 2 m , which has been found to be a better metric than the original r 2 m , is the average value of r 2 m and r 0 2 m . Dr 2 m is the absolute difference between r 2 m and r 0 2 m . If the value of r 2 m is higher than 0.5, the value of Dr 2 m should be lower than 0.2 [50]. In general, better models exhibit lower Dr 2 m values. The equations for these parameters are shown in S5 Table. Applicability domain After internal and external validation, it cannot be claimed that this model will provide reliable results for any unknown sample, even though all evaluation indexes prove that the model is stable and reliable and exhibits good generalization and predictive ability. In fact, each model has its own applicability domain. If the predictive value of the sample lies within this applicability domain, the value is reliable. On the contrary, the model remains unreliable [50]. Thus, it is necessary to define the applicability domain of the model before it is applied to simulate unknown samples.
The leverage approach is generally used to define applicability domain of a model, and the leverage approach requires the assumption that the datasets used follow the normal distribution [44]. This approach can quantify the applied range of the model, and the results can be presented using the Williams plot, in which the leverage values (or Hat values) and cross-validated standardized residuals are used as the abscissa and ordinate, respectively. In this plot, two horizontal lines and one vertical line delimit a safety zone. The critical leverage h Ã (the vertical line) is generally fixed at 3(p + 1)/n (where n is the number of training compounds and p is the number of variables used in the model, respectively). The leverage (h i ) of every sample is defined as follows: where, x i is the descriptor row-vector of the query ith sample, X is the characteristic matrix of the training set, and m is the number of query samples. A higher leverage (h i > h Ã , outside the safety zone) indicates that the predicted response is unreliable because it goes beyond reasonable extrapolation. If the cross-validated standardized residual of the sample is lower or higher than three standard deviation units (represented as the two horizontal lines), the sample remains outside the reasonable range. Only values that fall within this area are considered reliable.

Results and Discussion
Determination of the optimal descriptor number To select the descriptors that are most relevant to the ln(LC) of the polymeric micelles, 36 descriptors, which were calculated using Materials Studio, were used as inputs for the GFA algorithm. The optimal subset size was realized when further increases in the descriptor did not significantly improve model performance. Here, LOO-cross-validation was used to estimate the models. A plot of R 2 cv and R 2 against the number of selected descriptors is shown in Fig. 2 and indicates that the optimal model consists of five descriptors [39]. Moreover, R 2 cv is a key parameter regarding the predictive ability of the model. The closer the value of R 2 cv is to 1, the better prediction ability the model can deliver. For a good model, R 2 cv should be closer to R 2 (R 2 cv is usually lower). If R 2 cv is far smaller than R 2 , then the possibility of data over-fitting in the regression model becomes significantly higher. When the number of descriptors is 5 (Fig. 2), R 2 cv fits R 2 most comfortably, indicating that the optimal subset size is five.

Model building
To develop the optimization model, we included 22 samples in the training set. The number of descriptors in the regression equation was 5, and Population and Generation were set to 1,000 and 5,000, respectively. The number of top equations returned was 10 (starting from the tenth model, the value of R 2 is lower than 0.9). Mutation probability was 0.1, and the smoothing parameter was 0.5. The models were scored based on Friedman's LOF. The statistical parameters of the ten models are shown in Table 1. Regarding the external validation parameters for Model 1 (Table 1), R 2 pred is the highest (0.811) and RMSE(b) is the lowest (0.1121) among the models. The r 2 mðtestÞ (0.75) of Model 1 is higher than 0.5, and the Dr 2 mðtestÞ (0.04) is much lower than 0.2. These results show that Model 1 exhibits slightly higher external prediction ability than the other models.
As seen in S6 Table, Model 1 shows the minimal residuals, possessing the optimal prediction ability. With regard to the optimization model, all of the predicted values of drug loading capacity of the samples in the training and test sets are close to their experimental values, as shown in S1 Table. Moreover, Fig. 4 shows that the sample points in the training and test sets are uniformly distributed along the line y = x, indicating that the residual errors of the predicted and experimental values are very low for Model 1. For example, the predicted value of No. 26 is 14.4, very similar to the experimental value 14.5, and the residual value (0.0040) is also much lower than 0.1.
S7 Table is the correlation matrix of the five descriptors in the optimization model. According to the previous research [51], all the correlation coefficients are less than 0.95, indicating the correlation analysis indicated that SSOV, SSA, EV TPE and IE are not highly correlated.  Furthermore, if the multi-collinearity was present, in order to avoid its effects, probability (p) values of each coefficient are used to check whether multi-collinearity is affecting a correlation. It is generally accepted if the p-value is less than 0.05 [52]. Table 2 shows that all the p-values of five descriptors in Model 1 are very low (p 0.005), showing that multi-collinearity could not affect the correlation here.
Based on the above analysis, the 5-parameter version of Model 1 was selected as the optimization model, as shown in Table 2 Applicability domain of the optimization model  , respectively, and these values are the largest found among the ten models. Moreover, RMSE(a) was the smallest among the models, and the difference between R 2 and R 2 cv was the lowest. These results suggest that Model 1 exhibits the best fitting ability and the best internal predictive ability. To examine the stability of Model 1, the Y-randomization test was conducted, and a residual scatter diagram was plotted. After repeating the Y-randomization test more than 500 times, the mean values of R 2 and R 2 cv became 0.043 and 0.005, respectively, much lower than 0.5, indicating that Model 1 is more stable, and there is no "chance correlation" phenomenon occurring for Model 1. As seen in Fig. 3, the points are distributed irregularly and randomly, proving that Model 1 is more stable.

Descriptor interpretation and mechanism analysis
By interpreting the descriptors in the optimization model, it is possible to gain some insight into the factors that are likely to govern the ln(LC) of these polymeric micelles. The relative importance of the descriptors was determined based on their standardized regression coefficients. Table 2 shows that the most important descriptor is the SSOV, which is the solvent surface occupied volume of the polymers. This descriptor is mainly correlated with the hydrophilicity of a polymer molecule in aqueous solution. The SSA is a solvent surface area descriptor of the polymers and describes the hydrophobicity of polymer molecules in aqueous solution. EV is the ellipsoidal volume descriptor, which describes the volume of the ellipsoid of inertia, which is derived from the inertia tensor of the system. This descriptor is mainly correlated with the rigidity (indirectly reflecting the hydrophobicity) of polymer molecules in aqueous solution. IE is the inversion energy descriptor, which describes the energy required to transform a molecule from one spatial form (or invertomer) to another. This descriptor is a good indicator of the nature of the bonding between the atoms of a polymer.  Table 3 presents the relationship between the five descriptors and the three monomers. As shown in the table, positive correlations exist between PCL or PDEA and the five descriptors, whereas negative correlations exist between PPEGMA and the corresponding parameters. The correlation coefficients for the former group are slightly higher than those of monomers. An unknown synergistic effect might occur between the two hydrophobic blocks (PCL and PDEA). Table 4 shows the contributions of different segments to the descriptor values. The calculated values are based on the data presented in S9 Table using the PPEGMA monomer as a reference. For example, the contribution to SSOV caused by the DEAEMA monomer is calculated as follows: C DEA = (705.01/1583.56)/(187.28/533.45) = 1.27. With regard to the descriptor SSOV, different contributions resulted from the three monomers in the order C CL > C DEA > C PEGMA .
As shown in Table 4, the hydrophobic segments (CL and DEA) in the polymer exhibited enhanced SSOV, resulting in a higher value of SSOV. A larger micellar core might be induced by segments that are more hydrophobic, thus providing a larger surface area for solvent and resulting in higher value of SSOV. Conversely, more hydrophilic segments (PEGMA) yield lower values of SSOV, possibly because the PEGMA segment distributed into the surface of the micelle. The same is true for SSA. The effect of the PEGMA segment on EV is greater than those of the CL and DEA segments. However, the relationship between hydrophilic segments and EV is negative, as shown in Table 3. The pH-sensitive DEA segment is the main factor for the descriptor TPE, and this segment exerts a positive affect. The DEA segment is also the largest contributor to the descriptor IE and provides a much greater contribution (up to 114-fold) than those of the CL (10-fold) and PEGMA (1-fold) segments. Equation (7) and S10 Table demonstrate that the descriptors SSOV, SSA and IE are the main factors in the optimization model. Because the contributions of the three descriptors are lower than that of EV, the coefficients of the three descriptors are much higher than that of EV; IE shows the most dramatic change among the five descriptors. The equation describing the model shows that the drug loading capacity can be enhanced by reducing SSOV and increasing the remaining four descriptors (SSA, EV, TPE and IE). According to the above analysis, although increases in CL and DEA enhance the values of SSOV (reducing LC) and the other four descriptors (increasing LC), the overall effect is to increase LC. An increase in PEGMA decreases the value of SSOV (increasing LC) but decreases the other descriptors, reducing the LC overall. In other words, to enhance the drug  loading capacity of the micelles, the number of hydrophobic segments (DEA and CL) should be increased and the number of hydrophilic segments (PEGMA) should be decreased; this conclusion is consistent with the experimental observations. Thus, the amounts of CL and DEA positively and significantly affect the drug loading capacity. The hydrophobic CL and DEA units are located in the core of the spherical micelle and provide sufficient space to accommodate the anticancer drug DOX in a synergistic manner. Decreasing the number of PEGMA units is also important for increasing the drugloading capacity.

Conclusions
This work addresses the QSPR between a series of amphiphilic four/six-arm star polymer structures and their micelle drug-loading capacities. Our study developed and evaluated ten models based on the GFA, the LOF function, MLR and LOO-cross-validation methods. The optimal descriptor number of the model was determined to be five based on the statistical results of the LOO-cross-validated correlation coefficient (R 2 cv ) and the square of the correlation coefficient (R 2 ). By analyzing the internal and external validation parameters of all models, the QSPR model was confirmed as optimal and to possess good fitting ability, good predictive ability and high stability. The influence of polymer structure on micelle drug-loading capacity was also analyzed. This study provides an effective approach for the design and synthesis of new star polymers with desired drug-loading capacities.
Supporting Information S1