Computational Intelligence Modeling of the Macromolecules Release from PLGA Microspheres—Focus on Feature Selection

Poly-lactide-co-glycolide (PLGA) is a copolymer of lactic and glycolic acid. Drug release from PLGA microspheres depends not only on polymer properties but also on drug type, particle size, morphology of the microspheres, release conditions, etc. Selecting a subset of relevant properties for PLGA is a challenging machine learning task, as there are over three hundred features to consider. In this work, we formulate the selection of critical attributes for PLGA as a multiobjective optimization problem with the aim of minimizing the error of predicting the dissolution profile while reducing the number of selected attributes. Four bio-inspired optimization algorithms, namely antlion optimization, a binary version of antlion optimization, grey wolf optimization, and social spider optimization, are used to select the optimal feature set for predicting the dissolution profile of PLGA. In addition, the LASSO algorithm is used for comparison. Selection of crucial variables is performed under the assumption that both predictability and model simplicity are of equal importance to the final result. During the feature selection process, a set of input variables is employed to find the minimum generalization error across different predictive models and their settings/architectures. The methodology is evaluated with a range of predictive modeling tools, such as Cubist, random forests, artificial neural networks (monotonic MLP, deep learning MLP), multivariate adaptive regression splines, classification and regression trees, and hybrid systems of fuzzy logic and evolutionary computation (fugeR). The experimental results are compared with the results reported by Szlȩk et al. We obtain a normalized root mean square error (NRMSE) of 15.97% versus 15.4%, and the number of selected input features is smaller, nine versus eleven.


Introduction
Poly-lactide-co-glycolide (PLGA) is a copolymer of lactic and glycolic acid. Hydrolysis of the PLGA ester bonds in the human body yields monomers that occur naturally and can enter the Krebs cycle. Therefore, PLGA is considered a biodegradable and biocompatible polymer with minimal toxicity. A wide range of PLGA products is used as matrices for drug delivery systems such as microspheres, implants, in-situ gelling solutions, and medical devices (sutures, stents) [1]. Together with its biocompatibility, the drug-protective properties of PLGA carriers are exploited, e.g., for peptide drugs, DNA, and RNA [2].
Despite an enormous amount of research focusing on PLGA microspheres, the complexity of such multi-compartment systems requires a more thorough understanding of their behavior. Many kinds of variables can affect dissolution: it depends not only on polymer properties but also on drug type, particle size, the morphology of the microspheres, release conditions, etc. [3]. Extracting knowledge from already gathered data is one approach to filling this research gap. Therefore, in this work, we have used existing data sets to discover the complex relationships governing drug release from PLGA matrices.
This study was conducted with the aim of analyzing the potential of bio-inspired computational techniques for modeling experimental data from the pharmaceutical sciences, where there is a need to find the crucial variables governing the behavior of pharmaceutical dosage forms, in particular PLGA microspheres. During the last decade, research on elucidating the release mechanisms of macromolecular drugs from PLGA-based drug delivery systems has been in focus [4-7]. Most of this work is based on mathematical modeling [4,5,8,9]. Only a few studies directly apply heuristic computational techniques to knowledge extraction [10,11].
Due to the recently introduced Process Analytical Technology (PAT) initiative [12] and the concept of Quality by Design (QbD) [13], there is a need to introduce to the pharmaceutical industry robust, multivariate modeling techniques suitable for identifying the crucial variables governing the analyzed processes. Moreover, PAT urges a thorough understanding of the relationships responsible for pharmaceutical formulation behavior.
In QbD, this conforms with the selection of critical quality attributes (CQAs) and critical process parameters (CPPs), where the influence of formulation composition and technological process variables on the desired quality of the developed formulation should be taken into consideration. Numerous factors, like raw material stability, the lactide-to-glycolide ratio in the polymer, and the technological process, play a significant role in formulation quality. Thanks to the development of modern analytical techniques, the number of potential CQAs and CPPs can be counted in hundreds or even thousands of factors. In order to minimize the risk of inadequate formulation quality, the appropriate technology and raw materials are selected and controlled through a careful choice of a minimal set of CQAs and CPPs. Reduction of the control variables is performed both for the sake of clarity of reasoning and for the numerical stability of the developed models. For the latter, the so-called "curse of dimensionality" [14] is a rationale for simplifying the models. This is especially applicable in the holistic approach [15] integrating CQAs and CPPs into a single model. Such a solution is known to improve knowledge management and is a natural domain for the application of computational intelligence (CI) tools [15]. Data-driven models based on CI are developed without a priori assumptions and, based on automatically acquired knowledge, could be used to select CQAs and CPPs in an empirical manner. However, this task requires a robust and efficient feature selection method, which is the focus of this work, concentrating on the in vitro dissolution profiles of macromolecules from PLGA microspheres. Our rationale is that CI tools could select the CQAs and CPPs relevant to this endpoint.

Materials
For consistency in comparing the results of both approaches, the original data was extracted from the publication and its structure retained as in Szlȩk et al. [10]. In brief, the data was gathered after reviewing about 200 scientific publications. The extracted data consists of drug release profiles of 68 PLGA formulations from 24 publications. Originally, the input vector consisted of 320 variables (molecular descriptors of the protein, excipients, formulation characteristics, and the experimental conditions) and 745 time points (records). All data were numeric with continuous values, except variables such as "Production method" and "Lactide to glycolide ratio", which took discrete values (1, 2, 3, etc.). The amount of drug substance released (Q) was the only dependent variable, as shown in Fig 1; its values ranged from 0 to 100%. The data sets used during modeling have been uploaded to SourceForge [16].
The data set was split according to a 10-fold cross-validation scheme, each time excluding all data belonging to a particular formulation, thus simulating the real application of the model, which is forced to predict the dissolution behavior of an unknown formulation [17]. Preliminary studies indicated that linear scaling of the data to the range 0.1 to 0.9 did not lower the prediction error, thus the data was used without normalization. We did not use an external validation data set due to the limited data available in the literature. In order to predict dissolution with the proposed models, a publication must provide: a protein incorporated into PLGA microspheres, its dissolution profile, a protein with a known structure from which chemical descriptors can be obtained, process parameters, etc. Publications fulfilling all the above criteria are scarce.
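The grouping described above, where all records of one formulation are held out together, can be sketched as follows. This is a minimal illustration of the scheme, not the authors' code; the field name `formulation_id` is an assumption, not a column from the original data set.

```python
import random
from collections import defaultdict

def formulation_folds(records, n_folds=10, seed=0):
    """Split records into CV folds so that all records (time points)
    of one formulation always land in the same fold, simulating
    prediction for an entirely unknown formulation."""
    by_form = defaultdict(list)
    for idx, rec in enumerate(records):
        by_form[rec["formulation_id"]].append(idx)
    form_ids = sorted(by_form)
    random.Random(seed).shuffle(form_ids)
    folds = [[] for _ in range(n_folds)]
    for i, fid in enumerate(form_ids):
        folds[i % n_folds].extend(by_form[fid])   # whole formulation per fold
    return folds

# Toy example: 6 formulations, 4 time points each
records = [{"formulation_id": f, "t": t} for f in range(6) for t in range(4)]
folds = formulation_folds(records, n_folds=3)
```

Each fold then serves once as the test set while the remaining formulations are used for training.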

Procedure for crucial variables selection
The identification of the smallest possible feature subset is a complex search problem, proven to be NP-complete [18]. Therefore, an exhaustive examination of all possible subset combinations is not tractable for large feature sets. In our case, there are 2^320 possible subsets, roughly 2.13 × 10^96 sets of features to investigate. A reasonable approach is to use approximation algorithms or heuristics. We use randomized search, specifically bio-inspired randomized search algorithms, because there is evidence in the literature that they are effective for feature selection [19-22]. These algorithms evolve a population of feature subsets towards those that meet specified criteria. We try four different bio-inspired algorithms [23]. In addition to these methods, we use the well-known LASSO algorithm for comparison purposes.
Once a subset of features has been selected, the performance of a predictive model on the selected attributes is tested. We selected a number of well-known predictive tools, such as Cubist, random forests, artificial neural networks (monotonic MLP, deep learning MLP), multivariate adaptive regression splines (MARS), classification and regression trees (CART), and hybrid systems of fuzzy logic and evolutionary computation (fugeR). In fact, the predictive models are not just used at the end of the selection process; they are built into the evaluation (fitness) function of each of the feature selection methods. The feature selection methods improve a randomly generated solution iteratively, and each time an improvement is suggested, its quality is assessed by a predictive model.
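The wrapper evaluation described above can be sketched as follows: keep only the columns selected by the candidate mask, train the predictive model, and return its error. This is an illustrative sketch with a simple holdout split (the paper uses the formulation-wise 10-fold CV); `fit_predict` stands for any of the regressors listed above and is an assumed interface, not an API from the paper.

```python
def evaluate_subset(mask, X, y, fit_predict):
    """Score a candidate feature subset inside the fitness function.
    mask: list of booleans, one per feature column.
    fit_predict(X_train, y_train, X_test) -> predictions."""
    cols = [j for j, keep in enumerate(mask) if keep]
    Xs = [[row[j] for j in cols] for row in X]   # keep selected columns only
    split = int(0.8 * len(Xs))                   # simple holdout for illustration
    preds = fit_predict(Xs[:split], y[:split], Xs[split:])
    # RMSE on the held-out part
    return (sum((p - t) ** 2 for p, t in zip(preds, y[split:]))
            / len(preds)) ** 0.5
```

Because `fit_predict` is passed in, the same wrapper serves Cubist, random forests, MLPs, etc., which is how one evaluation function can drive all the predictive models.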
2.2.1 Feature selection models. The proposed feature selection models are needed to find a minimal subset of input variables (features), out of several hundred, that minimizes the model prediction error. The search space represents each variable as an individual dimension, the span of each dimension ranging from 0 to 1, and hence an inventive search method is required to find, in this huge space, an optimal subset that minimizes the fitness function. We define the fitness/objective function as

F = α · P + β · (R / C),    (1)

where P is the error of the prediction model, R is the size of the selected variable subset, C is the total number of variables in the data set, and α and β are two parameters corresponding to the importance of prediction performance and subset size, with α ∈ [0, 1] and β = 1 − α.
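The weighted objective above, trading prediction error against relative subset size, can be sketched directly. The default value of `alpha` below is illustrative, not the value used in the paper.

```python
def fitness(prediction_error, n_selected, n_total, alpha=0.99):
    """F = alpha * P + beta * (R / C), with beta = 1 - alpha.
    alpha in [0, 1] weights prediction performance against
    the fraction of features kept."""
    beta = 1.0 - alpha
    return alpha * prediction_error + beta * (n_selected / n_total)
```

With alpha close to 1, two subsets with similar error are ranked by size, so the smaller one wins; lowering alpha pushes the search harder toward tiny subsets.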
Below we briefly describe the algorithms used for selecting the features. The main parameters of the bio-inspired techniques are given in Table 1; other parameters specific to each method were set either according to domain-specific knowledge or based on trial and error in small simulations.

Antlion optimization.

The main loop of Antlion Optimization (ALO) is outlined below:

9: Select an antlion using a roulette wheel.
10: Slide the ants toward the antlion.
11: Create a random walk for ant i and normalize it.
12: end for
13: Calculate the fitness of all ants.
14: Replace an antlion with its corresponding ant if the ant becomes fitter.
15: Update the elite if an antlion becomes fitter than the current elite.
16: t = t + 1
17: end while
18: Return the elite antlion and its fitness.

Grey wolf optimization.
Grey Wolf Optimization (GWO) is a bio-inspired heuristic optimization algorithm that imitates the way in which wolves search for food and survive by avoiding their enemies [26]. Grey wolves are social animals that live in groups; a pack contains between 5 and 12 wolves on average. In the computational model, α is the fittest solution, while β and δ are the second and third best solutions. The hunting process is guided by α, β, and δ, while the ω wolves follow them. The algorithm is outlined in Algorithm 3, with more details available in [27,28].
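One position update of the standard GWO equations can be sketched as follows: each wolf moves toward the average of three positions dictated by the α, β, and δ solutions, with the coefficient `a` decreasing linearly from 2 to 0 over the iterations. This is a minimal sketch of the textbook update, not the authors' implementation.

```python
import random

def gwo_step(wolf, alpha, beta, delta, a, rng=None):
    """Update one wolf's position from the three best solutions.
    wolf, alpha, beta, delta: lists of floats (positions).
    a: exploration coefficient, decreased from 2 to 0 over time."""
    rng = rng or random.Random(0)
    new_pos = []
    for d in range(len(wolf)):
        moves = []
        for leader in (alpha, beta, delta):
            A = 2 * a * rng.random() - a       # |A| > 1 explores, |A| < 1 exploits
            C = 2 * rng.random()
            D = abs(C * leader[d] - wolf[d])   # distance to this leader
            moves.append(leader[d] - A * D)
        new_pos.append(sum(moves) / 3.0)       # X(t+1) = (X1 + X2 + X3) / 3
    return new_pos
```

For feature selection, the continuous positions can be squashed through a sigmoid and thresholded at 0.5 to obtain a binary inclusion mask; this binarization step is a common convention and an assumption here, not a detail stated in the text.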

Social spider optimization.
Social Spider Optimization (SSO) mimics the social behavior of a spider colony in nature [29]. SSO is swarm-based and consists of two main components: social members and a communal web [30]. The social spider optimizer is formally presented in Algorithm 4, with more details about the implementation available in [29,31,32]:

9: Calculate Vibc_i and Vibb_i.
10: end for
11: Calculate the male cooperative operator.
12: Calculate Vibf_i.
13: Perform the mating operation.
14: Update the best solution if a spider becomes fitter than the best.
15: end while
16: Return the best spider position and its fitness.
Besides the bio-inspired methods above, we also used the least absolute shrinkage and selection operator (LASSO) for comparison purposes. Tibshirani introduced LASSO for shrinkage and variable selection in 1996 [33]. LASSO applies a penalty to the regression coefficients; the resulting optimization is a quadratic program (QP) or a general non-linear program, which is recognized to be computationally expensive. Like ridge regression, LASSO improves prediction accuracy through shrinkage; in addition, it accomplishes reduction (shrinkage) and variable selection simultaneously, which makes it a useful tool [34].
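The mechanism by which LASSO performs selection can be illustrated with the soft-thresholding operator: coefficients whose (partial) correlation with the residual falls below the penalty are set exactly to zero, which is what removes variables from the model. This is an illustrative sketch of the operator, not the solver used in this work.

```python
def soft_threshold(rho, lam):
    """Soft-thresholding: shrink rho toward zero by lam, and set it
    exactly to zero when |rho| <= lam (the variable is dropped)."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0
```

Coordinate-descent LASSO solvers apply this operator to each coefficient in turn; larger penalties zero out more coefficients and hence select fewer features.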

Predictive models
1. Cubist: a package implementing the rule-based predictive decision trees first proposed by Quinlan [35]. Cubist models introduce linear equations at their terminal branches; therefore, they can predict numeric values. The maximum number of rules was fixed at 100, and the number of committees was varied from 1 to 100. The extrapolation parameter, which controls the ability of the created models to estimate beyond the original observation range, was set to 100. The sample parameter, the percentage of the data set randomly selected for model building, was set to zero, meaning that no subsampling was employed and all models were built on the previously prepared 10-cv data sets [36].
2. Random Forest (RF): creates an ensemble of decision trees using random inputs. The randomForest package of the R environment was used [37]. It implements the Fortran code by Breiman and Cutler [38] and is suitable both for classification and regression problems. As in classification, a regression forest is formed by growing trees depending on a random vector, but the tree predictor takes on numerical values instead of categorical responses. The output predictor is then formed by averaging over all trees [39]. During model development, the following parameters were used: the number of randomly selected variables at each split (mtry) was between 1 and half the size of the input vector, the maximum number of nodes (maxnodes) was between 10 and 500, and the number of trees (ntree) was from 10 to 500.
3. Monmlp (monotonic multilayer perceptron) [40,41]: used to take advantage of learning without back-propagation; thus, monotonicity was turned off. All of the prepared models had two hidden layers, each numbering from 2 to 20 nodes. The hidden layers use the hyperbolic tangent (tansig) transfer function, and the output layer uses a linear function. Ensemble systems consisting of ten neural networks were employed. Variables were scaled linearly from 0.1 to 0.9. The number of epochs was set from 50 to 1,000. The "trials" parameter was set to 5 to avoid getting stuck in local minima.

4. Deep learning neural networks (h2o): we observed that, in order to properly train neural networks on complicated, noisy, or nonlinear data, deep architectures may be needed. The term "deep architecture" refers to a neural network composed of multiple hidden layers with many neurons within each layer. Deep learning neural networks solve complex problems by introducing combinations of simpler solutions; therefore, such systems can operate in real-world environments [42]. Hyperbolic tangent was used as the activation function. Epochs varied from 1,000 to 10,000,000. The neural nets consisted of two to eight hidden layers with two to two hundred nodes per layer. Overall, more than 250 architectures were trained and tested [43].
5. fugeR: the idea of applying genetic algorithms to fuzzy modeling appeared in the early nineties [44]. Fuzzy systems gain the advantages of evolutionary techniques for developing a model from large, and often complex, data sets, while retaining the simplicity of representing data as linguistic variables and describing relations between variables by conditional rules. The fugeR package is designed for training fuzzy systems with evolutionary algorithms. The maximum number of rules developed during training varied from 10 to 500. The maximum number of variables per rule was set from 4 to 9. The number of generations was 50, 100, 200, 500, or 1,000, and the population size was either 100 or 500. The elitist parameter was set to 20% of every generation [45].
6. Classification and regression tree (CART): used for comparison with more sophisticated models such as the random forest. CART was first introduced by Breiman et al. [46]. CART is a machine-learning method for constructing prediction models from data. The models are trained by partitioning the multidimensional space and fitting a simple prediction model within each partition. When modeling is done, the results can be represented graphically as a decision tree. Regression trees, which were used in this article, are designed for dependent variables that take continuous or discrete values. In this case, the prediction error is typically measured by the squared difference between the observed and predicted values [47].
7. Multivariate adaptive regression splines (MARS): introduced by Jerome H. Friedman in 1991 [48]. A MARS model is a weighted sum of a constant and basis functions multiplied by coefficients. The basis function in MARS models is the so-called hinge function. Model development is usually composed of two steps. In the first step, the model starts from a single intercept and is extended iteratively by adding pairs of hinge functions. This process reduces the training error but produces an overfitted model. Therefore, in the next step, redundant basis functions are removed from the model to improve the generalization ability of the final model [49]. In this work, the earth package for the R environment was used as the implementation of the MARS method [50].
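The mirrored pair of hinge functions that the MARS forward pass adds at a chosen knot can be sketched as follows; a fitted MARS model is then a weighted sum of such terms plus an intercept.

```python
def hinge_pair(x, knot):
    """The two reflected hinge (basis) functions MARS adds per knot:
    max(0, x - knot) and max(0, knot - x)."""
    return max(0.0, x - knot), max(0.0, knot - x)
```

For example, with a knot at 1.0, an input of 3.0 activates only the first hinge (value 2.0), while an input of 0.0 activates only the second (value 1.0); the backward pruning pass later discards pairs that do not improve generalization.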

The proposed system evaluation
The measures described in the following section are used as indicators of the quality of the solutions produced by the computational models. Root mean square error (RMSE) is used by both the feature selection and the predictive models. Normalized RMSE is used for clarity of presentation and comparability with other works. Each algorithm was applied 20 times with random positioning of the search agents. The repeated runs of the optimization algorithms were used to test their convergence capability.

Performance metrics
We use the following notation:
• obs_i and pred_i are the observed and predicted values, respectively;
• X_max and X_min are the maximum and minimum observed values, respectively;
• μ is the mean of the observed values;
• n is the total number of samples;
• i is the index of a sample in the data set.
The indicators (measures) used to compare the different algorithms are as follows:

1. Root mean square error (RMSE): the square root of the mean squared error, the error being the difference between the observed and predicted outputs, as given in Eq (2):

RMSE = sqrt( (1/n) Σ_i (obs_i − pred_i)^2 )    (2)

2. Normalized root mean square error (NRMSE): the RMSE normalized by the range of the observed values, as in Eq (3):

NRMSE = RMSE / (X_max − X_min) × 100%    (3)

3. R-squared (R^2): the proportion of variability in a data set that is accounted for by a statistical model, as given in Eq (4):

R^2 = 1 − Σ_i (obs_i − pred_i)^2 / Σ_i (obs_i − μ)^2    (4)
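Using the notation above, the three metrics can be sketched in Python. Normalizing the RMSE by the observed range X_max − X_min is an assumption consistent with the listed notation, since the exact normalization convention is not restated here.

```python
import math

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def nrmse(obs, pred):
    """RMSE normalized by the observed range, in percent."""
    return 100.0 * rmse(obs, pred) / (max(obs) - min(obs))

def r_squared(obs, pred):
    """Proportion of variability accounted for by the model."""
    mu = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mu) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot
```

For the dissolution data, where Q spans 0 to 100%, NRMSE expressed this way is directly comparable across formulations.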
3.1.1 Results and discussions. Overall, across all methods and 10-cv (cross-validation) sets, nearly 18,000 models were trained and tested. The feature selection models selected various combinations of features, as presented in Table 2. Fig 2 shows the frequencies of features across all subsets selected by ALO, BALO, GWO, and SSO (orange circles), together with the features obtained by Szlȩk et al. [10] (blue circle). Connections are drawn when a variable is present in at least two subsets of features. NRMSE varied from 31.1% down to 15.9%. The Random Forest algorithm yielded the lowest error; therefore, it was used for selecting the optimal input vector, as in Table 3.
The RF model developed on the nine-input vector, 9(2) (purple circle in Fig 2), selected by the BALO algorithm, yielded one of the lowest NRMSEs. Deep learning neural networks applied to the same vector resulted in an error of 16.87%. The selected inputs are shown in Table 4. The results obtained, 15.97% versus 15.4%, are comparable to those of Szlȩk et al. [10], but the vector of inputs is smaller, nine versus eleven. It is a rule of thumb that increasing the number of independent variables improves the model fit; however, the model might become unnecessarily complex. In this case, we reduced the complexity by removing two inputs (over an 18% reduction in the number of inputs) while the NRMSE rose by only 0.6%. It can be noted from Table 4 that methods such as MARS and CART failed, their lowest NRMSEs being 19.10% and 19.97%, respectively.
Both BALO and the method described by Szlȩk et al. [10] lead to the selection of 'Time (days)' and 'PVA inner phase concentration (%)' as crucial parameters governing the dissolution process. Moreover, as depicted in Fig 2, the variables 'PVA inner phase concentration (%)' (circle labeled 89) and 'Time (days)' (labeled 300) were also selected by most of the algorithms. Not surprisingly, time was the most common variable across all data sets. The time variable determines the dissolution profile both physically and numerically; the latter is due to the structure of the data set, where each dissolution point is represented by a separate record. Despite the high occurrence of variables belonging to formulation characteristics, e.g. 'Encapsulation ratio' (circle labeled 93) or 'Production method' (labeled 101), it is worth pointing out that protein molecular descriptors such as 'Minimal projection radius' (labeled 35) and 'No of aliphatic bonds' (labeled 7) were also present. This may suggest an influence of protein size on its dissolution from PLGA microspheres. Moreover, the BALO algorithm (BALO_9(2)) separated four variables describing the peptide drugs and three describing the excipients. In contrast, Szlȩk et al. [10] pointed to three inputs representing proteins, but none describing excipients (Table 5). Depending on the algorithm used, these chemical descriptors are either included in the feature vector or not; thus their impact on dissolution is not well established. Further enhancement of the database would be required to elucidate these relationships.
The overall RF model performance is an average of the errors of separate formulations. As depicted in Fig 3, the error varies and depends on formulation characteristics. An example of a good prediction, which falls below 5%, is presented in Fig 3-A. In contrast, a failed prediction is shown in Fig 3-B (NRMSE of 20.6%). Therefore, the errors of the 10-cv predictions were closely investigated. The NRMSE of 24 formulations fell below 10%, 19 formulations had an error between 10% and 15%, and the remaining 25 formulations showed an error higher than 15%. This indicates good performance of the model for more than two-thirds of the tested formulations. When the constituents of the microspheres were investigated, it was observed that, of 14 proteins, predictions failed for only four (chymotrypsin, human serum albumin, insulin, L-asparaginase). Therefore, it is strongly recommended not to use the final model to predict dissolution profiles for PLGA microspheres containing those proteins. The high errors for those formulations could be the result of the small number of formulations containing those proteins. On the other hand, the errors obtained for bovine serum albumin, recombinant human erythropoietin, recombinant human epidermal growth factor, lysozyme, recombinant human growth hormone, hen ovalbumin, beta-amyloid, recombinant human erythropoietin coupled with human serum albumin, bovine insulin, and alpha-1 antitrypsin were below 15%, which indicates that most of the dissolution profiles were well predicted. Nevertheless, the results obtained show that random forest is useful for predicting dissolution behavior from PLGA microspheres. Moreover, the results come from a model trained using the enhanced 10-cv technique, which may indicate that the model will be useful even when unknown formulations are to be predicted.

Table 5. Comparison of the features selected by BALO and those obtained by Szlȩk et al. [10]:
- No of heteroaromatic rings (protein descriptor)
- No of heteroaliphatic rings (protein descriptor)
- Lactide to glycolide in polymer ratio (formulation characteristics)
- Time (days) (assay conditions)
- Dissolution pH (assay conditions)
- Production method: 1 = w/o/w, 2 = s/o/w, 3 = s/o/o, 4 = spray-dried (formulation characteristics)

Conclusions
In this paper, we address the PLGA feature selection problem by using bio-inspired optimization algorithms, namely antlion optimization (ALO), a binary version of antlion optimization (BALO), grey wolf optimization (GWO), and social spider optimization (SSO), as well as the LASSO algorithm, to select an optimal feature subset for predicting the dissolution profile of PLGA. Feature selection is cast as a biobjective optimization problem: minimizing the prediction error of the data analysis model while minimizing the number of features used. A set of input features was employed to find the minimum generalization error across different predictive models and their settings/architectures. We evaluated the proposed solution using different predictive modeling algorithms, such as Cubist, random forests, artificial neural networks (monotonic MLP, deep learning MLP), MARS, CART, and hybrid systems of fuzzy logic and evolutionary computation (fugeR). The experimental results were compared with the results of Szlȩk et al. We obtained a normalized root mean square error of 15.97% versus 15.4%, but the number of selected input features was smaller, nine versus eleven.