Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data

In this paper, we propose a new automatic hyperparameter selection approach for determining the optimal network configuration (network structure and hyperparameters) for deep neural networks using particle swarm optimization (PSO) in combination with a steepest gradient descent algorithm. In the proposed approach, network configurations are coded as real-number m-dimensional vectors that serve as the individuals of the PSO algorithm in the search procedure. During the search procedure, the PSO algorithm searches for optimal network configurations via particles moving in a finite search space, and the steepest gradient descent algorithm trains the DNN classifier for a few epochs (to find a locally optimal solution) during the population evaluation of PSO. After the optimization scheme, the steepest gradient descent algorithm is run for more epochs with the final solutions (pbest and gbest) of the PSO algorithm to train a final ensemble model and an individual DNN classifier, respectively. The local search ability of the steepest gradient descent algorithm and the global search capability of the PSO algorithm are exploited to determine a solution that is close to the global optimum. We conducted several experiments on handwritten character and biological activity prediction datasets to show that the DNN classifiers trained with the network configurations expressed by the final solutions of the PSO algorithm, which are used to construct an ensemble model and an individual classifier, outperform the random approach in terms of generalization performance. Therefore, the proposed approach can be regarded as an alternative tool for automatic network structure and parameter selection for deep neural networks.


Introduction
The current growth in Internet information from sources such as Facebook and other well-known business websites, together with developments in computational hardware, is enabling a wide range of researchers to utilize advanced techniques, such as machine learning tools, to capture important information and analyze it to support better business decisions for government and industry. A critical challenge in developing advanced techniques and applying them to more complicated problems is how to select appropriate models that fit the specific data collected from real-world applications well. Although machine learning approaches have been successfully applied in a wide range of pattern recognition and prediction tasks, these techniques still have limitations in processing large-scale and high-dimensional data. In most pattern recognition tasks, the data are collected from a group of digital images or action videos, and each image contains a high-dimensional pixel vector. Conventional machine learning approaches, such as the support vector machine (SVM), have difficulty processing such data due to their shallow structures, which fail to capture high-level abstract feature representations from the high-dimensional raw data. These difficulties and challenges have inspired many research groups and companies to develop more advanced techniques to address such issues. Deep learning techniques are among the most popular and representative of these advanced techniques. In recent years, deep learning techniques have been widely developed and applied in many real-world applications due to their excellent performance on large-scale and high-dimensional datasets.
Deep learning techniques have been successfully applied in domains such as human action recognition [1][2][3][4][5][6][7][8], text processing and applications [9][10][11][12][13][14], medical image processing [15][16][17][18][19][20][21][22][23][24][25][26], and computational biology [27][28][29][30][31][32][33][34][35][36]. Conventional machine learning algorithms usually have some limitations in their ability to address natural data due to the complex structure of their raw form. For decades, constructing a classification model or pattern recognition system usually required careful investigation and considerable background information regarding the optimization problem to design a feature representation tool that transformed the input data (original data, such as the words of one review) into another data representation that is suitable for classifiers to process it and discover the classification pattern between the inputs and outputs. Feature reduction and feature learning are always important issues in the machine learning domain and are usually employed in combination with other machine learning algorithms, such as support vector machine, to provide a significant improvement in performance on recognition tasks compared to the standard machine learning algorithms. In addition, deep learning techniques can be seen as specific feature learning or feature reduction tools with multiple processing layers that abstract the high-level feature representation. A deep learning architecture usually contains one input layer, two or more processing layers, and one output layer. The degree of feature abstraction of the deep learning architecture is dependent on the depth of its neural network and the number of hidden neurons.
However, although a deeper network structure may provide more powerful feature representation capabilities, such a structure is not suitable for datasets with only a few samples and low-dimensional attributes, and the choice of network depth is a trade-off between generalization capability and computational complexity.
In many forms of machine learning, shallow or deep, supervised machine learning is the most popular and frequently used method to construct a classification model or pattern recognition system. The main idea of supervised learning is to update the adjustable network parameters, i.e., weights and biases, by receiving a huge number of real-world samples, such as a car, a pet, a house or a person. To train a network using the supervised learning method, the first step is to collect a large volume of data from appropriate real-world domains, and each sample (instance) must be labeled with its category. During the training phase, these collected data are employed as input data for the classifier and then transformed into outputs through the many non-linear modules that connect the multiple processing layers of a deep network. Generally, we want an instance or a sample that is fed into a classifier to yield the desired output. However, this usually does not occur before training. Training a network requires modifying its adjustable parameters (weights and biases) so that the error between the predicted outputs and the desired outputs is as small as possible. The objective function, which is also called the loss function and is used to measure the training error, usually adopts the square loss measure for regression problems and the log-likelihood loss measure for classification problems. In a typical higher-level neural network using the supervised learning approach, there may be many processing layers with hundreds of millions of these tuning parameters (weights and biases), which are trained with hundreds of millions of samples with genuine labels.
Unsupervised learning is another type of machine learning method and can employ the deep learning concept to construct a higher-level neural network with more powerful capability for feature representation and abstraction. Among the many unsupervised learning methods, the deep belief network (DBN) is one of the most representative. A DBN usually consists of many processing layers, and each processing layer contains multiple parameterized non-linear modules. The DBN has a range of advantageous properties that allow it to efficiently process large-scale and high-dimensional datasets. The main goal of a DBN is to learn a higher-level abstract feature representation that provides good classification performance. After network training is complete, the DBN yields a set of parameter vectors (optimal weights and biases) that are employed as initial values of the network parameters of the multilayer perceptron to train a final model. Initializing a deeper network with the parameters determined by the DBN may give the model more opportunities to find the global optimal solution and to avoid becoming trapped in a local minimum. In addition to the original DBN algorithm, some improved algorithms have been developed [37][38]. The autoencoder is also a well-known unsupervised learning approach that can form a deeper network structure by stacking a group of independent autoencoders. Training a deep autoencoder is similar to training a DBN in that a layer-by-layer search procedure is performed with hundreds of millions of unlabeled samples.
Using machine learning or deep learning approaches to solve a pattern recognition or time series prediction task usually requires constructing an appropriate network structure based on the properties of the dataset and the data representation type. The network structure design requires considering the depth of the neural network and the number of hidden neurons. These network configurations strongly influence the performance of the network. In addition, training a deep learning architecture with the steepest gradient descent algorithm may be affected by the choice of hyperparameter configuration. Generally, proper adjustment of the weights depends on the gradient vectors calculated by the learning algorithm and on the learning rate, which controls the variation amplitude of the parameters. Therefore, selecting an appropriate learning rate and other hyperparameters plays an important role in the network training phase and in the final constructed model. These hyperparameter configurations cannot be optimized by the steepest gradient descent algorithm. In addition, finding an optimal set of values for the hyperparameter configuration is a challenging task due to the large number of optimization variables and the complexity of the problem. However, in recent years, researchers have conducted a large number of experiments to find various rules of thumb for the choice of hyperparameter configurations. These useful tricks can help improve the performance of deep learning approaches. For more details, see [39][40][41][42], which provide a variety of practical tricks for selecting the appropriate hyperparameter configuration. There are several other crucial hyperparameters that affect network training: dropout rate, momentum, decay, and the number of hidden neurons.
In addition, the weights are usually initialized with a randomly generated real-number vector whose entries are small enough, and close enough to zero, to yield the largest gradients during the early training phase. The learning rate is a much-discussed topic in the machine learning community because it plays a more important role in network training than other hyperparameters. A simple solution for choosing a fixed learning rate is to use a grid search approach in combination with a network training procedure using steepest gradient descent algorithms to determine an optimal learning rate from several candidate log-spaced values ($10^{-1}, 10^{-2}, \ldots$), based on the training error. In addition, a classical MLP classifier using the traditional steepest gradient descent algorithms, such as a back-propagation algorithm for training a deep learning architecture, has several limitations, i.e., the algorithm cannot be guaranteed to find the global optimum and its search procedure is easily trapped in local optima. Therefore, to overcome these difficulties in using steepest gradient algorithms to train a higher-level neural network, the Stochastic Gradient Descent (SGD) algorithm has been widely adopted [43][44][45][46][47][48]. The SGD algorithm has several advantages and therefore provides an efficient and practical solution for training a deep learning architecture. However, the SGD algorithm can only optimize an objective function based on gradient information, and it is not able to optimize the hyperparameter configuration of deep neural networks. To solve such parameter estimation problems, many population-based stochastic search algorithms have been employed, including genetic algorithms (GAs) [49], particle swarm optimization (PSO) [50], differential evolution (DE) [51], fruit fly optimization (FOA) [52], and ant colony (AC) optimization [53].
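The grid search over log-spaced learning rates described above can be sketched as follows. This is only an illustration under stated assumptions: the toy loss (w - 3)^2, the function name `train_error`, and the number of steps are hypothetical stand-ins for training a real network and measuring its training error.

```python
def train_error(learning_rate, steps=50):
    """Run plain gradient descent on the toy loss (w - 3)^2; return the final loss.

    The toy loss is an assumption standing in for a network's training error.
    """
    w = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)      # derivative of (w - 3)^2
        w -= learning_rate * grad   # gradient descent update
    return (w - 3.0) ** 2


# Candidate log-spaced learning rates: 1e-1, 1e-2, 1e-3, 1e-4.
candidates = [10.0 ** (-k) for k in range(1, 5)]

# Evaluate every candidate and keep the one with the lowest training error.
errors = {lr: train_error(lr) for lr in candidates}
best_lr = min(errors, key=errors.get)
```

The cost of this procedure grows with the number of candidates and with every additional hyperparameter added to the grid, which is one motivation for population-based search.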
Particle swarm optimization (PSO) is a typical swarm optimization algorithm and has shown impressive search performance for parameter optimization on a broad range of real-world applications. For example, Gaing Z L et al. [54] employed particle swarm optimization to determine the optimal proportional-integral-derivative (PID) controller parameters of an automatic voltage regulator (AVR) system. In this work, the PSO algorithm has been demonstrated as a more efficient and robust tool for improving the step response of an AVR system. Park J B et al. [55] presented a new approach based on the PSO algorithm for solving economic dispatch (ED) problems with nonsmooth cost functions; in this work, a modified PSO mechanism was proposed to address ED problems, and the experimental results demonstrated the superiority of the modified PSO algorithm compared to other evolutionary algorithms. Esmin A et al. [56] employed the PSO algorithm for a loss reduction study; in this work, the PSO algorithm was demonstrated to have promising results when applied to an IEEE-118-bus system. Ting T et al. [57] proposed a new hybrid particle swarm optimization (HPSO) for solving the unit commitment (UC) problem. The proposed hybrid algorithm used binary PSO and real-coded PSO to respectively process the UC problem and the economic load dispatch problem simultaneously. Ishaque K et al. [58] developed an improved maximum power point tracking (MPPT) approach based on a modified PSO algorithm for photovoltaic (PV) systems; this method can reduce the steady-state oscillation once the maximum power point (MPP) is located, which shows promising results compared with other existing methods. As shown by the above PSO-related work, the PSO algorithm has been successfully applied in a wide range of domains, such as the parameter estimation of control systems, economic problems, and other real-world applications.
In this study, the PSO algorithm is presented as an ideal option for finding the hyperparameter configurations of deep learning architectures since its properties allow the particles to preserve the best previous experiences (important information regarding hyperparameter configurations) over generations. In this work, we propose an efficient approach using particle swarm optimization (PSO) in combination with the steepest gradient descent algorithm to determine the optimal network structure and hyperparameter configuration before training the final model. The main contributions of this study are summarized as the five aspects below:
1. In this study, we propose an efficient approach that utilizes the advantages of the global and local exploration capabilities of the PSO algorithm and steepest gradient descent algorithm to automatically discover a more appropriate network structure with a better hyperparameter configuration for final neural network training.
2. In the proposed approach, the four crucial hyperparameters (learning rate, dropout rate, momentum, and weight decay) and the number of neurons in each hidden layer are considered for optimization. We design a simple parameter representation that encodes the network configuration (network structure and hyperparameters) as a real-number vector, which serves as an individual of PSO in the search process, so that real numbers can be efficiently processed.
3. The proposed approach can provide a flexible method to construct an ensemble model and a well-performing DNN classifier by using the final solutions of the PSO algorithm to initialize DNN classifiers and train them with their corresponding hyperparameter configurations on the entire training data. Specifically, the local best (pbest) and global best (gbest) solutions of the PSO algorithm are employed to construct the ensemble model and the individual DNN classifier, which maximize the generalization capability and efficiency, respectively. In addition, the flexibility of the proposed approach allows for any number of classifiers to be combined to form an ensemble based on their scores, which are the last training accuracies during the training phase. This process directly chooses a certain number of DNN classifiers with the highest scores from the candidate classifiers trained by the network configurations expressed by the solutions (pbest) of the PSO algorithm without training any new DNN classifiers.
4. In this study, we have evaluated the performance of the ensemble model and the individual DNN classifier that are generated by the proposed approach. The empirical results demonstrate that the proposed approach of using PSO in combination with steepest gradient descent algorithms can maximize their local and global exploration capabilities and find optimal solutions that lead to better performance for both network training and the final models.
5. This study has investigated the influence of various ensemble models with different numbers of classifiers and depths of the neural networks.
The rest of this paper is organized as follows. A brief introduction to artificial neural networks is presented in section 2. A detailed description of our proposed approach using PSO in combination with steepest gradient algorithms to optimize deep learning architectures is presented in section 3. The detailed experimental results of using the proposed approach with various parameter configurations are reported in section 4, where we also investigate the effects of the ensemble model with different network depths and different numbers of combined classifiers. Finally, the conclusions are presented in section 5, where we also discuss future directions.

Background materials
In this section, we provide the details of the deep learning architectures and hyperparameter configuration. The contents of this section are organized as follows. Subsection 2.1 presents a brief overview of deep learning architectures. Subsection 2.2 describes the details of the Stochastic Gradient Descent (SGD) algorithm and network training.

A brief overview of deep learning architectures
In this subsection, we describe the MLP classifier in detail. MLP is one of the most popular and classical machine learning approaches and is inspired by the neurotransmissions of the human brain. MLP can be regarded as an artificial neural network (ANN) with many hidden layers and neurons, and it has been successfully applied in computer vision and other real-world applications. A classical ANN classifier is generally composed of three connected layers (one input layer, one hidden layer, and one output layer). The number of neurons of the input layer is fixed according to the size of the actual input data, and the number of neurons in the output layer matches the size of the actual outputs. Many hidden layers with hidden neurons can construct a flexible neural network, but a more complicated network with a large number of hidden layers and many neurons usually requires a large number of training samples and more computational energy and time for training. When a new sample is fed into a network, the input layer first receives the original data and forms a weighted sum of activations with respect to a hidden neuron; the sum is then converted to the hidden neuron's output activation by a nonlinear function, which is defined as follows:

$$o = \sum_{i=1}^{n} W_{ij} I_i + b_j, \qquad f_j(o) = \tanh(o) \tag{1}$$

In Eq (1), o denotes the sum of the input data with respect to the weights and bias of the j-th neuron, b_j denotes the bias of the j-th neuron, f_j(.) is the hyperbolic tangent function that calculates the activation value of the j-th neuron, I = (I_1, I_2, ..., I_n) is the input data of a single sample, and W_j = (W_1j, W_2j, ..., W_nj) is the weight vector of the j-th neuron of the hidden layer. A simple MLP learning architecture generally consists of three types of connected layers: one input layer, one or more hidden layers, and one output layer.
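The activation of a hidden neuron described above can be sketched in a few lines. This is a minimal illustration, assuming the usual form of a weighted sum plus a bias passed through the hyperbolic tangent; the function names and the explicit `bias` argument are hypothetical and not taken from the source.

```python
import math


def hidden_activation(inputs, weights, bias):
    """Return tanh of the weighted input sum plus bias for one hidden neuron."""
    o = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(o)


def layer_forward(inputs, weight_vectors, biases):
    """Apply hidden_activation to every neuron j of one hidden layer.

    weight_vectors[j] holds the weights (W_1j, ..., W_nj) of neuron j.
    """
    return [hidden_activation(inputs, w_j, b_j)
            for w_j, b_j in zip(weight_vectors, biases)]
```

For example, `layer_forward([1.0, 2.0], [[0.5, -0.25], [0.0, 0.0]], [0.0, 0.0])` evaluates two hidden neurons on a two-feature sample.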
To address a regression problem using the MLP, the most popular performance criterion (also called the cost) is the mean square error (MSE), which is defined as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 \tag{2}$$

where $Y_i$ and $\hat{Y}_i$ denote the actual output value and the predicted output value of the i-th instance, respectively, and n denotes the number of instances. To train an MLP classifier, the most frequently used learning technique is the back-propagation algorithm, which iteratively updates the parameters (weights and biases) in the direction of the negative gradient to find the optimal parameters. To address a classification problem using an MLP classifier, the common performance criterion is the mean squared error, which is shown as follows:

$$E = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 \tag{3}$$

where E is the cost measure between the actual labels and the outputs, which is employed to calculate the gradient; $Y_i$ and $\hat{Y}_i$ denote the actual label and the predicted label of the i-th instance, respectively; and n is the number of instances.

A brief overview of SGD and network training
In most cases, the learning architectures considered here are used to solve classification problems. During the training phase with the supervised learning method, a DNN classifier generally requires an objective function to evaluate the training error. A common function is the zero-one loss, which counts the number of errors on the training samples. It has a simple form:

$$g_{\mathrm{loss}} = \sum_{i=0}^{n} \mathbb{1}\left[ f(x_i) \neq y_i \right]$$

where $f(x_i)$ is the label predicted for sample $x_i$ and $y_i$ is its actual label. Because the zero-one loss function is not differentiable, optimizing it is prohibitively expensive when a large model with a huge number of parameters (weights and biases) is being trained. Therefore, the log-likelihood form is used instead:

$$g_{\mathrm{loss}} = \sum_{i=0}^{n} \log P(Y = y_i \mid x_i; \theta)$$

where $\theta$ denotes the model parameters. Generally, we aim to minimize a loss function rather than maximize it, so we use the negative log-likelihood as the loss function in the training of a deep learning architecture. The negative log-likelihood function is differentiable, which means that the gradient of the loss function over the training data can be used as a signal in supervised learning of a DNN classifier. During the training phase, gradient descent algorithms are employed to adjust the parameters by making small steps that reduce the loss. The ordinary gradient descent algorithm has a simple form, but it usually performs well when training a neural network. The stochastic gradient descent (SGD) algorithm follows the same principles as the original gradient descent, but it gains efficiency by calculating the gradient from just a few samples at a time instead of the entire training data.
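The two losses discussed above can be compared in a short sketch. It assumes that the class probabilities come from a softmax over raw class scores, which the text does not specify; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (assumed output layer)."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_one_loss(predicted_labels, true_labels):
    """Count misclassified samples; not differentiable in the parameters."""
    return sum(1 for p, y in zip(predicted_labels, true_labels) if p != y)


def negative_log_likelihood(score_rows, true_labels):
    """Sum of -log P(Y = y_i | x_i) over samples; differentiable in the scores."""
    nll = 0.0
    for scores, y in zip(score_rows, true_labels):
        nll -= math.log(softmax(scores)[y])
    return nll
```

Unlike the zero-one loss, the negative log-likelihood changes smoothly as the scores change, which is what makes gradient-based training possible.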
In addition, a common variant of the stochastic gradient descent algorithm adopts the "minibatches" concept. SGD using the mini-batch method is similar to the original SGD. The difference is that the mini-batch technique can help to reduce the variance in the estimate of the gradient and can work better with the hierarchical memory organization of powerful computers. In addition, when we train a deep neural network, the training process may overfit the training data. Regularization is used to combat overfitting, and several techniques have been proposed. L1 and L2 regularization are common approaches that add an extra term to the loss function for the purpose of penalizing certain parameter configurations. Another approach is called "early stopping," which addresses overfitting by monitoring the classifier's performance on an independent validation set.
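Mini-batch SGD with an L2 penalty and early stopping can be sketched on a toy problem. Everything concrete here is an assumption for illustration: a one-parameter linear model y = w x with squared error, the function name, and the default values of the learning rate, penalty weight, batch size, and patience.

```python
import random


def minibatch_sgd(train, valid, lr=0.05, l2=1e-4, batch_size=4,
                  max_epochs=200, patience=5, seed=0):
    """Fit y = w * x by mini-batch SGD; return the best (w, validation loss)."""
    rng = random.Random(seed)
    data = list(train)
    w = 0.0
    best_w, best_valid, bad_epochs = w, float("inf"), 0
    for _ in range(max_epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the batch mean squared error ...
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            # ... plus the gradient of the L2 penalty l2 * w^2.
            grad += 2.0 * l2 * w
            w -= lr * grad
        # Early stopping: monitor the loss on an independent validation set.
        valid_loss = sum((w * x - y) ** 2 for x, y in valid) / len(valid)
        if valid_loss < best_valid - 1e-12:
            best_w, best_valid, bad_epochs = w, valid_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_w, best_valid
```

The function returns the parameter from the epoch with the lowest validation loss rather than the last one, which is the usual way early stopping is applied.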

Methodology
In this study, we present a new deep learning approach that automatically determines one or more good hyperparameter configurations of the neural network to further improve the classification performance and generalization capabilities on various pattern recognition and regression tasks. Because the network structure (the numbers of processing layers and neurons) and the hyperparameter configuration play important roles in the training phase of deep learning architectures, training a network with different network structures and hyperparameter configurations generally yields models with different performances on pattern recognition tasks. Finding an appropriate network structure and hyperparameter configuration for a deep learning architecture is a difficult challenge due to the complexity of deeper networks and the high dimensionality of the optimization parameters. In addition, an individual classifier (network) employed to predict results on a large-scale and high-dimensional dataset has several limitations, including weak generalization ability and instability in the training phase. To solve the above issues and further improve the performance of deep learning architectures in pattern recognition and regression tasks, the proposed approach aims to construct a robust and efficient model by using an appropriate network structure and hyperparameter configuration determined by an efficient optimization scheme using particle swarm optimization (PSO) and steepest gradient descent algorithms. During the optimization scheme, the global search capability of PSO and the local search capability of steepest gradient descent algorithms together provide a powerful and efficient search process for finding the best network structure and hyperparameter configuration. Fig 1 shows the basic framework of the proposed approach. As shown in the figure, the framework consists of three independent modules.
The first module is called the basic element model and defines the various network structures and optimization parameters with their search domains. The second module is called the generator and is mainly responsible for generating various deep learning architectures depending on different network configurations (numbers of processing layers and neurons and network type). The third module is called the optimization scheme and implements the cyclic process to determine the final model by using PSO in combination with steepest gradient descent algorithms determining the best solution (best network structure and hyperparameter configuration). The remainder of this section is organized as follows. Subsection 3.1 presents the detailed basic element model and generator. Subsection 3.2 provides a brief overview of the PSO algorithm, and we also provide a coding design for hyperparameter representation. Subsection 3.3 describes the details of the ensemble model with multiple combined classifiers. Subsection 3.4 presents an efficient and robust optimization scheme using PSO and steepest gradient descent algorithms.

The basic element model and generator
The basic network configuration comprises a neural network structure and a hyperparameter configuration. In the proposed approach, the basic deep learning architecture adopts the classical multilayer perceptron (MLP), which generally consists of one input layer, two or more processing layers, and one output layer. The number of neurons in the input layer equals the size of the input features, the number of neurons in each processing layer can generally be any number, and the number of neurons in the output layer equals the number of categories for a classification model and 1 for a regression model. The proposed approach thus provides a flexible way to initialize the network for classification and regression problems. In addition, the prediction performance of the generated deep learning architecture depends heavily on the network training and therefore on the hyperparameter configuration. The choice of hyperparameter configuration directly influences the performance of the steepest gradient descent algorithm in the training phase of the neural network, which is a very important issue in our study. The difference between training a deep learning architecture and a shallow learning architecture is that a deep learning architecture requires the initialization of more hyperparameters before training. The neural network structure and hyperparameter configuration are usually not fixed because these configurations depend on the properties of the datasets. We could select more hyperparameters as optimization parameters to be searched using optimization algorithms to derive a well-generalized model, but the complexity of the parameter searching process would be considerable. Conversely, selecting only a few hyperparameters as optimization parameters may result in a model that yields only slight improvements in performance compared to the normal method. Therefore, the proposed approach selects only several crucial hyperparameters as optimization parameters, chosen for their impact on network training.
These important optimization parameters are learning rate, dropout rate, decay, momentum, and the numbers of processing layers and neurons, which play an important role in the training phase. In addition, searching for these optimization parameters employed to construct a deep learning architecture requires an appropriate search range for each optimization parameter, which aims to avoid the occurrence of wrong numbers and ensure that each searched parameter is always reasonable and correct. Moreover, the search ranges for optimization parameters can also help to reduce the computational time of the search procedure because we can set a small search range for each optimization parameter so that the PSO algorithm only requires a few search iterations to determine the optimal solution. The values of these search ranges of optimization parameters are dependent on the properties of the dataset (the size of input features and number of training samples) and optimization tasks (classification or regression problems). The generator is employed to generate a network configuration containing a network structure and a hyperparameter configuration in a random manner. In this generated model, each of its hyperparameters is given by a random value corresponding to the parameter domain.
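The generator and the real-vector representation of a network configuration can be sketched as follows. The search ranges, the parameter names, and the choice of two hidden layers are hypothetical values chosen for illustration; the text states only that each parameter has a search range that depends on the dataset and the task.

```python
import random

# Hypothetical search ranges (lower, upper) for each optimization parameter.
SEARCH_RANGES = {
    "learning_rate": (1e-4, 1e-1),
    "dropout_rate": (0.0, 0.5),
    "momentum": (0.5, 0.99),
    "weight_decay": (1e-6, 1e-2),
    "hidden_1": (16, 512),   # neurons in the first hidden layer
    "hidden_2": (16, 512),   # neurons in the second hidden layer
}


def random_vector(rng):
    """Generate one random real-number vector inside the search ranges."""
    return [rng.uniform(lo, hi) for lo, hi in SEARCH_RANGES.values()]


def decode(vector):
    """Map a real-number vector back to a usable network configuration."""
    config = {}
    for (name, (lo, hi)), value in zip(SEARCH_RANGES.items(), vector):
        value = min(max(value, lo), hi)   # clamp so every value stays valid
        # Neuron counts must be integers; the other parameters stay real-valued.
        config[name] = int(round(value)) if name.startswith("hidden_") else value
    return config
```

Calling `decode(random_vector(random.Random(0)))` yields one randomly generated configuration. Clamping in `decode` is one simple way to keep every searched value within its range.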

A brief overview of PSO and its coding design for hyperparameter representation
In this section, we present a brief overview of the particle swarm optimization algorithm and a simple coding design for representing the network structure and hyperparameter configuration. In recent years, a variety of population-based intelligent algorithms inspired by biological mechanisms have been developed to solve complex problems and have been successfully applied in real application tasks, including medicine, engineering, computer science, and finance problems. Among these intelligent algorithms, swarm intelligence can be considered one type of artificial intelligence technique that was inspired by the natural phenomenon of a flock of birds searching for food sources by changing their locations based on their former position and the swarm position. Particle swarm optimization (PSO) is one of these swarm techniques and was first introduced by Kennedy J, Eberhart R in 1995 [1] [59]. Particle swarm optimization is similar to other population-based meta-heuristic optimization techniques in that it first initializes a group of individuals as a population and then updates the information (state) of these individuals by an evolution process. The advantages of the PSO algorithm compared with other swarm intelligence algorithms are that it has a simple and efficient search process, is easy to implement, and can efficiently find solutions that are close to the global optimum.
Particle swarm optimization employs particles as population members, and each particle (individual) is expressed by an m-dimensional real-number vector. During the evolution process of the PSO algorithm, each particle of a particle swarm (population) is considered to be a representation of a possible solution in a finite, m-dimensional search space. PSO first initializes a group of particles as a population in a random manner. After the initialization of the PSO population, an evolution procedure is performed for a certain number of generations. During each generation, each particle (individual) searches for a better solution by adjusting its velocity and position according to two crucial factors: its own best previous experience (pbest) and the best previous experience of all particles in the swarm (gbest). The details of the velocity and position update of an individual can be seen in Eqs (4) and (5), where t and t + 1 denote the generations (iterations), d denotes the number of dimensions of the particle, $X_t(i)$ denotes the position in the i-th dimension of the particle at generation t, and $V_{t+1}(i)$ denotes the velocity in the i-th dimension of the particle at generation t + 1. $R_1$ and $R_2$ are randomly generated values in the domain of [0, 1]. $W$ denotes an inertia weight that was first proposed by Shi and Eberhart [1] [57]. $C_1$ and $C_2$ are positive acceleration coefficients, which are also called the cognitive and social parameters due to their role in the algorithm evolution procedure. These two parameters control the balance between an individual's self-learning and its learning from the entire PSO population.
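For reference, the inertia-weight PSO update is commonly written as follows, using the symbols defined above. The names $pbest_t(i)$ and $gbest_t(i)$ for the i-th components of the individual and swarm best positions are notation assumed here, and the exact form of Eqs (4) and (5) may differ.

```latex
\begin{align}
V_{t+1}(i) &= W\,V_t(i) + C_1 R_1 \bigl(pbest_t(i) - X_t(i)\bigr) + C_2 R_2 \bigl(gbest_t(i) - X_t(i)\bigr), \\
X_{t+1}(i) &= X_t(i) + V_{t+1}(i), \qquad i = 1, \ldots, d.
\end{align}
```

The first term carries over the previous velocity, the second pulls the particle toward its own best position, and the third pulls it toward the best position found by the swarm.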
To further improve the balance between local exploitation and global exploration, we use time-varying acceleration coefficients (TVAC) [60,61] and a time-varying inertial weight (TVIW) [60,61,62], whose effectiveness has been verified. These two approaches dynamically update the acceleration coefficients and the inertial weight during the iterations and can help the original PSO algorithm perform better in locating the region of the global solution and in avoiding becoming trapped in local minima [60,61,62].
When using the TVAC approach, the acceleration coefficients $C_1$ and $C_2$ are adjusted based on their initial values ($C_{1i}$, $C_{2i}$), their final values ($C_{1f}$, $C_{2f}$), and the current iteration. The details of the acceleration coefficient update process are shown in Eq (6), where $t$ and $t_{max}$ denote the current generation (iteration) and the maximum number of generations, respectively. In addition, the TVIW approach is employed to change the inertial weight during the evolution process. The details of the update of the inertial weight $W$ are seen in Eq (7), where $W_{max}$ and $W_{min}$ denote the maximum and minimum values of the inertial weight. The TVIW approach can efficiently balance the global exploration and local exploitation of the PSO algorithm: a large inertial weight $W$ gives the PSO algorithm better global search capability at the beginning of the procedure, and the local search ability of the PSO algorithm gradually increases as $W$ is decreased linearly during the evolution procedure.
In the above equations, the bounds of the inertial weight, $W_{max}$ and $W_{min}$, are usually set to constant values. The initial and final values of the acceleration coefficients, $C_{1i}$, $C_{1f}$, $C_{2i}$, and $C_{2f}$, are also set to constant values.
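The schedules and the resulting particle update can be sketched in a few lines of Python. This is a minimal illustration that assumes the usual linear TVAC and TVIW forms and uses the constants reported later for the MNIST experiment (2.5, 0.5, 0.5, 2.5 for the acceleration coefficients and 0.4, 0.9 for the inertial weight). The function names, and the clipping of positions to the search range, are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

def tvac(t, t_max, c_i, c_f):
    """Move an acceleration coefficient linearly from c_i (t = 0) to c_f (t = t_max)."""
    return (c_f - c_i) * t / t_max + c_i

def tviw(t, t_max, w_min=0.4, w_max=0.9):
    """Decrease the inertial weight linearly from w_max (t = 0) to w_min (t = t_max)."""
    return (w_max - w_min) * (t_max - t) / t_max + w_min

def pso_step(x, v, pbest, gbest, t, t_max, lower, upper, rng,
             c1=(2.5, 0.5), c2=(0.5, 2.5)):
    """One velocity and position update for the whole swarm.

    x, v, pbest have shape (n_particles, d); gbest, lower, upper have shape (d,).
    """
    w = tviw(t, t_max)
    c1_t = tvac(t, t_max, *c1)   # cognitive coefficient, shrinks over time
    c2_t = tvac(t, t_max, *c2)   # social coefficient, grows over time
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = w * v + c1_t * r1 * (pbest - x) + c2_t * r2 * (gbest - x)
    # Assumption: keep particles inside the finite search space by clipping.
    x_new = np.clip(x + v_new, lower, upper)
    return x_new, v_new
```

With these settings the swarm starts with a large inertial weight and a dominant cognitive term, which favors exploration, and ends with a small inertial weight and a dominant social term, which favors convergence around gbest.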

Combining the evidence of multiple classifiers
In this subsection, we present an efficient and configurable ensemble model whose members (sub-classifiers) can be constructed by any deep learning architecture. Combining the evidence of multiple DNN classifiers may provide better generalization performance than an individual DNN classifier but requires more computational time to carry out all of the training processes. We therefore propose an efficient and flexible approach to build an ensemble model that aims to minimize the model complexity without deteriorating the generalization capability. To construct such an ensemble model, we directly choose a certain number of DNN classifiers with the best scores (accuracies calculated on an independent validation dataset using the trained DNN classifiers) from the entire set of DNN classifiers that are initialized and trained by the final solutions (pbest) of the PSO algorithm, without training any new DNN classifiers. Let $C = \{C_1, C_2, C_3, \ldots\}$ be a set of DNN classifiers that were trained with the optimal hyperparameter configurations using the steepest gradient descent algorithm on the entire training dataset. Let $S = \{s_1, s_2, s_3, \ldots\}$ be the set of scores corresponding to the set of DNN classifiers $C$. Let $E = \{e_1, e_2, e_3, \ldots, e_h\}$ be a subset of $C$ whose members are selected based on their scores, where $h$ is a threshold value employed to control the number of members of the ensemble model. After an ensemble is generated, a fusion function with a majority vote rule is employed to calculate the final output. The detailed calculation is presented as follows.
Let $\{\ldots, o^{i}_{c}\}$ be the binary-coded output for the i-th sample of a DNN classifier; then, the classifier was trained with (…), where $a$ denotes the number of categories. During the prediction phase, an observation will be classified as category $t$ when $o^{i}_{t} > o^{i}_{j}$ for all $j \neq t$, $1 \le j \le a$. To predict an output using the ensemble model, the combination of evidence over all classifiers uses a fusion function with a majority vote rule that calculates the output based on the decision made by most of the members. We define a function $f_j = \max\{o^{j}_{1}, o^{j}_{2}, \ldots, o^{j}_{a}\}$ to calculate the output of the j-th classifier. An observation will be classified into category $t$ when $f_t > f_j$ for all $j \neq t$ and $1 \le j \le a$. Combining evidence from multiple DNN classifiers generally results in a well-performing model in which the misclassifications made by a few members are outvoted by the correct predictions of the majority.
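The selection and fusion steps can be illustrated with a short sketch. It assumes that each trained classifier returns one score per category for every sample, that members are chosen as the h classifiers with the highest validation scores, and that ties in the vote go to the lowest category index. The function names are hypothetical, and the tie-breaking rule is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

def select_members(scores, h):
    """Return the indices of the h classifiers with the highest validation scores."""
    order = np.argsort(scores)[::-1]   # best score first
    return order[:h]

def majority_vote(outputs):
    """Fuse class scores of shape (n_classifiers, n_samples, n_classes) by majority vote."""
    votes = outputs.argmax(axis=2)     # category chosen by each member for each sample
    n_samples, n_classes = outputs.shape[1], outputs.shape[2]
    counts = np.zeros((n_samples, n_classes), dtype=int)
    for member_votes in votes:
        counts[np.arange(n_samples), member_votes] += 1
    # Assumption: argmax resolves ties in favor of the lowest category index.
    return counts.argmax(axis=1)
```

For example, with scores `[0.91, 0.97, 0.88, 0.95]` and `h = 2`, `select_members` keeps classifiers 1 and 3, and `majority_vote` is then applied only to the outputs of those members.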

The optimization scheme using PSO and steepest gradient descent algorithms
In the previous subsections, we have described the details of the search procedure and the coding design of the hyperparameter configuration of the PSO algorithm. In this subsection, we present an efficient and robust optimization scheme based on a hybrid search approach using PSO and steepest gradient descent algorithms. The optimization goal of the proposed scheme is to automatically determine one or more suitable network configurations (network structures and hyperparameter configurations) before using the deep learning architecture in applications. Because the performance and generalization capability of a deep learning architecture depend strongly on the network structure and hyperparameter configuration used during the training phase, the choice of network structures and hyperparameter configurations employed to train a deeper network plays an important role in our proposed approach. In addition, our proposed approach provides a configurable and flexible method for implementing a parameter-searching process through a set of interfaces. Any optimization algorithm can be fitted into these interfaces to implement its population initialization, population evaluation, and location updating. Fig 2 displays the details of this parameter-searching process; as shown in the graph, the entire search process is comprised of 3 independent interfaces: the parameter initialization interface, which is responsible for the population or parameter initialization of the algorithm; the evaluation interface, which is mainly responsible for calculating the scores of multiple network configurations; and the update interface, which notifies the algorithm to update its state or locations after the evaluation interface has been utilized.
In this manner, the particles (individuals) of the PSO algorithm can convert their information into network configurations to obtain the scores that are employed to evaluate the population. These operating interfaces are designed to correspond to the primary procedure of the optimization algorithm, and in the main search process they are performed in sequence until the procedure terminates. Therefore, other advanced population-based stochastic search algorithms can also be used to implement this optimization scheme. The main reasons for using the PSO algorithm rather than other optimization algorithms are its powerful search capability for determining the global optimum and its convenient representation of continuous variables. Furthermore, the advantages of global and local search capabilities can be exploited by a hybrid approach using PSO in combination with the steepest gradient descent algorithm. The main idea of the hybrid approach is that the PSO algorithm searches for a neural network structure and hyperparameter configuration by adjusting the locations of the particles; each particle (individual), representing a prototype, is converted into a network configuration that is used to initialize a classifier, which is then trained with a few steps of mini-batch learning using the steepest gradient descent algorithm. After the network training, the last training loss value or the accuracy on the independent validation set is employed as the score (fitness value) of the individual.
The basic process of the optimization scheme using PSO and the steepest gradient descent algorithm is summarized by five independent steps as follows:
(1) Initializing parameters for PSO and the optimization scheme: Similar to other population-based stochastic search algorithms, the PSO algorithm requires initialization of the population size, the maximum number of iterations, the acceleration coefficients $C_{1i}$, $C_{1f}$, $C_{2i}$, $C_{2f}$, and the inertial weights $W_{min}$, $W_{max}$. To initialize parameters for the optimization scheme, the domains of the optimization parameters must be initialized. These optimization parameters are the learning rate, dropout rate, momentum, decay, and the numbers of neurons of all processing layers. Lower and upper bounds define the domain of each optimization parameter, which ensures that each optimization parameter is searched for in its corresponding search range. In addition, the numbers of epochs for the validation phase and the final training phase are usually set to small and large values, respectively. The small value reduces the computational time spent evaluating the PSO individuals (particles) representing classifiers, and the large value ensures that the deep learning architecture fits the training samples before it is used for any pattern recognition task. The next step is population initialization of the PSO algorithm.
(2) Population initialization of the PSO algorithm: In the proposed optimization scheme, each particle (individual) of the PSO algorithm is used to represent a deep learning architecture with a hyperparameter configuration. To initialize the population of the PSO algorithm, each dimension of a particle denotes an individual optimization parameter of the network configuration and is generated as a random real number in its corresponding parameter domain. After the population is initialized, we obtain a set of candidate network structures and their corresponding hyperparameter configurations, which are expressed by the population. Therefore, we can evaluate the population of the PSO algorithm by training these deep learning architectures with their corresponding hyperparameters and then validating them on the independent validation set. The details of these training and validating processes are presented as follows:
(3) Population evaluation using the steepest gradient descent algorithm: This step is mainly responsible for evaluating the population of the PSO algorithm by training and validating the neural networks using the steepest gradient descent algorithm. In the evaluation phase, each particle (individual) can be viewed as a representation of a potential solution and thus is transformed into a network configuration. Then, the generator initializes a deep neural network according to the network configuration. After all the neural networks are initialized, these DNN classifiers are trained using the steepest gradient descent algorithm with a few steps of mini-batch learning on an independent training subset that is randomly drawn from the entire training dataset according to a predefined threshold value. After all DNN classifiers are trained, the scores of the individuals are calculated by evaluating the predictions of these trained DNN classifiers on an independent validation dataset.
The details of the training and validating processes using the steepest gradient descent algorithm are displayed in Fig 3 and can be described as follows. Let $D$ denote the entire training dataset, which is randomly divided into two independent sets $D = \{Tr, Te\}$ by a threshold $\lambda$, where $Tr$ and $Te$ denote the independent training and validation sets, respectively, and $\lambda$ controls the size of the training set and is usually set to 0.8. Let $P = (P_1, P_2, \ldots, P_n)$ denote the population of the PSO algorithm and $P_i = (C_1, C_2, \ldots, C_m)$ denote the i-th particle (individual), consisting of m optimization parameters. After initializing the population, each particle $P_i$ is employed to construct a deep learning architecture and its corresponding hyperparameter configuration, which are denoted as $P_i = \{Net, C\}$, where $Net$ and $C$ denote the deep learning architecture and hyperparameter configuration, respectively. Then, the deep learning architecture $Net$ is trained with the mini-batch learning method using the training parameters $C$ on the independent training set $Tr$ for several epochs. After network training, we calculate either the accuracy on the independent validation set $Te$ or the last loss value of the training as the score (fitness value) of the individual $P_i$ of the PSO algorithm. After calculating the scores of the individuals, the local best experience (score and location) of each individual is determined by comparing the current score of the individual $P_i$ with the best score from its previous experiences in past generations; the previous local best experience is replaced by the current one if the current score is larger than the previous local best score.
Similar to the local best experience, the global best experience is determined by comparing the current score of the individual $P_i$ with the best score from the previous experiences of all individuals in past generations; the previous global best experience is replaced by the current one if the current score is larger than the previous global best score. After determining the local and global best experiences (scores and locations), Eqs (4) and (5) are employed to update the velocities and locations of the particles of the PSO algorithm. The next step is checking the algorithm termination as follows:
(4) Checking algorithm termination: This step is mainly responsible for checking whether the algorithm should terminate (whether the number of iterations has reached the maximum number of iterations). If the condition is satisfied, then the algorithm stops and we go to step (5); otherwise, we return to step (3).
(5) Training a final model and evaluation: After the algorithm search procedure, we obtain a set of local best solutions (pbest) and the global best solution (gbest), and we use them to initialize a set of optimal classifiers and an individual classifier, respectively. Then, each classifier is trained with a certain number of mini-batch learning iterations on all of the training samples. During the training phase, each classifier has a corresponding hyperparameter configuration that is employed in the steepest gradient descent algorithm to adjust the network parameters (weights and biases) so that the training error or loss value between the input pattern and output pattern is as small as possible. After all classifiers have been trained, we obtain an individual trained classifier that is initialized by the global best solution (gbest) of the PSO algorithm and a set of optimal trained classifiers that are initialized by the local best solutions (pbest) of the PSO algorithm. In the performance evaluation phase, we evaluate the performance of the proposed approach by predicting all testing samples using both a combined model (ensemble model) of multiple classifiers with a majority vote strategy and the individual classifier. In addition, the proposed approach provides a flexible way to construct an ensemble model with a given number of members. The main idea is to directly select classifiers from the generated classifiers that have been initialized by the local best solutions (pbest) and trained on all of the training samples, without training new classifiers. As a result, a light ensemble model with a small number of classifiers can be constructed by selecting a few optimal classifiers whose final training loss values, or accuracies on the independent validation set, are superior to those of the remaining classifiers.
In this manner, a light ensemble model can reduce the computational time in predicting a huge number of samples and may maintain the original classification performance and generalization ability. In the following experiments, we have also performed several tests using these ensemble models with different numbers of sub-classifiers to investigate the relationship between the performance metrics (prediction and generalization performance) and the number of classifiers.
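The five steps above can be sketched as a single search loop. The code below is a simplified illustration and not the authors' implementation: the search space `SPACE` and its bounds are hypothetical, and `stand_in_score` is a cheap placeholder for "train the decoded network for a few epochs and return its validation accuracy", so that the sketch runs without a deep learning library. The linear inertial weight and acceleration coefficient schedules, and the clipping of particles to their domains, are likewise assumptions of this sketch.

```python
import numpy as np

# Hypothetical search space: (name, lower bound, upper bound).
SPACE = [("learning_rate", 0.01, 0.1), ("momentum", 0.1, 0.9),
         ("dropout", 0.1, 0.9), ("hidden_1", 100, 150), ("hidden_2", 100, 150)]
LOW = np.array([s[1] for s in SPACE], dtype=float)
HIGH = np.array([s[2] for s in SPACE], dtype=float)

def decode(particle):
    """Convert a real-valued particle into a network configuration."""
    cfg = {name: float(value) for (name, _, _), value in zip(SPACE, particle)}
    for key in ("hidden_1", "hidden_2"):
        cfg[key] = int(round(cfg[key]))   # layer sizes must be integers
    return cfg

def stand_in_score(cfg):
    """Placeholder fitness; a real run would train briefly and validate."""
    return -abs(cfg["learning_rate"] - 0.05) - abs(cfg["dropout"] - 0.5)

def search(score_fn, n_particles=20, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    # Step (2): random population inside the parameter domains.
    x = rng.uniform(LOW, HIGH, size=(n_particles, len(SPACE)))
    v = np.zeros_like(x)
    pbest, pbest_score = x.copy(), np.full(n_particles, -np.inf)
    gbest, gbest_score = x[0].copy(), -np.inf
    for t in range(n_iter):               # step (4): stop after n_iter iterations
        # Step (3): evaluate every particle as a network configuration.
        scores = np.array([score_fn(decode(p)) for p in x])
        improved = scores > pbest_score
        pbest[improved], pbest_score[improved] = x[improved], scores[improved]
        best = int(np.argmax(pbest_score))
        if pbest_score[best] > gbest_score:
            gbest, gbest_score = pbest[best].copy(), pbest_score[best]
        # Time-varying inertial weight and acceleration coefficients.
        w = 0.9 - (0.9 - 0.4) * t / n_iter
        c1 = 2.5 + (0.5 - 2.5) * t / n_iter
        c2 = 0.5 + (2.5 - 0.5) * t / n_iter
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, LOW, HIGH)
    # Step (5) would train the final models from pbest (ensemble) and gbest.
    return pbest, pbest_score, gbest, gbest_score
```

Calling `search(stand_in_score)` returns the pbest set, from which ensemble members would be chosen by score, and gbest, which would initialize the individual classifier. Replacing `stand_in_score` with a function that builds and briefly trains a network is the only change needed to connect the loop to a real training pipeline.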

Experimental studies
In this section, we constructed several experiments to evaluate the performance of the proposed approach and investigate the influences on performance when changing the deep learning architecture and hyperparameter configuration. More specific experimental contents are organized as follows. The details of the experimental dataset, experimental setting, and data pretreatment are described in subsection 4.

The datasets, experimental setting, and data pretreatment
In this subsection, we provide the details of the datasets, experimental setting, and data pretreatment. The proposed approach mainly solves classification and regression problems. To evaluate the performance of the proposed approach on a classification problem, the MNIST dataset [63] is used. The MNIST dataset is a well-known standard benchmark and one of the most frequently used datasets for evaluating the performance of machine learning algorithms. The task on the MNIST dataset is to recognize the handwritten characters in the digital images. Each digital image of the MNIST dataset is expressed by a real-number matrix comprised of image pixels (

Results and analysis for the MNIST dataset
In this section, we evaluate the classification performance and generalization ability of the proposed approach in recognizing handwritten characters. In this experiment, two different methods were used to construct the final classification models. The first approach is called "DNN-NONPSO"; it randomly generates a group of network configurations (network structures and hyperparameter configurations) from the domains of the parameters and uses them to directly train a set of final deep learning architectures with their corresponding hyperparameter configurations on all of the training samples. The second approach is called "DNN-PSO"; it first implements an optimization scheme in which the network configuration is encoded as a real-number vector and employed as a particle (individual) of the PSO algorithm so that the algorithm search procedure can efficiently process these optimization parameters. After each update of the particle swarm information, each particle (individual) of the PSO algorithm is converted into a network configuration, which is employed to initialize a deep learning architecture and train the network with a few steps of mini-batch learning to obtain the final training loss value as the score of the individual. Finally, the determined optimum solutions are employed to construct the final models. The detailed parameter configurations for the PSO algorithm and optimization scheme are presented as follows. The population size is set to 20, and the maximum number of iterations of PSO is set to 30. The acceleration coefficients $C_{1i}$, $C_{1f}$, $C_{2i}$, $C_{2f}$ are set to 2.5, 0.5, 0.5, and 2.5, respectively, and the inertial weights $W_{min}$, $W_{max}$ are set to 0.4 and 0.9, respectively, according to the recommendations. The domains of the network structure and hyperparameter configuration are set as follows.
The search ranges of the learning rate, decay, momentum, and dropout rate are set to …
To evaluate the performance of the proposed approach on the MNIST dataset, the k-fold cross-validation technique is used. The k-fold cross-validation technique is one of the most frequently used methods for evaluating the performance of different algorithms in an unbiased manner. Because it has been widely applied in most studies, it provides a standard benchmark for evaluating our approach, and it also helps to ensure that the experimental results are not affected by other factors. The main idea of k-fold cross-validation is to randomly split the original dataset into k independent subsets. These k subsets each contain samples from all categories and contain the same number of instances. The entire procedure of k-fold cross-validation requires k independent algorithm runs; in each run, one of the k subsets is selected as the test set for classifier evaluation, and the remaining k-1 independent subsets are employed as the training set to model a classifier. Over all k runs, each subset is selected once as the test set, with the remaining k-1 subsets being employed as the training set. After all k runs, the achieved results are averaged. In the handwritten recognition experiment, 5-fold cross-validation is adopted to randomly split the MNIST dataset, which originally consists of 60000 training samples and 10000 testing samples, into 5 independent subsets such that each subset is comprised of 14000 samples and contains samples from all ten categories. In each 5-fold cross-validation run, one of the 5 independent subsets is selected as the test set, and the remaining 4 independent subsets are employed as the training set.
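The splitting procedure described above can be sketched as follows. This is an illustrative implementation that distributes the samples of each category evenly over the k folds. The function names and the `evaluate` callback are hypothetical, and when category sizes are not divisible by k the folds are only approximately equal in size.

```python
import numpy as np

def stratified_kfold_indices(labels, k=5, seed=0):
    """Split sample indices into k folds with near-equal category proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        # Shuffle the samples of this category, then deal them out to the folds.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for f, part in enumerate(np.array_split(idx, k)):
            folds[f].extend(part.tolist())
    return [np.array(sorted(f)) for f in folds]

def cross_validate(labels, evaluate, k=5, seed=0):
    """Run k train/test rounds; evaluate(train_idx, test_idx) returns one score."""
    folds = stratified_kfold_indices(labels, k, seed)
    results = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        results.append(evaluate(train_idx, test_idx))
    return float(np.mean(results)), float(np.std(results))
```

In the setting described here, `labels` would hold the 70000 pooled MNIST labels, and `evaluate` would train a classifier on the four training folds and return its accuracy on the held-out fold.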
After all 5-fold cross-validation runs, we obtain the average and standard deviation of the achieved results from the five runs of 5-fold cross-validation. Table 2 shows the classification accuracy of the candidate classifiers generated by the solutions (network configurations) from the proposed approach using the PSO algorithm and from the random approach for five runs of 5-fold cross-validation on the MNIST dataset. As shown by the results in rows 1-20 of the table, using different candidate solutions (network configurations) to train the deep learning architectures usually yields classification models with different generalization performance. In addition, comparing the performance of the DNN classifiers generated by the solutions of DNN-PSO and DNN-NONPSO shows that the solutions generated by DNN-PSO, which are converted into network configurations to train DNN classifiers, are superior to the solutions generated by DNN-NONPSO. This indicates that the PSO algorithm in combination with the steepest gradient descent algorithm, which combines the advantages of global exploration and local exploitation, can usually determine a more appropriate network structure and hyperparameters for the individual DNN classifier. These determined network configurations usually perform better in DNN network training, and the DNN classifiers trained with these hyperparameters provide better generalization ability than those obtained by the random method. The last two rows of the table display the average and standard deviation of the classification accuracies and support the above conclusion.
Table 3 displays the prediction results of the final classification models (the ensemble model and the individual DNN classifier) constructed by the proposed approach. The main reason may be that combining evidence across the multiple DNN classifiers of the ensemble model results in a classification model in which some poorly performing DNN classifiers are outvoted by the majority of well-performing DNN classifiers during the prediction phase. A more specific case can be seen in the ensemble built without the PSO-based optimization scheme. It can also be seen from Table 3 that training an ensemble model requires almost 20 times the computational resources of the individual DNN classifier, because the more DNN classifiers are trained to form an ensemble model, the more computational resources are required. Moreover, training an ensemble model and an individual DNN classifier using the PSO-based approach requires less computational time for training the final models than the DNN-NONPSO approach. This may be because a more appropriate network structure with a better hyperparameter configuration not only yields a well-performing DNN classifier with better generalization capability but also accelerates the convergence of the training phase. Table 4 presents the detailed experimental results of the ensemble models when combining different numbers of DNN classifiers for five runs of 5-fold cross-validation on the MNIST dataset. The table presents the results achieved by the ensemble models, which are constructed by directly choosing a certain number of DNN classifiers from the 20 independent DNN classifiers that have been initialized and trained by the solutions representing network configurations generated by the DNN-PSO and DNN-NONPSO approaches, respectively, without training new neural networks.
The achieved results indicate that combining evidence from n trained …
Fig 4 displays the fitness values generated by the PSO algorithm during the evolutionary procedure. As shown in these graphs, in all five runs, the fitness curves increase with increasing generations up to a certain generation, demonstrating that the PSO algorithm is able to find better solutions through its evolutionary procedure. Fig 5 displays the training errors (training loss values) generated by the individual DNN classifier initialized by the solution of the PSO-based optimization scheme and by the random method, respectively. As shown in these graphs, the training loss values from all five runs of 5-fold cross-validation indicate that the individual classifier initialized by the best solution (gbest) of the PSO algorithm and then trained with the hyperparameter configuration corresponding to that solution produces a decreasing curve over the 20 epochs. In addition, the individual classifier initialized and trained by the solution (gbest) of the PSO algorithm not only achieved a lower loss value at each epoch than the random solution but also obtained the lowest loss value in the last training phase. The observed results also indicate that a more appropriate network structure with a better hyperparameter configuration can improve the performance of both the network training phase and the final model, and that the PSO algorithm in combination with the steepest gradient descent algorithm can utilize their global and local search capabilities to automatically discover a suitable network configuration without any prior knowledge.

Results and analysis on the KCD dataset
In this subsection, we evaluate the prediction performance of the proposed approach in solving a regression problem. The prediction of biological activity is an important regression problem in computational biology and helps researchers to improve their work on drug discovery. In this experiment, the KCD datasets are used to evaluate the performance of various algorithms and to investigate the effectiveness of the proposed approach in predicting biological activity. Before performing any training procedures on the KCD datasets, we need to preprocess the data as follows. The high-dimensional descriptors (features) of each sample make the training of a deep learning architecture difficult, and hence this study uses the Principal Component Analysis (PCA) technique to reduce the number of descriptors to an appropriate amount so that the deep learning architecture can efficiently process them. All 15 targets of the KCD datasets were processed with PCA for feature reduction. After the above process, the remaining descriptors of the 15 targets of the KCD datasets are scaled into the range of (0, 1). For this regression experiment, all parameter domains were set as follows. The number of processing layers was two, and the number of neurons of each processing layer was initialized by a randomly generated number in the range of (100, 150). The sigmoid function was employed to compute the activation value of each neuron. The number of epochs for population evaluation (performing a small number of epochs of mini-batch learning to evaluate the models generated by the population of the PSO algorithm) was set to 30, and the number of epochs for final network training was set to 100. The batch size was set to 100. The domains of the learning rate, momentum, decay, and dropout rate were set to the ranges (0.01, 0.1), (0.1, 0.9), (0.00001, 0.0001), and (0.1, 0.9), respectively.
For evaluating the performance of the proposed approach on each biological dataset, a distribution ratio is used to randomly divide all of the samples of each dataset into two subsets, where 80% of the samples were employed to train a final network and the remaining samples were employed as a testing set for evaluation.
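The preprocessing and splitting steps described above can be sketched with NumPy. This is a minimal illustration under stated assumptions: PCA is computed through a singular value decomposition of the centered data, the number of retained components is a free parameter because the text does not state it, and min-max scaling maps each column onto the closed interval [0, 1]. The function names are hypothetical.

```python
import numpy as np

def pca_reduce(x, n_components):
    """Project the rows of x onto the top principal components (via SVD)."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def min_max_scale(x, eps=1e-12):
    """Rescale each column to the range [0, 1]; eps guards constant columns."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / np.maximum(hi - lo, eps)

def split_80_20(n_samples, seed=0, ratio=0.8):
    """Randomly split sample indices into training (80%) and testing (20%) parts."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    cut = int(ratio * n_samples)
    return idx[:cut], idx[cut:]
```

A typical call would be `z = min_max_scale(pca_reduce(descriptors, n_components))` followed by `train_idx, test_idx = split_80_20(len(z))`, where `descriptors` stands for the descriptor matrix of one target.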
The performance of the 20 independent DNN models generated by the solutions (pbest) of the PSO algorithm and by random solutions for the 15 datasets is presented in Tables 5, 6 and 7, which provide the detailed experimental results for datasets 1-5, 6-10, and 11-15, respectively. It can be observed that the DNN models whose network structures and hyperparameters are determined using the PSO algorithm and trained using the steepest gradient descent algorithm yield better performance than those of the random approach. In addition, as shown in these tables, the DNN models that were initialized and trained by the solutions generated by the PSO algorithm achieved the best prediction results in terms of the MAPE for almost every biological activity dataset. The last two rows of these tables show not only that the proposed approach using the PSO algorithm gives the best DNN models with good prediction performance but also that these models yielded the lowest standard deviation. These results also show that the use of PSO in combination with the steepest gradient descent algorithm can determine more appropriate network configurations (network structure and hyperparameter configuration) for initializing and training a final model with good generalization performance on regression problems. Table 8 shows the prediction results of the ensemble model with different numbers of sub-classifiers using the DNN-PSO and DNN-NONPSO approaches for the 15 KCD datasets. As shown in the table, in most cases, combining more DNN models may provide better prediction performance. In addition, selecting 18 DNN models as the members of an ensemble model yields outcomes that are only slight improvements over those of the other ensemble models.
Table 9 shows the prediction results of the ensemble model and the individual DNN classifier generated by the solutions (pbest) and (gbest) of the PSO algorithm and random approach, respectively, for 15 KCD datasets. The last two rows of the table present the average and standard deviation of accuracies generated by the two different approaches on 15 KCD datasets, respectively. We can observe that the ensemble model and the individual DN classifiers constructed by the PSO-based optimization scheme not only achieved better average prediction results but also yielded the lowest standard deviation of accuracy. In addition, using the solutions of the PSO algorithm to initialize the ensemble model and individual DNN classifier and then train these neural networks with their corresponding hyperparameter configurations requires less computational time to implement the final models compared to the random approach. Based on the above results, it is shown that the optimal network configurations can simultaneously improve the classification performance and training efficiency compared with randomly selected configurations. Figs 6, 7 and 8 respectively display the training errors of the individual classifier generated by the solution (gbest) of the PSO algorithm and random configuration for 15 KCD datasets. As shown in these graphs, gbest is employed to initialize a DNN classifier and then train the classifier with its corresponding hyperparameter configuration, which generates lower training curves than the random configuration. In addition, not only did the individual classifier using the DNN-PSO approach generate the lowest training curves, but it also obtained the lowest final training error in the last training phase.

The investigation of deeper networks using the proposed approach
In this subsection, we investigate the influence of the PSO-based optimization scheme on deeper neural networks and evaluate the effectiveness of the trained DNN classifiers on the MNIST dataset. The experimental parameter settings, the search ranges for the optimization parameters, and the parameter configurations for the PSO algorithm and final network training are the same as in the previous experiment. In addition, we adopted three hidden layers to construct a deep neural network, and the number of hidden neurons of each hidden layer was searched for in the same domain. Five runs of the 5-fold cross-validation procedure with mini-batch learning are performed to train and evaluate the DNN classifiers. After the five runs, the final results are obtained by calculating the average and standard deviation of the achieved results. Table 10 shows the detailed classification results of 20 independent DNN classifiers with three hidden layers generated by the solutions (pbest) of the PSO algorithm and by the random approach, respectively, for five runs of 5-fold cross-validation on the MNIST dataset. The same phenomenon can be seen in this table: DNN classifiers constructed from the optimal network configurations expressed by the solutions of the PSO algorithm outperform the classifiers trained using randomly generated configurations. The last two rows of the table show the average and standard deviation of the results within the table. Comparing the classification accuracies of the DNN classifiers with three and two hidden layers shows that the classifiers perform better with two hidden layers. This may be because a deeper neural network generally has a huge number of adjustable parameters (weights and biases) and therefore requires more training time or computational resources for the model to fit the training data.
The same training parameters (epochs and batch size) that were employed to train the network with two hidden layers in the previous experiments are used here to train the network with three hidden layers. As a result, this training phase may be insufficient, and more training epochs may allow the deeper neural network to reach or exceed the classification performance of the shallower architecture. Table 11 shows the prediction results of the ensemble model and the individual DNN classifier generated by the two approaches for five runs of 5-fold cross-validation on the MNIST dataset. As shown in the table, the solutions achieved by the PSO algorithm, when employed to construct an ensemble model and an individual DNN classifier, gave the best classification accuracies of 0.9845 and 0.9777, respectively. The trained ensemble model and individual DNN classifier also yielded the lowest standard deviation of accuracy. In terms of computation time in the training phase, DNN-PSO is superior to DNN-NONPSO. Based on the above results, we can conclude that the proposed approach, which uses the PSO algorithm in combination with the steepest gradient descent algorithm, can also determine optimal solutions for the deeper neural network and achieve strong results. In addition, comparing the results achieved by DNN classifiers with two and three hidden layers shows that the difference between them is small. This may be because an optimal network configuration can compensate for the difficulty of training a deeper neural network with less training time. Table 12 shows the classification accuracies of the ensemble models with different numbers of DNN classifiers generated by the DNN-PSO and DNN-NONPSO approaches for five runs of 5-fold cross-validation on the MNIST dataset. As shown in the table, higher classification performance is achieved by increasing the number of DNN classifiers.
This demonstrates the advantages of combining evidence across multiple independent classifiers for generalization performance. In addition, for the networks with both two and three hidden layers, the DNN classifiers generated by the solutions (pbest) of the PSO algorithm provide significant improvements over the random approach at each fold of the 5-fold cross-validation runs. The proposed approach provides a flexible way of determining the number of classifiers. An ensemble model with a given number of classifiers is constructed by directly choosing the classifiers with the best scores (lowest training errors) from all the trained classifiers. In this manner, the generalization capability can be maximized while the computational time of the prediction phase is reduced. The choice of the number of classifiers is therefore a trade-off between performance and complexity. In the next subsection, we investigate the influence of this choice and evaluate the resulting ensemble models in terms of classification performance and computational time for the prediction phase. Fig 9 displays the fitness value (training accuracy) curves generated by the PSO algorithm at each iteration for five runs of 5-fold cross-validation on the MNIST dataset. As shown in these graphs, the fitness values gradually increase with the number of iterations, up to a certain point. This indicates that a good solution can be found during the PSO search process. Fig 10 displays the training error curves generated by the individual DNN classifier using the DNN-PSO and DNN-NONPSO approaches. It can be observed that the DNN-PSO approach achieved lower training error curves than DNN-NONPSO in almost every fold of the 5-fold cross-validation runs.
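The evaluation protocol used in this subsection, five runs of 5-fold cross-validation followed by an average and a standard deviation, can be sketched as below. This is an illustrative sketch only: `dummy_train_and_score` is a hypothetical stand-in for training a DNN and returning its test accuracy, and pooling all 25 fold scores and using the sample standard deviation are assumptions, since the text does not specify these details.

```python
import random
import statistics


def k_fold_indices(n_samples, k, rng):
    """Shuffle the sample indices and cut them into k nearly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_cross_validation(n_samples, train_and_score, runs=5, k=5, seed=0):
    """Collect one score per fold over `runs` reshuffled k-fold splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        folds = k_fold_indices(n_samples, k, rng)
        for held_out in range(k):
            test_idx = folds[held_out]
            train_idx = [i for j, fold in enumerate(folds)
                         if j != held_out for i in fold]
            scores.append(train_and_score(train_idx, test_idx))
    # Assumption: pool all runs * k scores and use the sample standard deviation.
    return statistics.mean(scores), statistics.stdev(scores), scores


def dummy_train_and_score(train_idx, test_idx):
    """Hypothetical placeholder: a real version would train a DNN on train_idx
    and return its accuracy on test_idx."""
    return len(train_idx) / (len(train_idx) + len(test_idx))


mean_acc, std_acc, all_scores = repeated_cross_validation(100, dummy_train_and_score)
print(len(all_scores), mean_acc, std_acc)
```

Reshuffling before each run means the five runs use different fold assignments, which is what makes the reported standard deviation informative.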

The influence of the choice of number of classifiers
When constructing an ensemble model, the choice of the number of classifiers is an important issue that influences the generalization performance. Choosing an appropriate number of classifiers is a challenging task because more classifiers may yield better generalization performance but also require more computational resources in network training and prediction. This subsection investigates the influence of the number of classifiers on ensemble models constructed using the DNN-PSO and DNN-NONPSO approaches. In this experiment, all parameter configurations are the same as before except for the population size (the number of all final classifiers) of the PSO algorithm. In addition, the various ensemble models were generated by directly choosing classifiers from the candidate classifiers trained with pbest of the PSO algorithm and with the randomly generated solutions on the MNIST dataset, without training any new neural networks. Fig 11 displays the classification accuracies achieved by the ensemble models with varying numbers of classifiers (from 2 to 59). As shown in the graph, higher classification accuracy can be obtained when more DNN classifiers are employed to construct the ensemble model. In addition, for the same number of classifiers, DNN-PSO yielded better classification performance than DNN-NONPSO in almost all cases. Fig 12 displays the computational times for the training phase of the ensemble models generated by the DNN-PSO and DNN-NONPSO approaches. From the graph, we can observe that the computational time for the training phase of the ensemble model gradually increases as the number of ensemble members increases. Figs 13 and 14 display, respectively, the classification accuracy and the computational time of the training phase of ensemble models with three hidden layers.
Similar results can be seen in these graphs. Based on the above results, it can be concluded that employing more DNN classifiers to construct an ensemble model may provide better generalization capability, but the resulting long computational times in the training and prediction phases make this impractical for real-time applications. In practice, the highest classification performance is reached with a certain number of classifiers rather than with all of them, because the performance gain from adding further classifiers becomes small once that number is reached. To investigate the influence of the number and depth of the classifiers used to construct the ensemble model, the same experimental procedure and parameter configuration were used.
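Building an ensemble of a chosen size from already trained candidates, without training new networks, can be sketched as follows. The selection by lowest training error follows the description in the text; the combination rule (averaging class probabilities and taking the most probable class) and all names and toy numbers are assumptions made for illustration.

```python
def select_members(training_errors, k):
    """Indices of the k candidates with the lowest training error."""
    ranked = sorted(range(len(training_errors)), key=lambda i: training_errors[i])
    return ranked[:k]


def ensemble_predict(member_probs):
    """Average class probabilities over members and return one label per sample.

    member_probs[m][s][c] is the probability that member m assigns to
    class c for sample s.
    """
    n_members = len(member_probs)
    n_samples = len(member_probs[0])
    n_classes = len(member_probs[0][0])
    labels = []
    for s in range(n_samples):
        mean_probs = [
            sum(member_probs[m][s][c] for m in range(n_members)) / n_members
            for c in range(n_classes)
        ]
        labels.append(max(range(n_classes), key=lambda c: mean_probs[c]))
    return labels


# Toy example: four candidate classifiers, two samples, two classes.
training_errors = [0.30, 0.10, 0.20, 0.05]
candidate_probs = [
    [[0.90, 0.10], [0.90, 0.10]],
    [[0.40, 0.60], [0.70, 0.30]],
    [[0.30, 0.70], [0.60, 0.40]],
    [[0.45, 0.55], [0.30, 0.70]],
]
chosen = select_members(training_errors, k=3)
labels = ensemble_predict([candidate_probs[i] for i in chosen])
print(chosen, labels)
```

Because the candidates are reused, changing `k` only changes the cost of the prediction phase, which is the trade-off between performance and complexity discussed above.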

Conclusion and future work
In this paper, a new automatic hyperparameter selection approach is proposed to find a more appropriate network structure and hyperparameter configuration for deep neural network training. The main idea of the proposed approach is to combine the global exploration capability of particle swarm optimization (PSO) with the local exploration capability of the steepest gradient descent algorithm in a hybrid search procedure. Because the performance of deep network classifiers depends strongly on their network structure and hyperparameter configurations, we aim to optimize these configurations through an efficient parameter optimization scheme using PSO in combination with the steepest gradient descent algorithm. After the parameter optimization scheme has been run, the final solutions (pbest) and (gbest) of the PSO algorithm, which carry the important network configuration information, are used to initialize and train a final ensemble model and an individual DNN classifier, respectively. In addition, the proposed approach provides a flexible method that allows users to choose a threshold for the number of classifiers in order to construct a light ensemble model. In this manner, the trade-off between the generalization capability and the model complexity can be addressed; this is discussed in the previous subsections, in which several experiments were performed on ensemble models with different numbers of classifiers. We have conducted experimental studies on classification and regression problems by evaluating the performance of the proposed approach on the handwritten character and biological activity prediction datasets, respectively.
The experimental results demonstrated that the proposed approach can find a more appropriate network structure with a better hyperparameter configuration. DNN classifiers initialized and trained with these configurations achieve excellent performance both during the training phase and as final models (after training). Therefore, the proposed approach can be regarded as an automatic hyperparameter optimization tool for deep learning architectures that does not require any prior knowledge. In our future work, we would like to extend our approach to optimize more deep learning architectures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs); because the proposed approach is flexible in how it initializes and trains a neural network, such an extension would require only a small modification. In addition, we would like to use other advanced evolutionary algorithms, or develop a new algorithm, to replace PSO in the proposed approach for network configuration optimization, since the optimization scheme runs the algorithm in a wrapped manner and any population-based optimization algorithm can be employed.
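The wrapped search summarized above can be illustrated with a minimal global-best PSO loop. This is a sketch under stated assumptions, not the authors' implementation: the inertia and acceleration coefficients, the two-dimensional search space, and `toy_fitness` are all hypothetical. In the proposed approach, the fitness of a particle would instead be the training accuracy of a DNN built from that configuration and trained for a few epochs with steepest gradient descent.

```python
import random


def pso_search(fitness, bounds, n_particles=20, n_iterations=30,
               inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize `fitness` over a box-bounded real-valued search space."""
    rng = random.Random(seed)
    dim = len(bounds)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in positions]
    pbest_fit = [fitness(p) for p in positions]
    best = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]

    for _ in range(n_iterations):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (inertia * velocities[i][d]
                                    + c1 * r1 * (pbest[i][d] - positions[i][d])
                                    + c2 * r2 * (gbest[d] - positions[i][d]))
                # Keep the particle inside the finite search space.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            fit = fitness(positions[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = positions[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = positions[i][:], fit
    return pbest, pbest_fit, gbest, gbest_fit


def toy_fitness(config):
    """Hypothetical stand-in for 'train briefly, return training accuracy'."""
    hidden_units, log10_lr = config
    return -((hidden_units - 300.0) / 500.0) ** 2 - (log10_lr + 2.0) ** 2


# Assumed search space: number of hidden units and log10 of the learning rate.
bounds = [(10.0, 1000.0), (-4.0, 0.0)]
pbest, pbest_fit, gbest, gbest_fit = pso_search(toy_fitness, bounds)
print(gbest, gbest_fit)
```

The returned `pbest` list corresponds to the configurations used for the ensemble members and `gbest` to the configuration used for the individual classifier. Since only `fitness` touches the network, replacing `pso_search` with another population-based optimizer leaves the rest of the procedure unchanged.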