An approach on the implementation of full batch, online and mini-batch learning on a Mamdani based neuro-fuzzy system with center-of-sets defuzzification: Analysis and evaluation about its functionality, performance, and behavior

Due to rapid technological evolution and the accessibility of communications, the data generated from different sources of information show exponential growth. That is, the volume of data samples that need to be analyzed keeps getting larger, so processing methods have to adapt to this condition, focusing mainly on keeping the computation efficient, especially when the analysis tools are based on computational intelligence techniques. As is known, without good control over the volume of data being handled, techniques based on iterative learning processes can impose an excessive computational load and take a prohibitive amount of time to find a solution that may not even come close to the desired one. The learning methods known as full batch, online, and mini-batch represent a good strategy for this problem, since they orient the processing of data according to the size or volume of the available data samples that require analysis. In this first approach, synthetic datasets of small and medium volume were used, since the main objective is to define the implementation of these methods and, in the experimentation phase, to obtain information through regression analysis that allows us to assess the performance and behavior of the different learning methods under distinct conditions. To carry out this study, a Mamdani-based neuro-fuzzy system with center-of-sets defuzzification and support for multiple inputs and outputs was designed and implemented, with the flexibility to use any of the three learning methods, which were implemented within the training process. Finally, the results show that the learning method with the best performance was mini-batch, when compared to the full batch and online learning methods.
The results obtained by the mini-batch learning method are as follows: a mean correlation coefficient R̄ of 0.8268 and a mean coefficient of determination R̄² of 0.7444. It is also the method with the best control of the dispersion between the results obtained from the 30 experiments executed per dataset processed.


Introduction
Currently, different sectors of services and products require systems based on simulation mechanisms for decision making, with the objective of providing greater accuracy and the processing of multiple critical variables in huge volumes, as well as the ability to discover hidden relationships, in order to extract valuable insights and knowledge from them. The capability to build effective solutions that can cope with the complexity intrinsic to data becomes increasingly necessary.
the third layer is dedicated to the definition of rules, and the last layer generates the outputs. Its optimization algorithm is gradient descent with the back-propagation algorithm for the adjustment of parameters (weights). It supports the diagnosis of different diseases; it is an effective classifier of electrocardiogram signals that helps detect ischemic heart disease [17][18].
• Traffic control: It is a generic self-organized neuro-fuzzy system. The architecture of this NN is composed of five layers, and the principal activities executed in each layer are: first layer (fuzzification), second layer (antecedents), third layer (rule base), fourth layer (consequent derivation), and fifth layer (output defuzzification). It is useful for safe and effective traffic management on the roads, and also in the definition of tactical maneuvers (lane change, overtaking, vehicle tracking), collision forecasting, among others [19].
Other intelligent hybrid systems that have also been successful are: the fuzzy support vector machine (FSVM), giving support to class imbalance issues [20]; the artificial immune system with genetic algorithm (AIS-GA), aimed at automated diagnosis systems [21]; genetic algorithm with particle swarm optimization (GA-PSO), used for gene selection [22]; and deep learning with extreme learning machine (DELM), used in EEG classification [23].
The principal contribution of this study is to define the implementation of three learning methods: full batch, mini-batch, and online, and, through experimentation, to obtain information that allows us to assess their performance and behavior under distinct conditions, as well as, through the implementation of a neuro-fuzzy system, to seek the optimal adjustment of the parameters through learning and ease of interpretation through Mamdani fuzzy rules.
In this article, we explore the different concepts necessary for understanding both the learning methods and the neuro-fuzzy system implemented. Its sections are organized as follows: the hybrid intelligent systems section addresses the generalities of artificial neural networks, fuzzy logic, fuzzy inference systems, and learning methods; it is followed by the description of the proposed model, then by the explanation of the experiments performed and the results obtained through regression analysis, and by a discussion regarding the performance, accuracy, and stability observed in our proposed model and in other existing models; the last section is dedicated to highlighting the findings and future work directions.
predicting the output. The output signals between nodes of the layers are generated through an activation or transfer function. The intermediate layers are known as hidden layers and are considered essential because they endow the ANN with the ability to learn the relationships in the data. Finally, in the output layer, the results obtained are compared with the target sample to know if the task has been achieved; these tasks can be classification, clustering, prediction, estimation, among others [27], as can be observed in Fig 1. The functional properties of ANNs are: the mapping process is forward, and the weights are adjusted iteratively after each training pass and stored until the desired error is achieved or the total number of epochs is executed. To calculate the error, the backpropagation algorithm is implemented, which runs in the direction inverse to the feedforward process; one of the most used optimization algorithms is gradient descent, applied in order to minimize the error [28].
ANNs are characterized by their adaptability and by parallel and distributed processing; the processing functions from the input to the output can have a linear, semi-linear, or non-linear behavior, and the neurons can be defined and distributed according to the needs of the problem [29]. They have the ability to approximate with varying degrees of accuracy and to recognize hidden patterns in complex and inaccurate data. They are widely used in control, prediction, and classification problems [30].
Different ANN architectures have been developed. Hopfield networks, proposed in 1982 by John Hopfield [31], are considered among the simplest because they consist of a single layer [32]. Multi-layer perceptron networks consist of three kinds of layers: input, hidden, and output; the hidden part can be constituted by different numbers of layers, each of which can contain different numbers of neurons [33]. Self-organizing maps, also known as Kohonen maps, are known for their grouping, visualization, and classification capabilities, and use unsupervised, competitive learning [34]. Extreme learning machines are feedforward neural networks; their learning principle is essentially a linear model, they have good generalization performance, and their learning is faster than that of networks trained with backpropagation [35]. Convolutional neural networks, like ANNs, have neurons that self-optimize through learning; one of their significant differences with respect to traditional ANNs is that their neurons are organized in three-dimensional layers, composed of the input dimensionality (height and width) and depth [36]. Finally, deep learning considers two key factors: non-linear processing in multiple layers, using supervised and unsupervised learning [37].
The importance of ANNs lies in their stable functional and operational capability: they are tolerant of partial, noisy, and missing information, and can solve complex problems in the absence of a mathematical representation or model [28].
Gradient descent algorithm. Gradient-based algorithms are among the most used for the optimization of the error function and the adjustment of parameters in the training stage of a neural network. They are considered first-order methods; this refers to how much the function decreases or increases according to its first derivative at a specific starting point, thus tracing a tangent line over the error surface from the established initial point.
This method can define a vector of feasible descent directions through an iterative process based on information derived from an objective function; it mainly seeks to adjust the parameters (weights) and minimize the model error during the learning process over a multidimensional input space, in order to approach a pseudo-optimal solution.
The fundamental components of the implementation of the gradient descent algorithm are: • The error function, identified as a cost, which results from the difference between the estimated response ŷ and the known response y; one of the most used error measures is the Sum of Squared Errors (SSE), which is expressed in Eq (1).
• The gradient vector, built through an efficient method known as the backpropagation algorithm, in which the partial derivatives of the error function with respect to all the parameters of each layer are propagated iteratively, in the direction inverse to the calculation of the inter-layer output signals processed in the feedforward stage; this can be expressed as shown in Eq (2).
Where g(ξ) is the gradient vector over all parameters, E is the error function, and ξ are all the parameters that will be adjusted.
• The generalization of the delta rule, also known as the backpropagation learning rule, is the change applied to all the parameters to be updated or adjusted; this change is obtained by applying a learning rate (which allows control of the descent, that is, the adjustment velocity) to the gradient vector, as shown in Eq (3).
Where r_ξ is the directional change, −η is the negative learning rate, and g(ξ) is the gradient vector; this directional change is then applied to the current parameters, to obtain the new adjusted parameters and continue with the iterative learning process, as shown in Eq (4).
It is easy to implement and shows good results in nonlinear optimization; however, one of its disadvantages is its performance, since it becomes slow and computationally costly when dealing with high-dimensional datasets.
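To make Eqs (1) through (4) concrete, the following minimal sketch (illustrative code, not the system described in this paper) applies gradient descent with an SSE cost to a one-parameter linear model; the data, learning rate, and epoch count are assumptions chosen for the example.

```python
import numpy as np

# Illustrative gradient descent with an SSE cost (Eqs (1)-(4)) for a
# one-parameter linear model y_hat = w * x; data and rates are assumptions.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                    # known responses; the true weight is 2.0

w = 0.0                        # initial parameter (weight)
eta = 0.01                     # learning rate

for epoch in range(200):
    y_hat = w * x                             # estimated response
    grad = -2.0 * np.sum((y - y_hat) * x)     # dE/dw from SSE, Eqs (1)-(2)
    r = -eta * grad                           # directional change, Eq (3)
    w = w + r                                 # parameter update, Eq (4)

print(round(w, 3))             # approaches the true weight 2.0
```

Note that the learning rate must be small enough for the iteration to contract toward the minimum; a rate that is too large would make the updates diverge, as discussed in the hyperparameters section.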

Different learning methods based on size and partition of the samples for its processing during the training stage
In order to lighten the computational burden, find better global minima, and therefore achieve better convergence, there are different learning methods based on the size and partitioning of the samples, as well as on how the gradient vector is processed and the learning rule is applied. These are [38-40]: • Batch training or full batch learning method: In this method, the partial derivatives of the error with respect to the parameters are calculated and accumulated for each processed data point; that is, the entire training sample must be processed in order to build the gradient vector, after which the learning rule that updates or adjusts the parameters of the model can finally be applied. It is observed that, for samples with a small volume (number of instances), the performance of this method is stable and its convergence acceptable; however, it tends to stagnate in local minima, and when the number of instances increases, its computation time becomes excessive, it becomes impractical, it cannot be implemented in a parallel environment, and its computational and memory expense becomes prohibitive. This behavior can be expressed as shown in Eq (5), where y_p is the desired output, ŷ_p is the estimated output, and g(ξ,b) is the gradient vector constructed from the partial derivatives of the error function E with respect to all its parameters ξ and bias b. Finally, Eq (6) represents the learning rule, where η is the learning rate and ξ_old corresponds to the current parameters, to which the change obtained from the learning rate and gradient vector is applied, generating the update represented by ξ_new.
The following pseudo-code represents the functional behavior of this method:
1. while epoch number does not reach its defined maximum
2.     for-each data in training sample
3.         The gradient is calculated for all parameters g(ξ_p, b), ∀p = 1, ..., q
4.         The obtained gradient is accumulated: g(ξ, b) = Σ_{p=1}^{q} g(ξ_p)
5.     end
6.     The cumulative gradient and the learning rate are used to update the parameters
7. end
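The full batch procedure above can be sketched as follows; this is an illustrative least-squares example (the data, model, and hyperparameters are assumptions, not the neuro-fuzzy system of this paper) in which the accumulated gradient is averaged for a stable step size and a single update is made per epoch.

```python
import numpy as np

# Illustrative full batch learning (Eqs (5)-(6)): the gradient is
# accumulated over the ENTIRE training sample before one update per epoch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                           # noiseless targets for the example

w = np.zeros(3)
b = 0.0
eta = 0.05

for epoch in range(300):
    grad_w = np.zeros(3)
    grad_b = 0.0
    for xp, yp in zip(X, y):             # for-each data in training sample
        err = (xp @ w + b) - yp
        grad_w += 2.0 * err * xp         # accumulate the gradient, Eq (5)
        grad_b += 2.0 * err
    w -= eta * grad_w / len(X)           # single update per epoch, Eq (6)
    b -= eta * grad_b / len(X)

print(np.round(w, 2))                    # approaches true_w
```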
• Online training or online learning method: in this method, the design parameters are updated every time a data point is processed, that is, for each data point in the sample within the same epoch. This procedure leads to the calculation of an approximate gradient, its main advantage being the increase in speed. It is adaptable because it does not rely on the distribution of the data and, unlike the batch learning method, it greatly decreases the memory load and computing cost of the data, and it can work in real-time environments. However, its descent is somewhat unstable, due to the noise implicit in each data point of the processed sample (high variability), which can also be beneficial, since its jumps can be interpreted as reaching potentially better local minima and even a global minimum. This behavior can be represented as shown in Eq (7), where g(. . .) represents the gradient vector, and the parameters within the function are the design parameters ξ, the bias b, the inputs x^(p), the desired output y^(p), and p, the index or position of the data point to be processed.
The following pseudo-code represents the functional behavior of this method:
1. while epoch number does not reach its defined maximum
2.     for-each data p in training sample, p = 1, ..., q
3.         The gradient is calculated for all parameters g(ξ, b, x^(p), y^(p))
4.         The obtained gradient and the learning rate are used to update the parameters
5.     end
6. end
• Mini-batch training or mini-batch learning method: This method improves on difficulties presented by both the full batch and the online learning methods. On the one hand, it reduces the high variability generated when the gradient vector is calculated from every single data point, as happens when the online learning method is executed; on the other hand, it avoids using the whole sample to build the gradient vector, as in the full batch learning method, which causes a high computational cost by loading the full sample to be processed into memory. The mini-batch learning method takes advantage of both methods by partitioning the data sample into smaller data samples; this combination can lead to a stable descent, greater velocity, and a reduction of variability. It is also widely parallelizable, so it can be executed in a distributed manner, and it is recommended for use in big data or deep learning environments.
This method accumulates the partial derivatives for each processed mini-batch, where the gradient vector is constructed from the error function averaged over the defined mini-batch, with respect to all the parameters to be updated or adjusted. This behavior can be represented by Eq (8), where g(. . .) represents the gradient vector, ξ the parameters to be updated or adjusted, and b the bias; the following arguments correspond to each mini-batch to be processed: x^(p:bs) are the inputs, y^(p:bs) the desired outputs, p the index or position of the data point in the mini-batch, and bs the size of the mini-batch to be processed. Finally, Eq (9) represents the learning rule that allows the adjustment or update of the parameters.
ξ_new = ξ_old − η g(ξ, b, x^(p:bs), y^(p:bs))    (9)
The following pseudo-code represents the functional behavior of this method:
1. while epoch number does not reach its defined maximum
2.     while number of mini-batches does not reach the total limit of mini-batches
3.         for-each data in current mini-batch training sample
4.             The gradient is calculated for all parameters
5.             The obtained gradient of the processed mini-batch is accumulated
6.         end
7.         The cumulative gradient and the learning rate are used to update the parameters
8.     end
9. end
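The contrast between the online update of Eq (7) and the mini-batch update of Eqs (8) and (9) can be sketched side by side on a common least-squares problem; all data, batch sizes, and hyperparameters here are illustrative assumptions, and the mini-batch gradient is averaged over the batch for a stable step size.

```python
import numpy as np

# Illustrative comparison: online update (Eq (7)) vs mini-batch update
# (Eqs (8)-(9)) on the same noiseless least-squares problem.
rng = np.random.default_rng(1)
X = rng.normal(size=(128, 2))
y = X @ np.array([3.0, -1.0])                 # true weights [3, -1]

def online_epoch(w, eta=0.05):
    for xp, yp in zip(X, y):                  # update after EVERY data point
        w = w - eta * 2.0 * ((xp @ w) - yp) * xp
    return w

def minibatch_epoch(w, bs=16, eta=0.05):
    for start in range(0, len(X), bs):        # update after every mini-batch
        xb, yb = X[start:start + bs], y[start:start + bs]
        grad = 2.0 * xb.T @ ((xb @ w) - yb) / bs   # averaged over the batch
        w = w - eta * grad
    return w

w_on, w_mb = np.zeros(2), np.zeros(2)
for _ in range(50):
    w_on = online_epoch(w_on)
    w_mb = minibatch_epoch(w_mb)
print(np.round(w_on, 2), np.round(w_mb, 2))   # both approach [3, -1]
```

The online variant makes many noisy updates per epoch, while the mini-batch variant makes one averaged update per partition, which is the compromise described in the text.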
Fuzzy logic system (FLS). As early as 1920, Łukasiewicz argued that the values in logical systems amounted to a logic with continuous values [41]. By 1965, Zadeh crystallized his idea of cointensive indefinability, which he described as a qualitative measure of the proximity of meanings to precision, and from which he created his concept of degree of membership, one of the fundamental bases for the development of the theory of fuzzy sets [42,43].
FL was introduced in 1975 by Zadeh in a paper titled "Fuzzy logic and approximate reasoning". Its inspiration is based on the reasoning of the human mind, which is approximate rather than exact, giving more importance to the meaning of the resulting information than to its precision [44]. For example, when an object is about to fall on a person's head, the important information for this person is to know that an object will fall on him, not the weight, shape, trajectory, and speed with which it will fall.
The definition of a complex behavior cannot be expressed precisely; instead, we need a system that can tolerate inaccuracies, incomplete information, perceptions, experiences, and judgments. For this reason, FL requires concepts such as fuzzy sets, linguistic variables, and if-then rules to build a robust system.
A fuzzy set is defined by a continuous function over the universe of discourse X, whose range is the interval [0,1]; in this context, classical binary logic can be seen as a particular case of fuzzy logic (FL). Such a continuous function is known as a membership function (MF); it is denoted μ_A(x) and is called a type-1 MF, where A is a fuzzy set over the continuous universe X, and can be expressed as (10). The value of μ_A(x) is called the degree of membership, or membership grade, of x in A. The distributions most commonly used for MFs are triangular, trapezoidal, piecewise linear, Gaussian, and bell-shaped.
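As a minimal illustration of a type-1 MF, the following sketch evaluates a Gaussian membership function, the distribution also used later in the proposed architecture; the mean and standard deviation values are arbitrary assumptions.

```python
import numpy as np

# Illustrative type-1 Gaussian membership function mu_A(x): degrees of
# membership lie in [0, 1] over the universe of discourse X.
def gaussian_mf(x, mean, sigma):
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2)

x = np.linspace(0.0, 10.0, 5)                # sample of the universe X
mu = gaussian_mf(x, mean=5.0, sigma=1.5)     # assumed parameters
print(np.round(mu, 3))   # membership peaks at 1.0 when x equals the mean
```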
Linguistic variables allow knowledge to be represented in approximate reasoning. The values of these variables are words or sentences in natural language. They are characterized by a quintuple, expressed as (X,T(X),U,G,M), where X is the name of the linguistic variable, T(X) is the collection of its linguistic values, U is the universe of discourse (or underlying numeric domain), G is a context-free grammar, and M is a semantic rule that associates each linguistic value with its meaning.
The representation of knowledge is implemented as propositions in the form of if-then rules, known as fuzzy rules. These rules are expressed as IF ⟨antecedent⟩ THEN ⟨consequent⟩, where both the antecedent and the consequent are fuzzy propositions that contain linguistic variables. Since fuzzy sets do not have a finite set of possibilities defined for each input, it is necessary to express their operators as functions over all possible fuzzy values. These operators are expressed, for all fuzzy sets A and B, as: • Intersection (operator AND): its generalized form is known as the T-norm (11).
• Union (operator OR): its generalized form is known as the T-conorm (12). • Complement (operator NOT): where the fuzzy set Ā denotes the complement of the fuzzy set A (13).
The T-norm and T-conorm are considered generalized conjunction and disjunction, respectively, and are used in fuzzy implication. Implication is one of the major connectives in any logical system, and it has a very serious influence on the performance of the systems in which fuzzy logic techniques are employed [45].
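The operators above can be sketched with the common min/max/complement choices for the T-norm and T-conorm; other valid choices exist (product, probabilistic sum, etc.), so these are illustrative assumptions rather than the only definitions.

```python
import numpy as np

# Illustrative fuzzy set operators using the common min/max/1-x choices.
def t_norm(mu_a, mu_b):       # intersection (operator AND)
    return np.minimum(mu_a, mu_b)

def t_conorm(mu_a, mu_b):     # union (operator OR)
    return np.maximum(mu_a, mu_b)

def complement(mu_a):         # complement (operator NOT)
    return 1.0 - mu_a

# Membership degrees of the same elements in two fuzzy sets A and B:
a = np.array([0.2, 0.8, 0.5])
b = np.array([0.6, 0.3, 0.5])
print(t_norm(a, b), t_conorm(a, b), complement(a))
```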
Fuzzy inference system (FIS). It is a framework based on the concepts of fuzzy logic, fuzzy set theory, and fuzzy rules, which has been successful in application areas such as control, decision support, and system identification, among others. Its strength lies in its ability to handle linguistic concepts (modeling natural language); it is considered a universal approximator and is able to perform nonlinear mapping between inputs and outputs [46].
The general architecture of a FIS is based on the following components: fuzzifier, fuzzy inference engine, and defuzzifier, as shown in Fig 2 [47][48].
The details of each FIS component are described below [47]:
• Fuzzifier: in charge of converting crisp values from the universe of discourse and determining the membership degree of these inputs to the associated fuzzy sets, through mathematical procedures. For example, let A and B be two fuzzy sets and X the universe of discourse; the fuzzification process takes the received values a,b ∈ X and produces a membership degree, which can be expressed as follows (14).
• Fuzzy inference: the process where FL is used through its fuzzy rules, membership functions, and fuzzy implication operators, with the aim of mapping input values to outputs. The flow of this process is as follows: the fuzzified inputs are mapped to a rule base; in the antecedent part of the proposition, the input sets can be combined through fuzzy operators to generate a compound fuzzy proposition, while the consequent part is determined by the degrees of membership of the input sets and their relationships. The rule base can be seen as follows (15).
• Aggregation: in this process, all the fractional membership functions resulting from the consequent part are combined, with the objective of obtaining a consolidated fuzzy set. In this phase, a singleton value is determined for each y_i ∈ B_i^k, generally using the max operator, which can be expressed as follows (16), where a_i^k is the rule firing of each output y in the consequent.
• Defuzzifier: the process in charge of converting a fuzzy set into a crisp number as the output of the fuzzy system; this value can be used in an expert system to make a decision or in a controller to exercise an action. In a fuzzy system with more than one output variable, the defuzzification process is carried out for each output.
There are many different defuzzification methods. One of them is a variation of the max criterion method, whose most popular representative is MOM (Mean of Maxima); in this method, the final output value is calculated by averaging all the output values with the highest membership values. The equation that represents this behavior is (17), where M is the set that contains all the maximum membership values, represented by x_i, and l is the cardinality of the set M.
Other popular methods are those based on centers. In Center of Sums (CoS), the geometric center of area of each membership function is calculated first. The equation that represents this behavior is (18), where CoA_i is the geometric center of area of the scaled membership function of the i-th rule, weighted by its identified firing strength, n is the number of scaled membership functions, and area_i is the area of the i-th scaled membership function.
In Center of Sets (COS), a singleton centroid is located for each rule consequent, and its firing level is also required. The equation that represents this behavior is (19), where c^l is the centroid of the l-th consequent set, and f^l is its firing level, given by a fuzzy value contained in x′. Another centroid-based method is Center of Gravity (COG), in which the COG is calculated over a series of continuous points of the scaled membership function to find a representative point of the fuzzy set. This method can be expressed mathematically as follows (20), where A is the sub-area of the evaluated fuzzy set, [a,b] is the interval of the sub-area of A, and x samples the values in this interval of the sub-area of A.
There are further defuzzification methods, but only those based on centers and centroids were shown, because the proposed neuro-fuzzy architecture is based on the center-of-sets defuzzification method.
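A minimal sketch of the center-of-sets computation of Eq (19), assuming singleton centroids and firing levels already obtained from the inference stage; the numeric values are illustrative.

```python
import numpy as np

# Illustrative center-of-sets (COS) defuzzification, Eq (19): the crisp
# output is the firing-level-weighted average of the singleton centroids
# located for each rule consequent.
def center_of_sets(centroids, firing_levels):
    f = np.asarray(firing_levels, dtype=float)
    c = np.asarray(centroids, dtype=float)
    return np.sum(f * c) / np.sum(f)

# Three fired rules with consequent centroids 2.0, 5.0 and 8.0:
y = center_of_sets([2.0, 5.0, 8.0], [0.1, 0.6, 0.3])
print(round(y, 3))   # 5.6
```

Because the output is a weighted average of the centroids, it always lies between the smallest and largest consequent centroid, which is what makes this method well suited to the adjustable-centroid architecture proposed later.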
The two most used FIS are the Mamdani type and the Sugeno type, and their most relevant difference is in the output, since the Mamdani type generates fuzzy sets while the Sugeno type uses linear functions or constants in the consequent.

Description of proposed model
A Mamdani-based neuro-fuzzy system with center-of-sets defuzzification was designed and implemented, providing the flexibility to handle multiple inputs and outputs and to use any of the different learning methods (batch, online, or mini-batch), since the existing neuro-fuzzy systems only support the batch learning method. As mentioned, what this study seeks is to establish a reference with respect to the implementation of learning methods that differ in the size and partitioning of the samples, in whether the calculated gradient vector is accumulated, and in the way the parameters are updated or adjusted during the training stage. For this reason, we opted for the development of a neuro-fuzzy system that would allow us to experiment with all three learning methods: full batch, online, and mini-batch.
Description of the architecture of the Mamdani-based neuro-fuzzy system with center-of-sets defuzzification. The proposed neuro-fuzzy system was defined as feedforward for the calculation of the output signals, with its layers defined as follows. The zero layer corresponds to the inputs (which can range from 1 to n). The first hidden layer consists of adaptive nodes; they are considered adaptive because they contain design parameters that will be adjusted through the iterative backpropagation process, in which the error is calculated from the output and propagated between the layers in the direction inverse to the feedforward pass. Within each node, a Gaussian function has been defined, whose adjustable parameters are its mean and standard deviation; the number of rules (r), established beforehand, corresponds to the number of nodes contained in both layer one and layer two.
The output signals of the first hidden layer are the inputs to the second hidden layer, which consists of fixed nodes where a normalization of these signals is processed. Finally, to calculate the output signals of the last layer, defined by m outputs, both the output signals of hidden layer two and its centroids are required (the latter are also considered design parameters, so they will also be adjusted during training). All this behavior can be observed in Fig 3.
The above defines the elements and behavior of the neural network for the optimization of the design parameters; on the other hand, the elements and behavior of the fuzzy part of this system must also be defined. We begin by defining the Mamdani-based knowledge base, as shown in the generic rule (21).
Where R^k corresponds to the k-th rule; the antecedent part is defined by the inputs x_i^r, ∀i = 1, ..., n, and the firing forces F_i^k, ∀i = 1, ..., k; the consequent part is defined by the outputs y_i^m, ∀i = 1, ..., m and the generic centroids CG_m^k, ∀m = 1, ..., k. Hereunder, each of the layers defined in the architecture of the proposed neuro-fuzzy system is detailed: • Layer 0: this layer holds the input matrix (22), which is used to calculate the firing force. The sub-index i_0 corresponds to the inputs in layer zero and p is the data index for each input.
• Layer 1: in this layer, the firing force is calculated from the inputs a^0_{i0,p} and from the design parameters adjustable during the iterative backpropagation process, which are m_{i1,i0} (mean) and σ_{i1,i0} (standard deviation). The adaptive nodes of this layer contain a Gaussian function, defined as shown in Eq (23).
Where the sub-index i0 corresponds to the inputs of layer 0, the sub-index i1 corresponds to the neurons in layer 1, r is the number of rules in the layer, p is the input data index, and q is the total number of data points. • Layer 2: in this layer, the firing forces are normalized (24) from the output signals of the previous layer.
Where a^1_{i1,p} is the output signal of layer 1. • Layer 3: in this layer, the output signals a^3_{i3,p} are obtained from the output signals of the previous layer a^2_{i2,p} and the centroids C_{i3,i2}, which are part of the design parameters (25).
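The feedforward pass through layers 0 to 3 can be sketched as follows; the shapes, random initial values, and vectorized form are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

# Illustrative feedforward pass through layers 0-3: Gaussian firing
# forces (Eq (23)), normalization (Eq (24)), and centroid-weighted
# outputs (Eq (25)). Shapes and initial values are assumptions.
def forward(x, means, sigmas, centroids):
    # Layer 0 -> 1: firing force of each of the r rules for input vector x
    a1 = np.exp(-0.5 * np.sum(((x - means) / sigmas) ** 2, axis=1))  # (r,)
    # Layer 1 -> 2: normalized firing forces (they sum to 1)
    a2 = a1 / np.sum(a1)                                             # (r,)
    # Layer 2 -> 3: m outputs as centroid-weighted sums (center-of-sets)
    return centroids @ a2                                            # (m,)

r, n, m = 4, 2, 1                        # rules, inputs, outputs (assumed)
rng = np.random.default_rng(0)
means = rng.normal(size=(r, n))          # design parameters: means
sigmas = np.ones((r, n))                 # design parameters: std. deviations
centroids = rng.normal(size=(m, r))      # design parameters: centroids
y_hat = forward(np.array([0.5, -0.2]), means, sigmas, centroids)
print(y_hat.shape)
```

Since the normalized firing forces form a convex combination, each output necessarily lies between the smallest and largest centroid of that output, which follows from the center-of-sets structure of layer 3.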
This whole trajectory is known as the feedforward pass; once the last layer is reached, the backpropagation process begins, to calculate the partial derivatives of the error with respect to all the design parameters: • Error calculation: obtained from the difference between the desired output and the estimated output signal (26).
• Squared error per data point: the obtained error is individually squared for each data point (27).
• Sum of the squared errors: a summation of the squared errors of each data point is made to obtain a total error (28).
After the feedforward routine is finished, the iterative backpropagation process begins; this is the procedure to find the gradient vector (33). Its inverse trajectory is done as follows: • Layer 3: the error ε^3_{i3,p} is calculated (29); this is the derivative of the error measure E_p.
In this same layer, the partial derivative of the error with respect to the centroid C_{i3,i2} is evaluated, using the output of layer 2 (30).
• Layer 2: the error ε^2_{i2,p} is calculated from the centroids and the error of layer 3 (31).
• Layer 1: the error ε^1_{i1,p} is calculated from the difference between the error of the previous layer ε^2_{i2,p} and the product of the error of that same layer and the output signal of the previous layer a^2_{i2,p}, divided by each of the output signals of the current layer a^1_{i1,p} (32).
The derivatives of the mean parameters (33) and of the standard deviation parameters (34) are calculated; these serve to build the gradient vector and, finally, to update the design parameters with the new directional vector.
Once the partial derivatives of the error with respect to the mean, standard deviation, and centroid parameters are obtained, the gradient vector is built and a learning rate is applied (35) to generate the directional change with which all the parameters are updated or adjusted (36).
Where ξ ∈ {σ, m, C} is the vector of all the design parameters of the Mamdani-based neuro-fuzzy system with center-of-sets defuzzification, and −η corresponds to the negative learning rate, which is part of the training parameters.
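As an illustration of the backward step, the following sketch applies the chain rule of Eqs (29) and (30) and the update rule of Eq (36) to the centroid matrix alone, assuming an SSE error and fixed normalized firing forces; the values and the learning rate are assumptions.

```python
import numpy as np

# Illustrative backpropagation update for the centroid parameters C
# (Eqs (29), (30) and (36)), assuming SSE error E = sum(e^2) with
# e = y_hat - y, and fixed normalized firing forces a2 from layer 2.
def update_centroids(C, a2, y, eta=0.1):
    y_hat = C @ a2                      # layer 3 output, Eq (25)
    eps3 = 2.0 * (y_hat - y)            # dE/dy_hat, Eq (29)
    grad_C = np.outer(eps3, a2)         # dE/dC = eps3 * a2, Eq (30)
    return C - eta * grad_C             # delta rule update, Eq (36)

C = np.array([[0.0, 0.0, 0.0]])         # m = 1 output, r = 3 rules (assumed)
a2 = np.array([0.2, 0.5, 0.3])          # normalized firing forces (sum to 1)
y = np.array([1.0])                     # desired output
for _ in range(100):
    C = update_centroids(C, a2, y)
print(np.round(C @ a2, 3))              # output approaches the target 1.0
```

In the full system the means and standard deviations receive analogous updates from Eqs (32) through (34); only the centroid update is shown here because its chain rule is the shortest.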
Hyperparameters and functionality description. As is known, hyperparameters are those that must be configured prior to the execution of training. Although there are recommendations for choosing appropriate values, nothing is conclusive. In the case of the learning rate, it is known that a very low value leads to a slow and computationally expensive training, since many iterations are required to approach a quasi-optimal solution and, depending on the complexity of the dataset, it may never be reached; if the established value is very high, the search tends to diverge from the solution area. In the experimentation carried out, to palliate this problem, the best values of each hyperparameter were located for each dataset analyzed, by establishing a range of values and searching among them for those that provided better results with respect to the minimization of the error function. The hyperparameters required by the proposed neuro-fuzzy system are: • Training parameters, defined in order to control the training duration, as described below: ■ Total number of epochs: the maximum limit of iterations that will be carried out during training.
■ Goal error, the error value at which training is considered to have reached its goal.
■ Learning rate (η), this parameter determines how fast or slowly the gradient vector moves towards the optimal parameters.
■ Momentum (mc), the fraction of the previous parameter change carried over, which smooths oscillations in the trajectory by increasing or decreasing the parameter change in each iteration [49].
■ Maximum validation failures number is the maximum limit of failures allowed in the validation process.
■ Maximum error increase, the maximum allowed limit of the calculated error with respect to the previous one.
■ Minimum gradient limit, the minimum value of the norm of the calculated gradient vector.
■ Decrease rate, the proportion by which the learning rate is decreased when the calculated error exceeds the maximum increase established.
■ Increase rate, the proportion by which the learning rate is increased when the calculated error is less than the previously calculated one.
■ Rules number (r), the number of rules of the fuzzy part; this parameter is equivalent to the number of neurons contained in layers 1 and 2 of the network.
• Design parameters are those that are adjusted in each execution of the training until the optimum is reached. These parameters are: mean, standard deviation and centroid. Their initial values are obtained from the dataset to be processed, generating a matrix of dimension r×n, where r is the rules number and n is the inputs number.
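One plausible way to derive the initial r×n design-parameter matrices from the data is sketched below. Spreading the means over each input's range, taking the sample standard deviation, and spreading the centroids over each output's range are assumptions made for illustration, not the paper's exact initialization:

```python
import numpy as np

def init_design_parameters(X, y, r):
    """Initialise means, standard deviations and output centroids from the data.

    X : (samples, n) input matrix
    y : (samples, n_out) target matrix
    r : number of rules
    Returns r x n matrices for means and standard deviations and an
    r x n_out matrix for the centroids.
    """
    lo, hi = X.min(axis=0), X.max(axis=0)
    means = np.linspace(lo, hi, r)                            # (r, n)
    sigmas = np.tile(X.std(axis=0), (r, 1))                   # (r, n)
    centroids = np.linspace(y.min(axis=0), y.max(axis=0), r)  # (r, n_out)
    return means, sigmas, centroids
```

`np.linspace` accepts array-valued endpoints, so each column of `means` spans the range of the corresponding input variable.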
• Three main functional processes are identified to carry out the learning, these being:
• Initialization: In this first process, the initial change of the design parameters is calculated as represented in Eq (37); it is required because the training optimization function, Gradient Descent with Momentum and Adaptive Learning Rate, will use it.
• Training: It starts by calculating the Gradient Descent with Momentum and Adaptive Learning Rate (38). It is in this process that the learning method to be used is established, which defines how the data will be fed in to generate the gradient vector and the error calculation. In order to control the velocity and direction of the gradient descent, the following heuristics were implemented, as shown in the following pseudo-code:

1. While epoch number does not reach its defined maximum
2.     Gradient descent with momentum and adaptive learning rate is calculated
       …
5.     Current error is calculated with the change applied to the parameters
6.     If (current error / previous error) is greater than the maximum error increase parameter then
       …
10.    If current error is smaller than previous error then
       …
13.        Parameters are updated with the change calculated as in line 4, but now in permanent form
       …
15.    Previous error is replaced by current error
16.    Gradient vector is recalculated with the new parameters
End
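The surviving steps of this pseudo-code can be turned into a runnable sketch. The fragment below is a minimal interpretation, assuming SSE-style `error_fn` and `grad_fn` callables and plausible values for details the listing does not show (how rejected steps are handled, the exact rate factors). `batch_size` selects the learning method: the full sample size gives full batch, 1 gives online, and anything in between gives mini-batch:

```python
import numpy as np

def train(params, grad_fn, error_fn, X, y, batch_size, epochs,
          eta=0.01, mc=0.5, max_err_increase=1.04,
          lr_decrease=0.7, lr_increase=1.05):
    delta = np.zeros_like(params)            # initial parameter change (Eq 37)
    prev_error = error_fn(params, X, y)
    for _ in range(epochs):
        for start in range(0, len(X), batch_size):
            Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
            grad = grad_fn(params, Xb, yb)
            # tentative change with momentum and the current learning rate
            new_delta = mc * delta - eta * grad
            candidate = params + new_delta
            current_error = error_fn(candidate, X, y)
            if current_error / (prev_error + 1e-12) > max_err_increase:
                eta *= lr_decrease           # reject the step and cool down
                continue
            if current_error < prev_error:
                eta *= lr_increase           # successful step, speed up slightly
            # make the change permanent and refresh the reference error
            params, delta = candidate, new_delta
            prev_error = current_error
    return params

# illustrative usage on a toy model y = 3x (not the neuro-fuzzy network itself)
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = 3.0 * X[:, 0]

def sse_error(w, Xb, yb):
    return float(np.sum((Xb[:, 0] * w[0] - yb) ** 2))

def sse_grad(w, Xb, yb):
    return np.array([2.0 * np.sum((Xb[:, 0] * w[0] - yb) * Xb[:, 0])])

w = train(np.array([0.0]), sse_grad, sse_error, X, y, batch_size=5, epochs=300)
```

Calling the same function with `batch_size=len(X)` builds the gradient from the whole sample (full batch), while `batch_size=1` updates after every sample (online).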
The following criteria were also considered, which allow deciding the moment at which training should be interrupted: ■ When the number of epochs processed equals the training parameter defined as total epochs number.
■ When the calculated error is less than or equal to the training parameter defined as goal error.
■ When the norm of the gradient vector is less than the training parameter defined as the minimum gradient limit.
■ When the accumulated validation failures number is greater than the training parameter defined as the maximum validation failures number. For this criterion, the error is calculated on the validation data sample (using the design parameters previously adjusted in training); if the current error is greater than the previous one, a validation failure counter is incremented, and when this counter exceeds the parameter, training is interrupted.
Finally, the error function to be minimized during the training process was defined as the sum of squared errors, a validation metric that allows verifying the progress of training towards optimal convergence.
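For reference, this objective can be written in a couple of lines; the function name is illustrative:

```python
import numpy as np

def sse(y_pred, y_true):
    """Sum of squared errors between estimated and expected outputs."""
    return float(np.sum((np.asarray(y_pred) - np.asarray(y_true)) ** 2))
```

The same quantity, divided by the sample size and square-rooted, yields the MSE and RMSE metrics used later in the evaluation.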

Results and discussion
In this section, the experimentation methodology and the regression analysis of the results obtained are presented. The objective of this analysis is to carry out a comparative study of the performance of each of the exposed learning methods (full batch, online and mini-batch), executed on the proposed Mamdani based neuro-fuzzy system with center-of-sets defuzzification.
Different datasets were used to carry out experimentation, their descriptive characteristics being those found in Table 1.
The random sub-sampling validation technique was used, defining 30 experiments; for each experiment a new partition is constructed, distributed in the following way: 60% for training, 20% for validation and 20% for testing, in order to avoid overfitting and overtraining and to ensure the robustness of the neuro-fuzzy system used.
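The partitioning can be sketched in a few lines of Python; `random_subsampling_split` is an illustrative name and the index-based shuffle-and-slice scheme is an assumption about how the partitions were drawn:

```python
import numpy as np

def random_subsampling_split(n_samples, seed=None):
    """Shuffle indices and split 60/20/20 into train/validation/test,
    as done once per experiment in random sub-sampling validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(0.6 * n_samples)
    n_val = int(0.2 * n_samples)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

# one experiment's partition of a 100-sample dataset
tr_idx, val_idx, test_idx = random_subsampling_split(100, seed=0)
```

Repeating the call 30 times with different seeds reproduces the 30-experiment protocol, each run seeing a fresh partition.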
One of the critical training hyperparameters for minimizing the error function, when set to an appropriate value, is the number of rules. To choose the right value for each evaluated dataset, 30 experiments were run per dataset: a range of minimum and maximum candidate rule numbers was established; for each value in the range, training was executed and the resulting error stored; finally, the rule number associated with the lowest error in each experiment was counted, and the most frequent one was chosen as the adequate value with which to perform the complete experimentation and obtain the general results.
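The rule-number search described above amounts to taking the mode of the per-experiment winners. A minimal sketch, with hypothetical names and a dictionary-based result format:

```python
from collections import Counter

def choose_rule_number(results):
    """Pick the rule number that most often gave the lowest training error.

    results: list over experiments, each a dict mapping a candidate
    rule number to the training error it produced in that experiment.
    """
    best_per_experiment = [min(r, key=r.get) for r in results]
    return Counter(best_per_experiment).most_common(1)[0][0]

# three toy experiments over candidate rule numbers 2 and 3
experiment_errors = [{2: 0.50, 3: 0.20},
                     {2: 0.10, 3: 0.40},
                     {2: 0.30, 3: 0.25}]
chosen = choose_rule_number(experiment_errors)
```

Here rule number 3 wins two of the three experiments and would be selected for the complete experimentation.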
The metrics on which the comparative analysis was based were the correlation coefficient R, the coefficient of determination R², and the root mean square error RMSE. We start by evaluating and analyzing the results obtained from all the datasets for the first metric. As can be seen in Table 2, the learning methods that obtained the best mean and maximum results, as well as those that presented greater stability by showing a lower standard deviation, have been highlighted in bold.
As can be observed in Table 2, the highest mean correlation coefficients (R̄) compared across learning methods and processed datasets were as follows: mini-batch obtained the best R̄ in 5 of the 7 processed datasets, a representation percentage of 71%, followed by full batch with the best R̄ in 2 of 7 (29%). In no case was the mean result of the online learning method better than mini-batch or full batch; however, as can be observed in Fig 4, its differences with respect to the full batch and mini-batch learning methods were not very large.
An indicator defined as % stability was calculated as follows: for each learning method, processed dataset and executed experiment, the R results that were at or above the mean were counted (recall that the experimentation consists of 30 runs); this allows monitoring the stability of the neuro-fuzzy system. The mini-batch learning method obtained a stability of 57% (4 of 7 processed datasets), followed by the full batch learning method with 43% (3 of 7) and finally the online learning method with 29%; it should be noted that on the chemical dataset both the mini-batch and online learning methods obtained the same result. As can be seen in Fig 5, the differences between the stability percentages of the learning methods are not very large; only for the gauss3 dataset with the online learning method, and for the bodyfat dataset with the mini-batch learning method, are some minimal peaks observed with respect to the methods that obtained a greater percentage of stability; beyond these, the rest of the datasets show very close % stability values. Another important indicator is the rules number, since this hyperparameter relates to the capacity of the model to generalize a solution surface; consequently, it is considered that the smaller the rules number, the greater the ability to obtain a generalized model.
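The % stability indicator reduces to counting the runs at or above the mean. An illustrative sketch (`stability_percentage` is not a name from the paper):

```python
import numpy as np

def stability_percentage(r_values):
    """Share of runs whose correlation coefficient is at or above
    the mean of all runs, as a percentage."""
    r = np.asarray(r_values, dtype=float)
    return 100.0 * np.mean(r >= r.mean())
```

Applied to the 30 R values of one dataset and learning method, this yields the per-cell stability figures plotted in Fig 5.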
The results obtained were the following: 71% of the datasets processed under the full batch learning method were processed with fewer rules, that is, 5 of 7 datasets, followed by the mini-batch learning method with 43% (3 of 7 processed datasets) and finally the online learning method with 14% (1 of 7), as shown in Fig 6. Finally, an indicator that allows observing the neuro-fuzzy system performance with respect to computational load is the number of epochs required during the training stage to generate the model. As shown in Fig 7, the full batch and mini-batch learning methods required the most epochs; the latter trained in fewer epochs than the rest of the learning methods in 2 of 7 processed datasets (29%), while the full batch learning method at all times exceeded the number of epochs required for training. The method that finished its training in the fewest epochs was the online learning method, in 5 of 7 processed datasets (71%); however, the fact that it ended early does not mean that it achieved a better correlation coefficient or better convergence, as shown in the previous figures. In reality, the behavior of the online learning method obeyed the validation criterion that interrupted training when the error function no longer decreased.
Another metric analyzed was the coefficient of determination R², which allows us to know how well the generated model fits the data. As shown in Table 3, the mini-batch learning method obtained the best mean R² in 5 of 7 processed datasets (71%), followed by the full batch learning method with 2 of 7 (29%); and even though the online learning method did not obtain the best means on any of the processed datasets, its differences with respect to the other learning methods were not as significant. These frequencies are shown in Fig 8. For the purposes of the % stability indicator by learning method, shown in Fig 9, experiments whose R² results were at or above the mean are considered stable; so for each processed dataset and each executed experiment, the experiments with results at or above the mean were counted. On this basis, the mini-batch learning method obtained a stability percentage of 71% (5 of 7 processed datasets), followed by the full batch learning method with 29% (2 of 7) and finally the online learning method with 14% (1 of 7).
Finally, the last metric evaluated is the root mean square error (RMSE), which shows the differences between the estimated and expected models. As shown in Table 4 and Fig 10, the behavior and trends among the three learning methods are very similar; however, the learning method that achieved the greatest decrease in error based on the RMSE metric was mini-batch with 4 of 7 processed datasets (57%), while the full batch and online learning methods each did so in 2 of 7 processed datasets (29%).
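For reference, the three metrics can be computed as below; treating R² as the square of the correlation coefficient is an assumption made for this sketch:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Correlation R, coefficient of determination R^2 (here taken
    as the square of R) and RMSE between expected and estimated outputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r = np.corrcoef(y_true, y_pred)[0, 1]
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    return r, r ** 2, rmse
```

A constant offset between estimate and target leaves R at 1 while RMSE reports the offset, which is why the study tracks both.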
Hereunder, we analyze the obtained results from the perspective of the applied learning methods, to highlight some relevant characteristics of each evaluated learning method. Table 5 allows us to compare efficiency and performance based on the mean results of the chosen metrics.
As previously mentioned, these results are based on the mean obtained from all the experiments generated from the different datasets analyzed, for each applied learning method. As seen in Table 5, the general tendency of R̄ was positive, with the mini-batch learning method showing the greatest strength in its correlation and better control of its variability, as can be clearly seen in Fig 11 and Fig 12 respectively.
The models obtained with the mini-batch learning method fitted the data better, and thus better explained the variability of the estimated values with respect to the expected ones, as presented in Fig 13 and Fig 14 respectively.
Stability studies were carried out, that is, for each processed experiment it was counted how many runs obtained correlation coefficients from the mean to the maximum; the batch learning method obtained the highest percentage, but only by one percentage point with respect to mini-batch, as can be seen both in Table 5 and Fig 15. The mean rules number required for training shows that both the batch and mini-batch learning methods required fewer rules during their experimentation than the online-based learning, as can be seen in Fig 16; however, it was the latter that finished its training in the fewest epochs, an advantage with respect to batch because, as shown previously, both have very close results in the R̄ and R̄² metrics. Nevertheless, the stability of the online learning method was below both the batch and mini-batch learning methods.
In order to evaluate the goodness of fit of the model trained with the proposed neuro-fuzzy system under the different learning methods, an error minimization performance analysis was executed using the following metrics: SSE (Sum of Squared Errors), MSE (Mean Squared Error) and RMSE (Root Mean Squared Error). These were generated at two moments: first during training, where results for the aforementioned metrics were obtained for each data partition (training, validation and test); and finally, after the model was adjusted, a test was generated with the complete inputs to evaluate its goodness of fit under the same metrics.
According to Table 6 and Table 7, in 3 of the 6 datasets used in this part of the experimentation (50% of the processed datasets) the greatest error reduction with respect to the calculated mean value was achieved by the mini-batch learning method, which showed the best control in minimizing the error as well as more stability, as can be observed in the standard deviation results.
Finally, the following figures show the behavior of the SSE minimization during training and in the adjusted model test, presented by processed dataset. The first figures correspond to the synthetic curve dataset: Fig 17 shows that the batch learning method achieved the best minimization of the SSE during training, but also over a greater number of epochs, while Fig 18 shows stable behavior in the minimization of the SSE; this analysis is supported in Table 6 by the mean and standard deviation results, from which it is concluded that the mini-batch learning method obtained better results. Continuing with the gauss3 dataset, Fig 19 shows that the batch learning method achieved the best minimization of the SSE during training, but also over a greater number of epochs. This same learning method also achieved greater stability in the minimization of the SSE when tested on the adjusted model, as can be seen in Fig 20; the analysis is supported by the mean and standard deviation shown in Table 6.
Next is the body fat percentage dataset, for which Fig 21 shows that the mini-batch learning method achieved the best SSE minimization during training, and in a smaller number of epochs. This same learning method also achieved greater stability in the minimization of the SSE when tested on the adjusted model, as can be seen in Fig 22, supporting this conclusion through the mean and standard deviation shown in Table 6.
The next dataset is the chemical sensor dataset, for which the learning method that achieved the best minimization of the SSE, both in training and in the test on the adjusted model, was online, as can be seen in Fig 23 and Fig 24 respectively, supported by the mean and standard deviation results shown in Table 7.
For the engine behavior dataset, the learning method that achieved the best minimization of the SSE during training and also during the test on the adjusted model was mini-batch, as can be seen in Fig 25, Fig 26 and Fig 27 respectively, supported by the mean and standard deviation results shown in Table 7.
Finally, for the abalone shell rings dataset, the learning method that achieved the best minimization of the SSE during training was batch, and during the test on the adjusted model it was online, as can be seen in Fig 28 and Fig 29 respectively, supported by the mean and standard deviation results shown in Table 7.

Conclusions
This paper establishes the basis for the implementation of the full batch, online and mini-batch learning methods, both theoretically and practically; these methods are oriented to the processing of data according to the size or volume of the data sample and the way the gradient vector is built and the parameters are adjusted or updated. Because existing neuro-fuzzy systems can only be trained under the full batch learning method, it was necessary to implement a Mamdani based neuro-fuzzy system with center-of-sets defuzzification with the flexibility to work with any of the three learning methods during the training stage.
The main contribution of this paper, as a first approach, is to offer detailed theoretical and practical procedures for the three learning methods on a neuro-fuzzy system, with the objective of better understanding their functionality, performance, and behavior under different contexts when a model is built. A variety of synthetic and real datasets with small and medium volume were used to carry out the experimentation.
The results obtained at a general level, that is, for each learning method evaluated, were based on the mean of all the experiments generated from the different analyzed datasets through the regression models built with the proposed neuro-fuzzy system. The general trend of R̄ was positive; the mini-batch learning method showed the greatest strength in its correlation and the best control of its variability, with a mean of 0.8268 and a coefficient of variation of 14.12% respectively, followed by the batch learning method with R̄ = 0.7708 and finally the online learning method with R̄ = 0.7520, both with very close coefficients of variation (a difference of 0.58%, giving the advantage to the batch learning method). A similar behavior occurred with respect to the coefficient of determination, where the mini-batch learning method obtained R̄² = 0.7444 with a C.V. of 6.50%, followed by the batch learning method with R̄² = 0.6906 and finally the online learning method with R̄² = 0.6456, with C.V. of 14.61% and 15.51% respectively. Finally, it is observed that the most stable learning methods in their experimental executions were the batch and mini-batch learning methods, with the batch learning method having the advantage, but only by 1% over mini-batch; it is also observed that these two learning methods required the fewest rules, while the mini-batch and online learning methods finished their training in fewer epochs.
We consider that this study is relevant due to the growing need for tools that can handle and analyze large volumes of data while allowing adequate management of computational resources. The mini-batch learning method stands out as a very good alternative, since it can be executed in distributed environments because it is highly parallelizable, and it attenuates the difficulties that the batch and online learning methods present individually. As subsequent work, it is necessary to look for other optimization methods that lead to better results with respect to the minimization of the error function, and also to scale the approach to deep learning and big data environments, where the treatment of high volumes of data is required.

Supporting information

S1 Code. (ZIP)