Identification of Relevant Phytochemical Constituents for Characterization and Authentication of Tomatoes by General Linear Model Linked to Automatic Interaction Detection (GLM-AID) and Artificial Neural Network Models (ANNs)

There are a large number of tomato cultivars with a wide range of morphological, chemical, nutritional and sensorial characteristics. Many factors are known to affect the nutrient content of tomato cultivars. A complete understanding of the effect of these factors would require an exhaustive experimental design, a multidisciplinary scientific approach and a suitable statistical method. Some multivariate analytical techniques such as Principal Component Analysis (PCA) or Factor Analysis (FA) have been widely applied to search for patterns in the behaviour of the variables and to reduce the dimensionality of a data set by means of a new set of uncorrelated latent variables. However, in some cases it is not useful to replace the original variables with these latent variables. In this study, the Automatic Interaction Detection (AID) algorithm and Artificial Neural Network (ANN) models were applied as alternatives to PCA, FA and other multivariate analytical techniques in order to identify the relevant phytochemical constituents for the characterization and authentication of tomatoes. To test the feasibility of the AID algorithm and ANN models for this purpose, both methods were applied to a data set of twenty-five chemical parameters analysed in 167 tomato samples from Tenerife (Spain). Each tomato sample was defined by three factors: cultivar, agricultural practice and harvest date. The tree structure of the General Linear Model linked to AID (GLM-AID) was organized into 3 levels according to the number of factors. p-Coumaric acid was the compound that allowed the tomato samples to be distinguished according to harvest date. More than one chemical parameter was necessary to distinguish among the agricultural practices and among the tomato cultivars. Several ANN models, with 25 and 10 input variables, were developed for the prediction of cultivar, agricultural practice and harvest date. Finally, the models with 10 input variables were chosen, with a goodness of fit between 44 and 100%. The lowest fits were for the cultivar classification; this low percentage suggests that other kinds of chemical parameters should be used to identify tomato cultivars.


Introduction
Wild tomatoes are native to western South America. The generic status of wild tomatoes within the family Solanaceae has been a matter of controversy since the eighteenth century. In 1753 Linnaeus classified tomatoes in the genus Solanum, while Miller, a contemporary of Linnaeus, classified them in the genus Lycopersicon. At present, tomato is classified as Solanum lycopersicum cv Mill. There are a large number of tomato cultivars with a wide range of morphological, chemical, nutritional and sensorial characteristics [1].
Tomato is one of the most widely consumed fresh vegetables in the industrialized world. It is also widely used by the food industries as raw material for the production of purees, ketchup and other products. Tomato is considered as a functional food due to its special composition of bioactive compounds, as it is a good source of minerals, fibre, vitamins and antioxidants such as lycopene. Tomato is also the most common vegetable in the Mediterranean diet, a diet known to have health benefits, especially to avoid the development of chronic degenerative diseases [2].
However, many factors are known to affect the nutrient content of tomatoes, such as cultivar, climate, geography, soil and water geochemistry and agricultural practices [3]. This explains the quite large number of studies aiming to evaluate and improve the quality of tomato fruit. The obstacle has been, however, that the interactions between genetic properties, environmental and agricultural practices are complicated. A complete understanding of the effect of these factors would require not just an exhaustive experimental design, but also a multidisciplinary scientific approach and a suitable statistical method to search for patterns in the behaviour of the variables investigated [4].
Although sensory evaluation is the best method to characterize tomato fruit, these tests are expensive and time-consuming and require a panel with a considerable number of experts; moreover, the panellists often constitute the first source of variation. Thus, when a large number of samples are to be analysed, this type of evaluation can be replaced by multivariate analytical techniques that uncover hidden relationships, correlations, trends and associations in the data [5].
However, there are considerable difficulties in analysing and interpreting this kind of data, so it is necessary to apply statistical tools that can reveal behaviour patterns. Some multivariate analytical techniques such as Principal Component Analysis (PCA), Factor Analysis (FA), Linear Discriminant Analysis (LDA) and Cluster Analysis (CA) have been widely applied to this problem. PCA reduces the dimensionality of a data set with a large number of inter-correlated variables while retaining as much as possible of the information present in the original data. The reduction is achieved through a linear transformation to a new set of uncorrelated latent variables that express most of the variation of the original variables. Like PCA, FA transforms an n-dimensional data structure into another with considerably fewer dimensions, but it gives the researcher the opportunity to select among uncorrelated factors [6].
CA is one of the most useful statistical tools used in chemometrics for discovering groups and localizing (identifying) interesting distributions and patterns in the underlying information contained in the data. LDA is based on the extraction of discriminant functions of the independent variables by means of a qualitative dependent variable and several quantitative independent variables. The method supplies a number of linear discriminant functions to provide a method for predicting the group into which a new case will most likely fall [7].
Although some of these methods are clearly better than others under a given set of circumstances, there is no single "best" approach, and in some cases it is not useful to replace the original variables with these linear combinations. In multienvironment trials, biplot analysis is increasingly used for this kind of data. The biplot methods proposed by Gabriel [8] provide a graphical display of multivariate data in two dimensions. This is done by representing the variables as vectors in the same plane and the correlations between them as the angles between those vectors. Tree structures and other models based on Artificial Intelligence (AI), such as Artificial Neural Networks (ANNs), must also be taken into account.
Tree structures are used for classification, least squares regression and the analysis of censored survival data [9]. The first tree-structured method was Automatic Interaction Detection (AID). The AID algorithm partitions a group into subgroups according to an independent variable, called the predictor, linked to a dependent or response variable that is able to distinguish the subgroups within the original group. In each subgroup, there are new individuals that can be affected by other variables. In this way, AID automatically detects the interactions among individuals and variables [10].
ANNs have been applied in many fields, for example to predict river flow in order to prevent floods [11], to predict the average monthly wind speed at one station from neighbouring stations [12], to model and control nonlinear systems [13], to predict traffic flows in an urban street [14], to predict the critical micelle concentration (CMC) of different surfactants [15], and even in predictive systems for the stock market [16]. Neural networks are a modelling method that imitates the human brain [17]. The basic processing unit of an artificial neural network (the neuron) is based on the biological cell; thus an ANN has a large number of interconnected neurons. In this simulation method, the database is divided into training and validation data. The large number of connections makes the ANN capable of finding the most important relationships between the variables (key relationships) in the training database, which is used to generate the model, and of then applying that knowledge to previously unseen cases, known as validation data [18]. Thus, ANNs do not look for the formulation of a physical or chemical law in the training database used for model implementation; rather, they look for a relationship among the data that yields a result close to the expected value. This modelling method is particularly useful for complex problems in which many variables are involved and our knowledge of the variability of these variables and their interactions is limited.
In this paper, two objectives were considered. The first was to identify the relevant phytochemical constituents responsible for the main differences among tomato samples by means of the Automatic Interaction Detection method; the second was to develop an Artificial Intelligence authentication model that could accurately predict the cultivar, the production type and the harvest date.
The reasons for selecting the AID algorithm as a method to identify the relevant phytochemical constituents are as follows: the AID algorithm uses the original variables instead of the latent variables used in PCA and FA, and tree structures are easy to understand and interpret. A correct identification of the relevant phytochemical parameters makes it possible to understand the relationship between environment and chemical composition. This understanding could help genetic improvement programs.
Some authors have pointed out that CA and LDA frequently fail to differentiate food samples because linear functions are probably not appropriate to describe this kind of data. There are many external factors that can influence the chemical composition [7]. Thus, the use of ANNs as a mathematical tool is feasible because ANNs are based on non-linear functions. An adequate model makes it possible to develop a method to detect food fraud.

Tomato sampling and sample preparation
Samples of 1 kg were collected during four different harvesting periods (October, December, February and April), but at the same degree of ripeness according to the Dutch "kleurstadia" tomato-colour scale. They belonged to five cultivars (Dorothy, Boludo, Dominique, Thomas and Dunkan) grown under three farming practices: conventional, organic, and no-soil on coconut fibre substrate. In the trials, the UNE 155102 standard for the controlled production of tomatoes and several European regulations on organic production and on maximum residue levels of pesticides were taken into account. For the no-soil tomatoes, the nutrient solution consisted of 12 mM N-NO3-, 0.5 mM N-NH

Sample preparation method
Three tomatoes were randomly selected from each tomato sample for analysis. The samples were hand-rinsed with ultra-pure water, shaken to remove any excess water, and gently blotted with a paper towel. They were then mixed and homogenized to a purée. A fraction of this purée was desiccated, homogenized again, and stored in a polyethylene tube (10 mL) at room temperature until assay for metals, protein and total fibre. The rest was stored in a polyethylene tube (15 mL) at -80°C for the measurement of the other chemical parameters: fructose, glucose, organic acids (citric, malic, oxalic, pyruvic, fumaric and ascorbic), lycopene, phenolic compounds and hydroxycinnamic acids (caffeic, p-coumaric, chlorogenic and ferulic). Data are expressed as % or as quantities per fresh weight.

Analytical parameters
The mineral concentration was determined by atomic absorption spectrophotometry following nitric acid digestion, except for phosphorus, which was measured by a colorimetric method using a vanadate-molybdate reagent. The nitrogen concentration was determined by the Kjeldahl method, and the protein concentration was then calculated using a nitrogen factor of 6.25 [20]. The ascorbic acid and total fibre contents were determined using approved methods described by AOAC [20]. Analytical HPLC methods were used to measure the concentrations of sugars (glucose and fructose), organic acids (citric, malic, oxalic, pyruvic and fumaric acids) and hydroxycinnamic acids (chlorogenic, caffeic, p-coumaric, and ferulic acids). The analytical HPLC methods used were the same as those previously described by Hernández et al. [19]. The chemical parameters were determined in triplicate for each sample. The supplementary files show the mean values according to harvest date (S1 Table), agricultural practice (S2 Table) and tomato cultivar (S3 Table).

Statistics
Automatic Interaction Detection (AID) analysis. According to Santesmeses [21], AID analysis is a sequentially repeated one-way ANOVA. In each step, the algorithm identifies the variable that best divides the initial group. The partition among categories must maximize the inter-group variance and minimize the intra-group variance. However, the data in this study are affected by three factors (harvest date, agricultural practice and cultivar), so this concept must be adjusted.
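The idea of a split that maximizes inter-group variance relative to intra-group variance can be illustrated with a short sketch. It is not the procedure used in this study (which relies on SPSS and on p-values from the GLM); it only computes the one-way ANOVA F ratio for each candidate factor and picks the largest. The function names, the factor labels and the six data values are hypothetical, and comparing raw F ratios is a simplification that is only fair when the candidate factors have the same number of categories.

```python
from statistics import mean

def f_statistic(values, labels):
    """One-way ANOVA F ratio: between-group over within-group mean square."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def best_predictor(values, factors):
    """Return the name of the factor whose categories give the largest F ratio."""
    return max(factors, key=lambda name: f_statistic(values, factors[name]))

# Hypothetical response values (e.g. one standardized chemical parameter).
values = [1.0, 1.2, 0.9, 2.0, 2.1, 1.9]
# Hypothetical factor labels for the same six samples.
factors = {
    "harvest_date": ["Oct", "Oct", "Oct", "Apr", "Apr", "Apr"],
    "practice": ["org", "conv", "org", "conv", "org", "conv"],
}
chosen = best_predictor(values, factors)
```

In this toy data set the values separate cleanly by harvest date, so that factor is selected as the first split.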
In principle, ANOVA can be used with any number of factors. With one factor it is called one-way ANOVA, and with two or more factors it is called factorial ANOVA. The linear model that represents the structure of the experiment is called the General Linear Model (GLM). This model contains a term for the baseline, a term for each individual factor or main effect, a term for each interaction, and a term for error. GLM is used to find out how the average value of the dependent variable differs across the categories being compared [22].
The requirements for applying GLM are the same as for ANOVA: normal distribution and homoscedasticity. Both requirements were achieved in this study by means of standardization according to Eq 1:

$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{\sigma_j} \qquad (1)$$

where $Z_{ij}$ denotes the standardized value, $X_{ij}$ is each data point within the variable $j$, $\bar{X}_j$ is the average of the variable $j$ and $\sigma_j$ is the standard deviation of the variable $j$. Eq 1 transforms the data set to have zero mean and unit variance. This transformation also gives each variable equal weight in the statistical analysis [7].
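As a minimal sketch of this standardization, the function below rescales one variable (one column of the data set) to zero mean and unit variance. The function name and the example values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev

def standardize(column):
    """Subtract the column mean and divide by the column standard deviation."""
    column_mean = mean(column)
    column_sd = stdev(column)  # sample standard deviation (assumption)
    return [(x - column_mean) / column_sd for x in column]

# Hypothetical measurements of a single chemical parameter.
z_scores = standardize([1.0, 2.0, 3.0])
```

After the transformation every variable is on the same scale, so a parameter measured in mg/kg does not dominate one measured in %.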
The GLM equation adapted to our case was (Eq 2):

$$y_{ijk} = \mu + c_i + p_j + h_k + \varepsilon_{ijk} \qquad (2)$$

where $y_{ijk}$ is the mean value of the response variable for the $i$th tomato cultivar (Dorothy, Boludo, Dominique, Thomas and Dunkan), the $j$th production system (conventional, organic and no-soil) and the $k$th harvest date (October, December, February, April), $\mu$ is the baseline, $c_i$ is the cultivar effect, $p_j$ is the production system effect, $h_k$ is the harvest date effect and $\varepsilon_{ijk}$ is the error associated with the model.
The following criteria were considered in selecting a predictor that can divide each group and produce a tree structure: i) the response variable (the chemical parameter) must depend on one predictor to be considered stable and suitable for organizing the results as a tree structure, ii) the p-value of the adjusted model must be the lowest, and iii) in case of a tie, the explained variance of the model must be the highest. Once the predictor was selected, a Bonferroni post hoc test [23] was performed to identify the subgroups. This process was repeated sequentially until no significant predictor was observed. All the subgroups or nodes obtained were arranged as a tree structure. The GLM-AID tree structure is based on the main significant differences among the categories of one factor for one attribute, in our case a chemical parameter.
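To illustrate what the Bonferroni post hoc step implies, the sketch below enumerates the pairwise comparisons among the categories of one factor and derives the per-comparison significance threshold. The function name is hypothetical and the family-wise level of 0.05 is an assumption; the text does not state the level that was used.

```python
from itertools import combinations

def bonferroni_pairs(categories, alpha=0.05):
    """List all pairwise comparisons and the Bonferroni-adjusted threshold."""
    pairs = list(combinations(categories, 2))
    return pairs, alpha / len(pairs)

# The four harvest dates of this study give six pairwise comparisons.
pairs, threshold = bonferroni_pairs(["October", "December", "February", "April"])
```

With four harvest dates, each pairwise difference would have to reach a p-value below 0.05/6 to be declared significant under this assumed level.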
These calculations were performed using the SPSS version 21.0 for Windows software package.

Artificial Neural Network
The development of a neural network is based on the summation of the operations in each of the neurons that compose the system. The information is entered into the system by a vector $X_i = (X_1, X_2, \ldots, X_n)$ (Fig 1).
All information in the input vector is processed by a mathematical function that transfers it to the first intermediate layer. The role of the propagation function is to add all the input data and generate a single response (Fig 1, Equation A). In this equation, $N$ is the number of neurons in the first layer of the neural network, called the input layer; $w_{ni}$ is the weight (which indicates how important the connection is) between neuron $n$ in the input layer and neuron $i$ in the intermediate layer; and $b_i$ is the bias associated with neuron $i$ in the intermediate layer (Eq 3). The values obtained by the propagation function are used by another mathematical function, called the activation function (Eq 4), to provide an output value ($y_i$) as a function of the internal state [24] once a threshold value is exceeded [25]. Different activation functions can be used; in this work the sigmoidal function was used (Fig 1, Equation B).
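The propagation and activation steps can be sketched as follows. Since the equations themselves are given in Fig 1 and are not reproduced in the text, the sketch assumes the usual form: a weighted sum of the inputs plus a bias, followed by the logistic sigmoid. The function names and the numerical values are hypothetical.

```python
import math

def sigmoid(s):
    """Logistic sigmoid activation (assumed form of the sigmoidal function)."""
    return 1.0 / (1.0 + math.exp(-s))

def layer_output(x, weights, biases):
    """Outputs of an intermediate layer.

    x          -- input vector with N components
    weights    -- weights[n][i] links input neuron n to intermediate neuron i
    biases     -- biases[i] is the bias of intermediate neuron i
    """
    outputs = []
    for i, bias in enumerate(biases):
        # Propagation: weighted sum of all inputs plus the neuron bias.
        s = sum(weights[n][i] * x[n] for n in range(len(x))) + bias
        # Activation: squash the internal state into the interval (0, 1).
        outputs.append(sigmoid(s))
    return outputs

# Two inputs feeding one intermediate neuron; zero inputs and zero bias.
y = layer_output([0.0, 0.0], [[1.0], [1.0]], [0.0])
```

With all inputs and the bias equal to zero the internal state is zero, and the sigmoid returns 0.5, the midpoint of its range.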
All information entered in the neural network is propagated to the output layer, where an output value is generated ($y_0$). This value is compared with the experimental value ($d_0$), and the error produced by the neural network (Eq 5) can be calculated.
Implementation procedure and choice of the best Artificial Neural Networks. The first step in developing the different neural network models is to select the data used to train the system (training data, training phase) and the data used to check the predictive power of the different models (validation data, validation phase). As discussed in the previous section, 167 tomato samples were analysed; 151 were chosen for the training phase and 16 were reserved for the validation phase, to check the predictive power of the different neural networks implemented.
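A split of this size can be sketched as below. The text does not say how the 16 validation samples were chosen, so the random shuffle, the fixed seed and the function name are assumptions made only for illustration.

```python
import random

def split_samples(sample_ids, n_validation, seed=0):
    """Shuffle the sample identifiers and hold out n_validation of them."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return shuffled[n_validation:], shuffled[:n_validation]

# 167 samples in total: 151 for training, 16 held out for validation.
training, validation = split_samples(range(167), 16)
```

Keeping the validation samples completely out of the training step is what allows them to stand in for future, unseen cases.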
In this work we implemented a large number of neural networks to predict cultivar, production type and harvest date. Once all the neural networks had been developed, their predictive power had to be checked in both the training and the validation phases. Traditionally, predictive power is calculated by means of statistical parameters such as i) the Root Mean Square Error (RMSE) between predicted and real values, ii) the Individual Percentage Deviations (IPD) or iii) the Average Percentage Deviations (APD). These types of calculations are made when the output variables are continuous; in this paper, however, we are working with discrete variables that can only take specific values. For this reason we calculated the predictive power of the different neural networks as a function of the percentage of success (match between the predicted and the real values of harvest date, production type and cultivar), namely the Average Percentage of Success (APS) (Eq 6).
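A minimal sketch of this measure is given below, assuming that the APS is simply the percentage of samples whose predicted category matches the real one; the exact expression of Eq 6 is not reproduced in the text. The function name and the example labels are hypothetical.

```python
def average_percentage_of_success(predicted, observed):
    """Percentage of samples whose predicted class equals the observed class."""
    hits = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * hits / len(observed)

# Hypothetical validation set of 16 samples with 14 correct predictions.
observed = ["organic"] * 8 + ["conventional"] * 8
predicted = ["organic"] * 7 + ["no-soil"] + ["conventional"] * 7 + ["organic"]
aps = average_percentage_of_success(predicted, observed)
```

With 16 validation samples the APS can only move in steps of 6.25 points, so 14 correct predictions give 87.5%.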
For the implementation of the different ANN models we used EasyNN plus, Version 14.0d, by Neural Planner Software. The program was installed on two different machines: i) a personal server with an Intel Core i7 processor and 8 GB of RAM, and ii) a personal server with an Intel Core i5 processor and 4 GB of RAM, both running virtual machines.
Notation for the developed models. To clearly identify the different ANN topologies developed, we used the following notation, which gives the number of neurons in each layer of the neural network model:
N input layer - N intermediate layer - N output layer
where N input layer and N output layer represent the number of neurons in the input and output layers, that is, the neurons that receive information from the outside and the neurons that generate the output in response to the input information, and N intermediate layer corresponds to the number of neurons in the intermediate layer. For example, the topology 25-41-3 has 25 input neurons, 41 intermediate neurons and 3 output neurons.
Input variable reduction for simple models (ANN 3 and ANN 4). To reduce personnel, material and analytical costs, and time, we reduced the number of input variables of the first models (ANN 1 and ANN 2, with 25 input variables) on the basis of the importance of each input variable. The importance was determined as the sum of the absolute values of all the weights between the input neuron and all the intermediate neurons. The 10 input variables selected for the simple models ANN 3 and ANN 4 were glucose, lycopene, P, Na, K, Mg, Fe, chlorogenic acid, caffeic acid and p-coumaric acid. The results of the ANN models are provided as supplementary material (S4 Table).
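The importance criterion described above can be sketched in a few lines: for each input neuron, the absolute values of its weights to every intermediate neuron are summed, and the inputs are ranked by that score. The function names, the three variable names and the weight values are hypothetical and do not come from the trained networks of this study.

```python
def input_importance(weights):
    """Sum of absolute weights from each input neuron to all intermediate neurons.

    weights[n][i] is the weight between input neuron n and intermediate neuron i.
    """
    return [sum(abs(w) for w in row) for row in weights]

def top_inputs(names, weights, k):
    """Names of the k input variables with the largest importance."""
    scores = input_importance(weights)
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical network with three inputs and two intermediate neurons.
names = ["glucose", "fructose", "lycopene"]
weights = [[0.5, -1.0], [0.1, 0.1], [-2.0, 0.3]]
selected = top_inputs(names, weights, 2)
```

In this toy example lycopene and glucose carry the largest total weight and would be retained, while fructose would be dropped.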

General composition of tomatoes and influential factors
The average chemical composition of the tomato samples and the factors that significantly influence this composition are shown in Table 1. The mean contents of fructose (1.28±0.41%) and glucose (1.29±0.40%) were similar and within the range of concentrations found by Cebolla-Cornejo et al. [5]. According to the GLM results, the mean glucose content depends on more factors than the fructose content, which depends only on agricultural practice and date of harvest. In comparison, Gautier et al. [26] observed that light and temperature had no significant effects on final sugar content. The mean content of total fibre in the analysed tomatoes was 1.81±0.56%, and it depends on all factors and their interactions (Table 1). Claye et al. [27] observed that tomato fibre was composed of 87% insoluble fibre and 13% soluble fibre. The mean protein content obtained (0.80±0.15) is significantly influenced by harvest date and production system.
The mean contents of phenolic compounds (20.41±4.37 mg gallic acid/100 g) and lycopene (2.31±0.72 mg/100 g) were similar to the concentrations reported by Slimestad and Verheul [28]. The mean content of lycopene is most significantly influenced by the collection date, while the phenolic compounds vary according to the harvest date and the interaction between cultivar and harvest date. These results agree with those of Tedeschi et al. [29].
The individual mineral contents of the tomato samples (Table 1) were similar to the concentrations found by Tedeschi et al. [29] and Max et al. [30], except for Na (92.4±63.4 mg/kg), which was higher. The high Na concentration in Tenerife soil, which is derived from alkaline volcanic rocks, the high salinity of the irrigation water, and the influence of the marine aerosol could explain the relatively high concentration of Na in the tomatoes [31]. Table 1 shows that the mean content of Na varies mainly with the production system (the lowest p-value). Our results for Ca, Fe, Zn, Mn and Cu were close to those found by Gundersen et al. [32]. The main significant differences (the lowest p-values) are linked to the agricultural practices and collection date, except for Fe, which is more significantly influenced by the tomato cultivar and by the interaction between the production system and the harvest date. Regarding the role of minerals in determining tomato fruit quality, Fanasca et al. [33] suggest that new trials are required to understand the interaction between each element and tomato fruit quality.
Citric acid (354±121 mg/100 g) was the major organic acid, followed by malic (78.3±40.2 mg/100 g) and oxalic (25.6±9.3 mg/100 g) acids. Citric acid is the main agent responsible for the acidity of tomatoes, and its concentration was similar to the values found by Cebolla-Cornejo et al. [5]. However, the concentration of malic acid in our tomato samples was lower than the values found by these authors. Malic acid plays a key role as an indicator of the freshness of fruits. Oxalic acid forms insoluble salts with calcium and other essential divalent cations, decreasing the bioavailability of these nutrients [34]. Oxalic acid and fumaric acid depend on individual factors, agricultural practices and collection date, respectively, while the mean value of ascorbic acid is more significantly influenced by the interaction of agro-climatic factors. The mean concentration of ascorbic acid (15.3±4.48 mg/100 g) obtained in this research was very similar to the mean concentrations found by Thybo et al. [35]. The mean content of the rest of the organic acids is more significantly influenced by various factors and their interactions (Table 1).

Characterization of the tomato samples
The purpose of GLM-AID is to identify and highlight the main significant differences in order to identify the most relevant phytochemical constituents and characterize the tomato samples. Its results were organized as a tree structure with 3 levels according to the number of factors (harvest date, agricultural practice and cultivar). In some nodes, new divisions were not possible because: i) more than one chemical parameter was necessary to distinguish among cases, ii) there were no significant differences among the chemical parameters of the samples inside the node, or iii) the node contained only a single sample, so mean values could not be compared.
The first level of the hierarchy was the harvest date. The first chemical parameter selected was p-coumaric acid, which divides the samples according to each harvest date. More difficulties were observed in identifying a pattern between production methods and chemical composition (2nd level of the tree structure). The tomato samples collected in October (Fig 2) had the significantly highest concentrations of p-coumaric acid (node 1). Agricultural practice had a significant influence on glucose content. Conventionally produced tomatoes had the highest values of glucose (node 7, 1.21±0.12) with respect to the other two practices. The malic acid content makes it possible to distinguish the conventional tomato cultivars: Thomas (node 8, 24.8±10.4 mg/100 g), Dorothy (node 9, 42.8±12 mg/100 g) and Boludo (node 10, 59.8±9.7 mg/100 g). Organic and no-soil tomatoes had the lowest glucose content (node 2, 0.85±0.18%). The Mn content makes it possible to distinguish between these two practices: 0.62±1.2 mg/kg for organic tomatoes and 0.86±0.2 mg/kg for no-soil tomatoes. Organically produced tomato cultivars were distinguished according to the fructose content (nodes 4 and 5).
Tomato samples collected in December (Fig 2) had a p-coumaric acid content close to the average (node 11). The contents of P and Cu make it possible to distinguish among agricultural practices. No-soil tomato samples had the significantly highest value of P (node 17, 300±56 mg/kg) in relation to the other two systems (node 12, 226±40 mg/kg). Organic tomatoes had the highest content of Cu (node 16, 0.33±0.13). The Dorothy tomato cultivar had the significantly highest content of ferulic acid (node 15, 0.13±0.03 mg/100 g), while the rest of the tomato cultivars showed a similar chemical composition (node 14).
Tomato samples collected in February (Fig 2) had the significantly lowest p-coumaric acid content (node 18). Within this group, no-soil tomatoes had the lowest values of Na (node 19, 82.1±53.3 mg/kg) and Mg (node 20, 107±14 mg/100 g), and the tomato cultivars belonging to this subgroup were differentiated by fructose (nodes 21 and 22). In contrast, organic tomatoes had the highest content of Mg (node 23, 125±9 mg/kg). Tomato cultivars produced by organic practices can be distinguished from one another by the content of ferulic acid (nodes 24-26).
The conventional tomato samples had the significantly highest content of Na (node 27, 146±72 mg/kg). Two subgroups can be observed. The first is formed by the Dominique, Thomas and Dorothy tomato cultivars, with the lowest content of malic acid (node 28, 62.6±21.4 mg/100 g), while the second is formed by the Boludo and Dunkan tomato cultivars, which had the highest content of this organic acid (node 32, 152±56 mg/100 g). The phenol content makes it possible to distinguish between Dominique, Thomas and Dorothy (nodes 29-31), while the content of K discriminates between the Boludo and Dunkan tomato cultivars. p-Coumaric acid was not detected in tomato samples collected in April (node 35, Fig 2). A negative effect of weather conditions, especially those linked to irradiation, may be the cause, as discussed above. No-soil tomato samples collected in this month showed the significantly lowest value of Mg (node 36, 74.9±9.6 mg/100 g). The Boludo, Dorothy and Dominique tomato cultivars within this subgroup differed according to the Mn content (nodes 37 and 38). Organic and conventional tomatoes had the highest values of Mg (node 39, 112±16 mg/100 g). The difference between the two systems was the content of caffeic acid, 0.02±0.00 and 0.03±0.01 mg/100 g respectively. The organic tomato cultivars (nodes 41, 44 and 45) can be distinguished by the content of pyruvic acid. In contrast, the conventionally produced tomato cultivars were significantly different in Ca, total fibre and Na content (nodes 47-52).
In order to distinguish tomato samples according to the three agricultural practices, the most significant chemical parameters were glucose, Mn, P, Cu, Na, Mg and caffeic acid. Tomatoes conventionally obtained had the highest mean values of glucose, Na and caffeic acid, while the organic tomatoes had higher concentrations of Cu. No-soil tomato samples had the lowest concentrations of these chemical parameters except for P.
In a previous study HJ-Biplot was used [21]. HJ-Biplot and GLM-AID offer complementary information. GLM-AID identifies the main chemical parameters responsible for the differences among samples, while HJ-Biplot [38] simultaneously revealed the following information: the correlation among chemical parameters, the similarity among tomato cultivar samples and the relationship between tomato samples and chemical parameters. The biplot also showed that when tomatoes have the same degree of ripeness: 1) the climatic conditions may have been more relevant to distinguishing these two groups than the kind of agricultural practice, 2) sugars, organic acids, protein, Mg, and Na were strongly correlated, with similar profiles for the conventional and organic samples, 3) with good management practices the type of agricultural practice had little effect on the chemical composition, making it difficult to distinguish between conventional and organic tomatoes, 4) antioxidant compounds are mainly contained in conventional and organic tomato samples harvested in October and 5) unlike the previous case, the kind of agricultural practice seems to have more effect on mineral concentration than the climatic conditions.

Selection of nutritional markers to authenticate the tomato samples by artificial intelligence models
As mentioned in the previous section, the development of an ANN requires the implementation of many models, using a trial-and-error method, to obtain the best neural network model for determining the output variables. We implemented over one thousand neural networks with different numbers of input variables, with different topologies (varying the number of neurons in the input and output layers) or with different numbers of training cycles (to avoid overfitting of the neural network) to determine the best Artificial Neural Networks to predict the cultivar, production type and harvest date of a tomato. Four predictive models were evaluated:
• Neural network with three outputs and all input variables (ANN 1).
• Neural network with a single output and all input variables (ANN 2).
• Neural network with three outputs and a selected number of input variables (ANN 3).
• Neural network with a single output and a selected number of input variables (ANN 4).

ANNs with all variables in input layer, 25 variables
The first neural network models implemented (ANN 1 and ANN 2) were developed with all the variables available for the tomato samples (Table 1). Two types of prediction models were studied: first, the ANN 1 group with three output variables, that is, a neural network that simultaneously predicts the cultivar, the production type and the harvest date; and second, the ANN 2 group with three individual neural networks, one to predict each variable. Table 2 shows the APS for the best neural networks implemented for each type of output variable selection, that is, for the ANN 1 and ANN 2 groups implemented with 25 input variables.
As expected, the fits for individual Artificial Neural Network (ANN 2 ) were better than the fits for three output variables group (ANN 1 ). In Table 2 we can see that all ANN 2 produce better results for the training phase, except for the Harvest date prediction where the ANN 1 corresponding to the topologies 25-41-3 and 25-50-3 presented a better adjustment (100%).
The goal of this study is to implement a tool, based on Artificial Neural Networks, to predict the tomato cultivar, production type and harvest date of tomatoes. The choice of the best ANN should not be based on the higher APS in the training phase but in the best APS for validation phase. In the validation phase, all the output values are assumed unknown. Those unknown values are found with the program and later compared with the measured values. This procedure gives us a good idea of the prediction power of the different models to future cases. As we can see in Table 2, the individual model ANN 2 produces the best prediction results. In this sense, we can see that the individual networks to predict production type and harvest date provide good results, 87.5% and 100% respectively.
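The selection rule described above can be sketched in a few lines. This is an illustrative example only: the function names, the class labels and the APS values of the two candidate models are hypothetical, and the APS is assumed to be the plain percentage of cases whose predicted class equals the measured class.

```python
def aps(measured, predicted):
    """Average Percentage of Success: share of correctly predicted cases, in percent."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(m == p for m, p in zip(measured, predicted))
    return 100.0 * hits / len(measured)


def best_by_validation(results):
    """Return the model name with the highest validation APS.

    `results` maps a model name to a (training APS, validation APS) pair.
    """
    return max(results, key=lambda name: results[name][1])


# Illustrative labels: 16 validation cases with 14 correct predictions.
measured = ["C"] * 8 + ["O"] * 8
predicted = ["C"] * 8 + ["O"] * 6 + ["C"] * 2
print(aps(measured, predicted))  # 87.5

# Hypothetical APS values: model_a fits the training set better,
# but model_b predicts unseen cases better and is the one selected.
results = {"model_a": (100.0, 75.0), "model_b": (95.0, 87.5)}
print(best_by_validation(results))  # model_b
```

With a validation set of sixteen cases, each misclassified case lowers the validation APS by 6.25 percentage points, so 87.5% corresponds to two errors.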
The tomato cultivar is a different case. All the models developed show fits below those for production type and harvest date. This behaviour suggests that the 25 nutritional compounds (listed in Table 1) are not suitable for cultivar identification and that other compounds need to be included for a good prediction of the tomato cultivar.

ANNs with selected variables in input layer, 10 variables
Given the good results obtained with all available variables, new ANN models were developed in order to decrease the number of variables in the input layer. These new ANNs respond to the need to reduce personnel, material and analytical costs, and time. If ANN modelling is used to detect food fraud, those savings can be important. The ANN 3 models, with three simultaneous output variables, and the ANN 4 models, with one output variable, were developed with 10 input variables selected from all the variables available for the tomato samples (Table 2). The fits for the three-output artificial neural networks (ANN 3) are, in general, better than the fits for the individual-output neural networks (ANN 4). Table 2 shows that all ANN 4 models improved in the training phase, except for the cultivar prediction, where the three-output ANN 3 (topology 10-13-3) gave a better fit than the single-output ANN 4 (10-18-1), 82.8% and 61.6% respectively. Contrary to our expectations, the best ANNs in the validation phase for predicting the tomato cultivar and production type were the three-output ANN 3 models, 10-13-3 (43.8%) and 10-9-3 (93.8%), respectively, while the best ANN for predicting the harvest date was the individual ANN 4 10-18-1.
Table 2. Average Percentage of Success (APS) for the training, validation and average phases (mean APS for the training and validation phases together), considering all variables, for harvest date (APS H), production type (APS P) and tomato cultivar (APS C), for models with 25 input variables (ANN 1 and ANN 2) and models with 10 input variables (ANN 3 and ANN 4).
Since the aim of this study is to develop a quick, easy and economical method to detect food fraud, we propose the implementation of a prediction model based on only 10 input variables.
Individual-output models are chosen when they offer good individual results, while multiple-output models are chosen according to their average results. However, in this study we propose a model that uses, for each output variable, the network giving the best prediction for that specific output: the ANN 3 networks (10-13-3 and 10-9-3) for the cultivar and production type outputs, and the individual ANN 4 network (10-18-1) for the harvest date. Fig 3 shows the results obtained by the neural networks developed for each output variable: harvest date, production type and tomato cultivar. The figure is divided into three zones: the upper zone shows the results for the neural network that predicts the harvest date, the intermediate zone those for the production type, and the lower zone those for the cultivar. Each zone contains colored squares that represent the individual cases used in the neural network. The squares grouped in the first ten columns on the left are the cases used for the training phase, and the sixteen cases in the right-hand column are the cases used to validate the networks developed. Inside each square is the status of the variable (see legend) for each of the cases studied in this paper. Green squares mark the cases in which the neural model predicts the output variable correctly (in the training or validation phase), while red squares mark the cases predicted incorrectly. For the harvest date, the ANN 4 model (10-18-1) offers good results in the training phase (98.7% success, Table 2) and the validation phase (100% success, Table 2); in the training phase the model makes two errors (Fig 3), one for a tomato picked in December (D) and another for a tomato harvested in April (A).
Looking at the percentage of correct classification for each of the four harvest seasons, across the training and validation phases, the success for the October and February seasons is 100%, while for the December and April seasons it is 98.1% and 96.9%, respectively (Fig 3). For the production type variable, the ANN 3 prediction model (10-9-3) offers good results in the training phase (84.1% success, Table 2) and the validation phase (93.8% success, Table 2). The only classification error made in the validation phase belongs to the organic production type (O), the group with the lowest percentage of correct classification (70.7%, Fig 3) in the data set studied. Finally, the ANN 3 prediction model (10-13-3) for the cultivar output is the one with the worst results in the training (82.8%, Table 2) and validation phases (43.8%, Table 2). In general, all cultivars have a good percentage of correct classification, more than 80%, except Dominique (Dq) and Boludo (B), with 68.4% and 65.2%, respectively (Fig 3). The behaviour of the Dominique (two cases) and Dunkan (another two cases) cultivars in the validation phase is notable: the selected predictive model is not able to characterize them correctly (Fig 3).
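A per-class breakdown of the kind reported in Fig 3 can be obtained by counting hits separately for each measured class. The sketch below is only an illustration under stated assumptions: the function name is hypothetical and the ten labelled cases are invented toy data, not the samples of this study.

```python
from collections import Counter


def per_class_success(measured, predicted):
    """Percentage of correctly classified cases for each measured class."""
    if len(measured) != len(predicted):
        raise ValueError("sequences must have the same length")
    totals = Counter(measured)
    hits = Counter(m for m, p in zip(measured, predicted) if m == p)
    return {label: 100.0 * hits[label] / totals[label] for label in totals}


# Invented toy cases for four harvest months; not the study's data.
measured = ["Oct", "Oct", "Dec", "Dec", "Dec", "Dec", "Feb", "Feb", "Apr", "Apr"]
predicted = ["Oct", "Oct", "Dec", "Dec", "Dec", "Feb", "Feb", "Feb", "Apr", "Dec"]

for label, pct in per_class_success(measured, predicted).items():
    print(f"{label}: {pct:.1f}%")
```

Reporting success per class in this way exposes weak groups that an overall percentage can hide, as happens with the organic samples and with the Dominique and Boludo cultivars.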

Comparison of the models
GLM-AID allows a complete separation of tomato samples according to the harvest date (100%) in the first level of the tree structure. This result agrees with the results for harvest date obtained with the neural model ANN 4 (10-18-1): 98.7% and 100% APS for the training and validation phases, 98.8% on average. At each collection date, the GLM-AID method classified the tomatoes correctly according to their production system. For the production system, the prediction model ANN 3 (10-9-3) also provides a good fit, 84.1% for the training phase and 93.8% for the validation phase. However, with the GLM-AID method a complete separation of the tomato samples by cultivar was not possible. Similarly, the prediction model ANN 3 (10-13-3) gives 82.8% for the training phase and 43.8% for the validation phase. The results obtained can be extrapolated to other harvest seasons; however, we recommend improving the present models with data from new harvest seasons and other growing areas to improve their predictions before they are used around the world. The models could also be improved using local variables, such as climate or geographical variables, that can affect the nutrient content of tomato cultivars.
Regarding the tomato cultivars, they showed a low percentage of correct classification. Some authors [39,40] suggest a simultaneous analysis of genetic diversity with various molecular markers in order to estimate the levels of genetic variability more efficiently, owing to the correlation between the similarity matrices obtained from molecular markers and from morphoagronomic data. Similarly, our proposed model can be improved by other methods, which would lend greater reliability to the results obtained.

Conclusions
The two methods developed, GLM-AID and ANN, selected different nutritional parameters. The AID analysis does not provide a function that measures the relationship between the dependent variable and the independent variables. However, the AID analysis can be used with other statistical techniques to complete the analysis and determine the relative importance of the different independent variables. In comparison, the ANN creates a function and certain relationships among variables that are more effective in approximating each output variable, allowing a more or less accurate prediction of each output variable studied.
GLM-AID identified the most significant chemical parameters linked to the harvest date and the production system, and even associated with some tomato cultivars, which can be considered the main significant differences among tomato samples.
GLM-AID showed that weather conditions have the main influence on the chemical composition of tomatoes, followed by the effect of the production system. Once those chemical parameters had been identified, the tomato cultivars were not sequentially segmented. This may be due to the similar nutritional composition of the tomato cultivars.
The ANN models showed different results: the percentage of correct classification according to harvest date was 100%, it was slightly lower for the production system, at 93.8%, and it was lowest, 50%, for the tomato cultivar. One possible explanation is that the nutritional parameters are not suitable for the identification or classification of tomato cultivars, so other kinds of markers should be considered. Nevertheless, the combination of GLM-AID and ANN models can be useful to reduce experimental time and cost and personnel costs, allowing the use of these predictive models in food fraud detection.
Supporting Information S1