Genetic programming based models in plant tissue culture: An addendum to traditional statistical approach

In this paper, we compared the efficacy of an observation-based modeling approach using a genetic algorithm with regular statistical analysis, as an alternative methodology in plant research. Preliminary experimental data on in vitro rooting were taken for this study with the aim of understanding the effect of charcoal and naphthalene acetic acid (NAA) on successful rooting and of optimizing the two variables for maximum response. Both observation-based modelling and the traditional approach identified NAA as a critical factor in rooting of the plantlets under the experimental conditions employed. Symbolic regression analysis using the software deployed here optimised the treatments studied and was successful in identifying the complex non-linear interaction among the variables with minimal preliminary data. The presence of charcoal in the culture medium has a significant impact on root generation by reducing basal callus mass formation. Such an approach is advantageous for establishing in vitro culture protocols, as these models have significant potential for saving time and expenditure in plant tissue culture laboratories, and it further reduces the need for specialised background.


Author summary
Trials to find the best combination of factors that contribute to the desired response take up the bulk of time and resources in any plant tissue culture laboratory. The output of such experiments is analysed statistically to reach a conclusion. However, without prior statistical modifications, the results could be misleading. Recent reports from several labs point to the use of artificial neural networks to circumvent this. We have chosen to use a computational process that can predict the best combination of factors for the desired response after randomly testing the higher and lower limits of the factors with experiments. The magnitude of the desired response can be estimated at any concentration within this range using the models generated by symbolic regression. The procedure provides both the optimum model function and the optimum variable values in the model. The variable sensitivity and percentage response add depth to the information thus obtained. The study indicated that these models would have significant potential for saving time and expenditure in plant tissue culture laboratories.

Introduction
Relatively straightforward and efficient empirical modeling techniques based on input-output models are gaining popularity over conventional statistical methods across various disciplines [1]. This surge is due to their relative ease of use and understanding. Genetic programming (GP) is an approach that uses the concept of biological evolution to handle a problem with many fluctuating variables. Computational optimisation techniques have recently debuted in plant tissue culture research in the form of neural network models [2]. Symbolic regression was one of the earliest applications of GP and continues to be widely considered [3]. A broad array of scientific fields such as biology, chemistry, environmental science, neurology and psychology report the use of symbolic regression [4][5][6][7][8][9]. However, plant tissue culture data have not yet been analysed using symbolic regression. The data generated from plant tissue culture experiments include continuous, count, binomial or multinomial data, and the information is predominantly validated using analysis of variance (ANOVA) [2,10]. ANOVA is adequate for normally distributed continuous data, but without prior manipulation it is erroneous to use it to analyze count, binomial or multinomial data [11]. Neuro-fuzzy logic is the standard practice by which computational modeling is achieved in plant tissue culture [2,12]. In this context, genetic algorithm based symbolic regression remains unevaluated. Unlike conventional regression analysis, which optimises parameters for a pre-defined model, symbolic regression avoids imposing any a priori assumptions. In generalised linear model (GLM) regression, the dependent variable is represented as a linear combination of a given set of basic functions, and the coefficients are optimised to fit the data. Symbolic regression, however, searches for both the set of basic functions and the coefficients.
The added value of symbolic regression, compared to GLM, lies in its ability to quickly and accurately find an optimal set of basic functions [13,14]. The algorithm infers the model from the data by combining variables and mathematical operators and generates an empirical formula, that is, a mathematical equation that predicts the results observed in the experiments conducted. GP combines previous equations to form new ones. It thus produces models with an interpretable structure that relate the input and output variables of a data set without pre-processing, identifies critical parameters and hence sheds light on the underlying processes involved in a given system [15]. Symbolic regression can recognise and model complex non-linear relationships between the inputs and outputs of biological processes even in the presence of disturbances, and it has potential for parallel processing. The preliminary data generated from experiments on rooting of in vitro regenerated plantlets of Wrightia tinctoria were employed to study the utility of symbolic regression for analyzing plant tissue culture data. The effect of two variables, NAA and charcoal, on root proliferation was considered. The datasets were subjected to the usual statistical analysis as well as observation-based modeling via symbolic regression. Moreover, we aimed to optimise the process by examining the influencing factors. We propose the use of symbolic regression-based model prediction as an addendum to data analysis methods for plant tissue culture experiments.
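The difference between fitting coefficients for a fixed model and searching over model forms can be illustrated with a toy example. The sketch below is not genetic programming and is not how Eureqa works internally; it is a plain exhaustive scan over a small, hypothetical list of candidate terms (`CANDIDATES`), each fitted as z = a·g(x, y) + b by closed-form least squares. The data are synthetic and generated in the script itself.

```python
import math

# Hypothetical candidate terms g(x, y); a real GP engine evolves expression
# trees instead of scanning a fixed list like this one.
CANDIDATES = {
    "x": lambda x, y: x,
    "y": lambda x, y: y,
    "x*y": lambda x, y: x * y,
    "sin(x)": lambda x, y: math.sin(x),
    "cos(y)": lambda x, y: math.cos(y),
}


def fit_line(u, z):
    """Closed-form least squares for z ~ a*u + b."""
    n = len(u)
    mean_u, mean_z = sum(u) / n, sum(z) / n
    var_u = sum((ui - mean_u) ** 2 for ui in u)
    if var_u == 0:
        return 0.0, mean_z
    a = sum((ui - mean_u) * (zi - mean_z) for ui, zi in zip(u, z)) / var_u
    return a, mean_z - a * mean_u


def search_model(points, z):
    """Return (sse, term name, a, b) for the term with the lowest squared error."""
    best = None
    for name, g in CANDIDATES.items():
        u = [g(x, y) for x, y in points]
        a, b = fit_line(u, z)
        sse = sum((a * ui + b - zi) ** 2 for ui, zi in zip(u, z))
        if best is None or sse < best[0]:
            best = (sse, name, a, b)
    return best


# Synthetic data generated from z = 3*x*y + 1 (not experimental data).
points = [(x, y) for x in (2, 4, 6) for y in (0.01, 0.05, 0.11)]
z = [3 * x * y + 1 for x, y in points]
sse, name, a, b = search_model(points, z)
print(name, round(a, 6), round(b, 6))
```

A GLM would fix the set of terms in advance and fit only the coefficients; here both the term and its coefficients are selected from the data, which is the distinction drawn in the text.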

Culture conditions
Genetic variability was kept to a minimum by using a single field-grown ortet, thus minimising statistical errors [16]. Nodal regions derived from the fresh flushes of growth from the ortet, two weeks after lopping one major branch, served as the explants [17]. The nodal explants were conditioned over a period of 4 months (one subculture every four weeks) on MS medium (1962) [18] at pH 5.8 with 2 μM each of BAP and NAA for shoot multiplication. For rooting experiments, individual shoots were transferred to MS medium containing 2 μM BAP with NAA (2, 4 and 6 μM) and charcoal (0.01, 0.03, 0.05, 0.07, 0.09 and 0.11%) in 250 ml culture flasks holding 50 ml of sterilized medium (pH 5.8). The cultures were maintained at 25 ± 2 °C in a culture room with an irradiance of 40 μmol m⁻² s⁻¹, a photoperiod of 8 h and a relative humidity of 55 ± 5%.

Experimental design
The plant tissue culture database, containing 21 conditions, followed a factorial design for two variables: the concentration of NAA (2, 4 and 6 μM) and of charcoal (0, 0.01, 0.03, 0.05, 0.07, 0.09 and 0.11%) in the medium. Each treatment consisted of 5-7 explants in a culture flask, with three replicates. Subculture was done at the end of four weeks, and five parameters were recorded to analyze the effects of the variables on rooting: basal callus diameter (mm) (BC), the percentage of shoots rooted (R), the length of the longest root (cm) (RL), the number of roots (NR) and the number of lateral roots (NLR) (S1).
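The factorial layout described above can be enumerated in a few lines. This is a minimal sketch that only reproduces the counts stated in the text (3 NAA levels × 7 charcoal levels, three replicates); the record field names are hypothetical and chosen for illustration.

```python
from itertools import product

# Levels taken from the experimental design described in the text.
NAA_LEVELS_UM = [2, 4, 6]
CHARCOAL_LEVELS_PCT = [0, 0.01, 0.03, 0.05, 0.07, 0.09, 0.11]
REPLICATES = 3


def build_design(naa_levels, charcoal_levels, replicates):
    """Return one record per culture flask in a full factorial layout."""
    design = []
    for naa, charcoal in product(naa_levels, charcoal_levels):
        for rep in range(1, replicates + 1):
            # Field names are hypothetical, for illustration only.
            design.append({"naa_uM": naa, "charcoal_pct": charcoal, "replicate": rep})
    return design


design = build_design(NAA_LEVELS_UM, CHARCOAL_LEVELS_PCT, REPLICATES)
conditions = {(row["naa_uM"], row["charcoal_pct"]) for row in design}
print(len(conditions), "conditions,", len(design), "flasks")
```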

Statistical analysis
All experiments were conducted using a Randomised Block Design (RBD). Continuous data were analysed using multiple linear regression in R, and post hoc pairwise comparisons were performed with Tukey's test (p<0.05). Count data were analysed using a Poisson regression model. Pearson's Chi-squared test for count data was employed to assess the statistical significance of the variables.

Symbolic regression
Each of the observed parameters was modeled as a function of NAA and charcoal concentrations using symbolic regression, with GLM for comparison. To obtain a global optimum, we also modelled the combination (R+RL+NR+NLR-BC), taking the rooting factors together after normalisation, with both GLM and symbolic regression. The optimum model for each case was generated by genetic programming based symbolic regression using the software package Eureqa (Version 0.98 beta), with 50% of the data randomly selected as training data and 3-fold cross-validation on a randomly selected 25% drawn from the remaining data [19][20][21]. Corresponding to each symbolic regression model of the data partition, we also obtained a generalised linear model by including x, y, xy, 1/x, 1/y, sin(x), cos(x), sin(y), cos(y), xy sin(x), xy cos(x), xy sin(y) and xy cos(y) in the set of basic functions, cross-validated in the same way. The remaining 25% of the data was used for testing and reporting error [19][20][21]. The target expression used to generate the regression model was the minimal equation z = f(x, y), where 'x' corresponds to NAA concentration, 'y' corresponds to charcoal concentration, and 'z' represents each of the five observed parameters and their combination. The models were based on the primary and trigonometric building blocks, with the R² goodness of fit as the error metric [22,23]. Root Mean Squared Error (RMSE) was calculated for the test data sets. Sensitivity represents the relative impact of a variable on the parameter studied within the model and was calculated by the local method using partial derivatives [24].
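The data partition and the two error measures named above can be sketched as follows. This is an illustration under stated assumptions, not the Eureqa procedure: it draws a single random 50/25/25 split (the 3-fold repetition is not reproduced), and the function names `split_indices`, `rmse` and `r_squared` are hypothetical helpers written for this example.

```python
import math
import random


def split_indices(n, seed=0):
    """Shuffle 0..n-1 and cut into roughly 50% training, 25% validation, 25% test."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    n_train = n // 2
    n_valid = (n - n_train) // 2
    train = order[:n_train]
    valid = order[n_train:n_train + n_valid]
    test = order[n_train + n_valid:]
    return train, valid, test


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# 21 treatment conditions, as in the experimental design.
train, valid, test = split_indices(21)
print(len(train), len(valid), len(test))
```

With 21 conditions the integer cuts cannot be exactly 50/25/25, so the sketch places the leftover condition in the test set; any other rounding rule would be equally defensible.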
Given a model equation of the form z = f(x, y), the influence metrics of x on z were defined as follows. Sensitivity $= \overline{\left|\frac{\partial z}{\partial x}\right|} \cdot \frac{\sigma(x)}{\sigma(z)}$, evaluated at all input data points. The percentage positive was calculated as the percentage of data points where $\frac{\partial z}{\partial x} > 0$, and the percentage negative as the percentage of data points where $\frac{\partial z}{\partial x} < 0$. Here $\frac{\partial z}{\partial x}$ was the partial derivative of z with respect to x, σ(x) was the standard deviation of x in the input data, σ(z) was the standard deviation of z, |x| denoted the absolute value of x, and $\bar{x}$ denoted the mean of x [25]. The 'fmin' function in MATLAB (R2012b) was used to obtain the maximum value of each of these functions.
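The influence metrics described above can be approximated numerically for any fitted model. The sketch below assumes the sensitivity is the mean absolute partial derivative scaled by σ(x)/σ(z), estimates the derivative by central finite differences, and takes σ(z) from the model output at the data points (an assumption made here for self-containment). The model `toy_model` is hypothetical and is not one of the fitted functions in Table 5.

```python
import statistics


def partial_x(f, x, y, h=1e-6):
    """Central finite-difference estimate of dz/dx at (x, y)."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)


def influence_of_x(f, points):
    """Return (sensitivity, % positive, % negative) of x on z = f(x, y)."""
    derivs = [partial_x(f, x, y) for x, y in points]
    xs = [x for x, _ in points]
    zs = [f(x, y) for x, y in points]  # assumption: sigma(z) from model output
    mean_abs = statistics.fmean([abs(d) for d in derivs])
    sensitivity = mean_abs * statistics.pstdev(xs) / statistics.pstdev(zs)
    pct_pos = 100.0 * sum(d > 0 for d in derivs) / len(derivs)
    pct_neg = 100.0 * sum(d < 0 for d in derivs) / len(derivs)
    return sensitivity, pct_pos, pct_neg


def toy_model(x, y):
    """Hypothetical model, for illustration only."""
    return 2.0 * x - 3.0 * y


points = [(x, y) for x in (2, 4, 6) for y in (0.0, 0.05, 0.11)]
sens, pos, neg = influence_of_x(toy_model, points)
print(round(sens, 3), pos, neg)
```

The same function applied with the roles of x and y swapped gives the influence of the second variable.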

Results and discussion
The average values obtained for the five growth parameters observed during the study are given for the basal callus diameter (Table 1), the percentage of shoots rooted (Table 2), the length of the longest roots (Table 3), and the number of roots and the number of lateral roots (Table 4). Lowercase letters within a column indicate the significant influence of charcoal, and uppercase letters within a row represent the significant interaction of NAA. The shoots inoculated on MS medium with 0% charcoal (control) showed maximum basal callus formation (Fig 1). The shoots inoculated on MS medium supplemented with 4 μM NAA and 0.07% charcoal showed the maximum percentage of rooting (Fig 2).
Multiple linear regression demonstrated a significant effect of NAA and its interaction with charcoal on basal callus (p<0.001), the percentage of shoots rooted (p<0.05) and root length (p<0.01) (Tables 1-3). The individual effect of NAA on the number of roots and lateral roots was found to be significant at p<0.05 and p<0.001, respectively (Table 4). The interaction of NAA and charcoal was not significant for these parameters. Mathematical functions were successfully developed using symbolic regression to understand the correlation between the two variables for each of the parameters considered, and they are contrasted with those obtained by traditional regression models (Table 5). To analyze the effect of each variable on the parameter studied, variable sensitivity measures were calculated along with the percentage impact. Sensitivity denotes the relative impact within the model that a variable has on the target variable. The individual effect of the two input variables on the output parameter is reported as the percentage positive or negative of that input variable (Table 6). For basal callus diameter, the percentage positive value for variable 'y' was zero. In other words, there was zero percent chance of basal callus mass increasing with increasing concentration of charcoal; that is, basal callus mass decreased with increasing concentration of charcoal (Fig 3). The model predicted that an increase in charcoal concentration produced an increase in root length and root number in 50% of all the trials, while it promoted rooting percentage and lateral root number in 75% of the trials. Root number and root length decreased with increasing concentration of NAA in 100% of the trials. Rooting percentage and lateral root number increased with increasing NAA concentration in 50% of all the trials.
The functions obtained and the 3D plots generated from them could be used to predict the combinations of input variables giving optimum results. The best response for rooting percentage was predicted at 3.7 μM NAA and 0.08% charcoal (Fig 4). The root length showed a non-linear pattern, and the highest value of its function was estimated at 2.8 μM NAA and 0.05% charcoal (Fig 5). The maximum root number was determined for 1.7 μM NAA and 0.06% charcoal (Fig 6). The maximum value of the function generated for lateral root number was at 6.3 μM NAA and 0.08% charcoal (Fig 7). The global optimum modelled upon the combination (R+RL+NR+NLR-BC) was located at 2.44 μM NAA and 0.03% charcoal (Fig 8). Traditional statistics suggested that charcoal had a positive and stimulatory effect on rooting of shoots by reducing basal callus (Table 1). Percentage of shoots rooted and root length were significantly affected by the combination of NAA and charcoal (Tables 2 and 3). In the present study, NAA had a significant effect on rooting, as shown by the number of roots and lateral roots (Table 4). Similar results were reported in Acacia leucophloea and Cinnamomum verum [26,27]. With traditional statistics, we were not able to estimate the combination(s) of the two variables producing the best results, or to identify the relative impact of a particular variable on the output parameter. Modeling of plant tissue culture data is commonly practised using regression analysis, in which an initial function is first approximated and the data are then fitted to that function to obtain the optimum parameters [11,28,29]. In this procedure, even when one obtains the optimum parameter values, the model prediction is limited by the possibility that the wrong model function was selected.
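Locating the variable combination that maximises a fitted function can be sketched with a simple grid scan. This swaps in a brute-force grid search for the MATLAB routine used in the study, and the response surface `toy_response` is hypothetical: it is a concave quadratic chosen so that its maximum is known, not one of the fitted models.

```python
def grid_argmax(f, x_range, y_range, steps=100):
    """Scan a regular grid and return (z, x, y) at the largest value of f."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    best = None
    for i in range(steps + 1):
        x = x_lo + (x_hi - x_lo) * i / steps
        for j in range(steps + 1):
            y = y_lo + (y_hi - y_lo) * j / steps
            z = f(x, y)
            if best is None or z > best[0]:
                best = (z, x, y)
    return best


def toy_response(x, y):
    """Hypothetical surface with its maximum at x = 3.5, y = 0.06."""
    return -(x - 3.5) ** 2 - 400.0 * (y - 0.06) ** 2


# Ranges match the tested NAA (uM) and charcoal (%) limits.
z_max, naa_opt, charcoal_opt = grid_argmax(toy_response, (2.0, 6.0), (0.0, 0.11))
print(round(naa_opt, 2), round(charcoal_opt, 3))
```

A grid scan restricts the search to the tested concentration range and is insensitive to local optima, at the cost of resolution limited by the step size.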
In contrast, symbolic regression procedures work simultaneously on the model specification problem and the problem of fitting coefficients [30]. Thus they provide both the optimum model function and the optimum variable values in the model. The simple relations derived from GP made it easier to analyze the relationships between the input and output variables [31]. Observation-based predictive models using GP identified that the individual effect of charcoal was significant for all the output parameters. A previous investigation suggested basal callus mass formation as one of the primary constraints in the culture of this tree species [32]. In the present study, charcoal had a positive and stimulatory effect on rooting by reducing basal callus formation in shoots. For each of the functions, generated values can be obtained by increasing or decreasing the variables by a unit. After randomly testing the higher and lower limits of the additives with experiments, the magnitude of the observed parameters can be estimated at any concentration of the additives within this range using the models generated. The approach can be extended to analyze synergistic interactions between two variables by testing whether increasing both variables by a unit gives a higher or a lower value than the sum of the values obtained by increasing each individually by a unit. The basic requirements for any empirical model include interpretability, robustness and reliability [33]. Symbolic regression gave comparatively lower RMSE values than multiple linear regression, thus adding validity to its use.
Table 5 note: the variable x represents NAA concentration and y represents charcoal concentration. https://doi.org/10.1371/journal.pcbi.1005976.t005
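The synergy test described in the text, comparing the change from increasing both variables with the sum of the changes from increasing each alone, can be written directly against any model function. In this sketch the step sizes `dx` and `dy` are parameters, since a "unit" differs between NAA (μM) and charcoal (%); both example models are hypothetical.

```python
def synergy(f, x, y, dx=1.0, dy=1.0):
    """Joint gain minus the sum of individual gains for z = f(x, y).

    Positive: the joint increase exceeds the sum of the single increases
    (synergistic). Negative: it falls short (antagonistic). Zero: additive.
    """
    base = f(x, y)
    gain_x = f(x + dx, y) - base
    gain_y = f(x, y + dy) - base
    gain_both = f(x + dx, y + dy) - base
    return gain_both - (gain_x + gain_y)


def additive_model(x, y):
    """Hypothetical model without interaction."""
    return x + 2.0 * y


def interacting_model(x, y):
    """Hypothetical model with a multiplicative interaction."""
    return x * y


print(synergy(additive_model, 2.0, 3.0))
print(synergy(interacting_model, 2.0, 3.0))
```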
In plant tissue culture, obtaining an optimum model is crucial when one needs to find the optimum experimental parameters for large-scale production. The procedure adopted in this work can also be extended to similar experiments, as it is general and computationally efficient. The analysis predicted the optimum concentrations in the medium for micropropagation of the selected tree species from the model plots derived from the preliminary experimental data. The study indicated that these models would have significant potential for saving time and expenditure in plant tissue culture laboratories for the commercial establishment of in vitro protocols in tree species.