Overcoming collinearity in path analysis of soybean [Glycine max (L.) Merr.] grain oil content

Path analysis allows understanding the direct and indirect effects among traits. Multicollinearity in correlation matrices may cause a bias in path analysis estimates. This study aimed to: a) understand the correlation among soybean traits and estimate their direct and indirect effects on gain oil content; b) verify the efficiency of ridge path analysis and trait culling to overcome colinearity. Three different matrices with different levels of collinearity were obtained by trait culling. Ridge path analysis was performed on matrices with strong collinearity; otherwise, a traditional path analysis was performed. The same analyses were run on a simulated dataset. Trait culling was applied to matrix R originating the matrices R1 and R2. Path analysis for matrices R1 and R2 presented a high determination coefficient (0.856 and 0.832, respectively) and low effect of the residual variable (0.379 and 0.410 respectively). Ridge path analysis presented low determination coefficient (0.657) and no direct effects greater than the effects of the residual variable (0.585). Trait culling was more effective to overcome collinearity. Mass of grains, number of nodes, and number of pods are promising for indirect selection for oil content.


Introduction
Soybean has been very important to Brazilian agribusiness. The oleaginous crop comprises approximately 20% oil by weight [1] and this product has been used in several applications, especially as kitchen oil and biodiesel. The crop is an important source of oil and ranks first in oil production in the world scenario [2] and the global demand for soybean oil is arising [3].
Quantitative traits, such as soybean oil, protein content, and grain yield are quantitative traits [4,5], and are therefore controlled by a large number of genes, undergo a strong environmental influence, and usually show low heritabilities. Thus, the indirect selection is recommended to achieve genetic gains [6]. The viability of indirect selection is commonly investigated by correlations studies [7]. However, [8] emphasize that correlations among traits are purely a measure of association and do not investigate the cause-and-effect relationships and according to [9] the association between two traits can be influenced by a third trait or a group of them. [10] proposed path coefficient analysis, which is a multiple regression expanded and allows unfolding the correlation coefficients of direct and indirect effects on a basic variable [11]. Path Analysis was first used in plants by [12] and it was later used in other crops.
The estimation of the path analysis coefficients can be adversely affected by the multicollinearity effects among traits. Multicollinearity occurs when the observation of the independent variables, or their combinations, are correlated [11,14]. When two variables are highly correlated it becomes hard to individually estimate the relations of each explanatory variable no the main variable because they contribute colletively to explain linear relations [14]. [13] advise performing the multicollinearity diagnostic on the correlation matrix of explanatory variables, since in presence of multicollinearity the variances associated to the path coefficients estimators may become too large and leads to unreliable estimates [14,15]. In presence of multicollinearity, [12] recommend using ridge regression, which can obtain the solution to the vector parameters, even with significant multicollinearity, by the sum of small constant values (k) in the diagonal of the correlation matrix, as proposed by [16,17]; or by excluding variables highly associated as presented by [18].
Many studies regarding path analysis on soybean that aimed to study the effects of traits on grain yield are available in the literature. [15,[19][20][21][22][23][24]. These studies show that it is usual to find high correlations among traits and, consequently, multicollinearity. Thus, methods to overcome multicollinearity must be studied, since most of these papers were not concerned to compare different methods for overcome collinearity.
Furthermore, despite of the large number of scientific papers about path analysis in soybean, no papers concerned grain oil content is found. Thus, the current study aimed to: a) understand the correlation among soybean traits as well as estimate its direct and indirect effects on oil content; b) verify the efficiency of ridge path analysis and trait culling to overcome colinearity problems.

Experimental conduction
The experiment was installed in 2016, February 22, in the Soybean Genetic Breeding Program's greenhouse of the Plant Science Department of the University Federal of Viçosa, Viçosa, Minas Gerais.
The experimental design was a randomized block with 5 replicates. Since the pots filled all tables of the greenhouse, we arranged the treatments in blocks in order to account possible environmental gradient. The treatments consisted of 34 F 1 soybean populations originated from a partial diallel and 15 commercial presented in S1 Appendix according to National Cultivar Registration [25]. Those cultivars used as genitors comprised between 5.5 and 9.0 [26]. The data of commercial cultivars were the ones used for statistical analysis. The plots consisted of two pots containing two plants of the same treatment (genotype). Sowing was performed in a substrate composed of three parts soil, two parts sand, and one part of organic substrate. The greenhouse was adapted using lamps in order to adjust the photoperiod and provide an appropriate environmental since the photoperiod is critical as the flowering inductor in soybean [27].
The experiment was carried in accordance with the recommendations of [28]. Twelve vegetative, productive, and grain quality traits were evaluated in each plant as they reached the appropriate phenological stage, according to [29] as described in Table 1.

Genetic-statistical analysis
The genetic parameters were estimated and the genetic values predicted by analyzing only the genitors' information. The statistical analyses were performed with the genetics and statistics software SELEGEN-REML/BLUP as described by [30,31], using the following statistical model: Where y is an (gr) x 1 vector of response variable, being g and r the number of genotypes and replicates, respectively; r is an (r) x 1 vector of unknown fixed effects fixed effects of replicates; g is an (g) x 1 vector of random effects of genetic values; and e is an (gr) x vector of random effects of error. X (g x r) and Z (r x g) are the incidence matrix. The significance of the random effect was obtained by the likelihood ratio test by the analysis of deviance as recommended by 30. The analysis of deviance is derived from the difference between the deviance of the complete model and reduced model missing the effect which must be tested using chi-squared test with one degree of freedom [30,32].
After obtaining the genetic values, we estimated the correlations between the traits. Then, a multicollinearity diagnosis in the original correlation matrix (matrix R) composed of 11 explanatory traits was performed. The multicollinearity diagnosis was performed considering the condition number as stated by, therefore, the matrices were classified as with weak, moderate, or strong collinearity for condition numbers lower than 100, between 100 and 1000, and greater than 1000, respectively [32].
Two procedures were used to address the problems caused by collinearity. Firstly, the path analysis through the ridge regression onto matrix R (Fig 1) as proposed by [16,17] and secondly, traditional path analysis after eliminating traits which showed high correlation with another (trait culling). For the path analysis with trait culling, two new correlation matrices were obtained: R 1 and R 2 by the elimination of different traits in each matrix, and the traditional path analysis were run as shown in the diagrams (Figs 2 and 3, respectively). The traits eliminated were chosen considering its correlation with other traits kept in the matrix and how often they are evaluated in soybean breeding programs, for example: HF and HM are highly correlated (0.92) and HF is less frequently measured in soybean breeding. Herewith, the reduction of collinearity was expected.
To establish the constant value K, the direct effects were plotted in a graph as a function of the values of K. The minimum value of K that provided stable estimates was chosen to run the analysis, as recommended by [12].

Simulation of a new dataset and validation of the path analysis procedures
To validate the results obtained with the experimental data, one hundred of new populations composed of 100 individuals were simulated using the means and covariance matrix of the original dataset in order to keep biological sense in the variables. The large number of simulated populations of few individuals avoids the reproduction of results obtained with original dataset.
Then, one correlation matrix was estimated for each population. After obtained the correlation matrices for each population the same procedures of path analysis were adopted to the simulated dataset. The same set of variables and procedures performed for R, R 1 and R 2 were also performed respectively for simulated matrices Rs's, R 1 s's and R 2 s's.
The best linear unbiased predictions (BLUPs) of the genotypes and variance components were estimated using the software SELEGEN-REML/BLUP [31]; and the correlation coefficients, path analysis, and multicollinearity diagnosis were performed for real data using the software GENES [33] and for simulated data using R software 3.5.3 [34].

Results
Significant genetic variability was found for all vegetative, reproductive and grain quality traits. The lowest and the greater heritability (h 2 ) estimate were found for OC (0.368) and DF (0.992) respectively (Table 2). Grain quality traits (OC and PC) and MG showed lower heritability estimated than most of traits. Accuracy (r âa ) values varied from 0.854 for OC to 0.992 for DF. The experimental coefficients of variation (CVe%) varied from 2.81 for PC to 20.3 for NS. Genotypic coefficient of variation (CVg%) ranged from 32.1 for HF to 3.00 for PC and CVg% values were greater than CVe% for all traits. The relative coefficient of variation, which is the ratio between the CVg% and CVe% ranged from 20.3 for HF and 2.81 for NP.
Genotypic values prediction for each genotype is presented in Table 3 and correlation coefficients estimated between genetic values are presented in Table 4. The correlation coefficients indicates that MG is high influenced by DF, DM, NN, and NP. Besides this, MG was more correlated to NP (0,79). With the respect to the oil content, NS showed the greater correlation with OC (0.75). The oil content was also strongly correlated with the yield components NN (0.64), NP (0.65), NG (0.67) and also with MG (0.67). High and significant correlations were found for number of pods (NP) and NN (0.88), and for NN and DM (0.86). Regarding the traits of grain quality, a high and negative correlation were found for PC × NS (-0.56), PC × NG (-0.53), and PC × OC (-0.60). The means of the correlation coefficients from the 100   simulated populations showed to be very similar to those estimated from the original dataset as also shown in Table 4.
Multicollinearity diagnosis analysis showed severe multicollinearity for matrix R; moderate, but very close to the limit to be weak for matrix R 1 (not contemplating the traits DF, HF and NB); and weak for matrix R 2 (not contemplating the traits DF, HF, NN and NB) ( Table 5).
There was no direct or indirect effect higher than the residual effect (0.585) and the determination coefficient was not satisfactory (0.657) for matrix R (Table 6). For the R 1 matrix, path analysis showed high determination coefficient (0.856) and the effect of the residual variable was outdone by direct effects of traits on OC ( Table 7). The traits DM, HD, HM, and NG presented direct effects with sign opposite to the correlation coefficient in analysis of the matrix R 1 . A weak association of DM and HM with OC was observed. On the other hand, NP, NN, and MG were noticed to be highly associated with OC by the direct effects 0.427, 0.632 and 0.967.
Strong direct effects of NN (0.804) and MG (1.05) on OC were found ( Table 8). The ERV (0.410) and R 2 (0.832) found for path analysis on matrix R 2 fitted better than the analysis of the matrix R, but not as well as the analysis of the matrix R 1 . Analysis of the matrix R 2 revealed results very similar to those of matrix R 1 .
The path analyses on Rs's matrices revealed a lower average k-value (0.334) and a higher determination coefficient (0.701) as presented in Table 9. The same promising traits (NP, NN and MG) for indirect selection in matrices R 1 were also appointed by path analysis on R 1 s's matrices whose average estimates are shown in Table 9. The average condition number estimated for simulated populations were slightly greater than those estimated for the original dataset.

Genetic control and variance components
The genetic variability noticed between soybean cultivars shows that they can be used as genitor in breeding programs. The variability between genitors is fundamental to generate segregating populations that provide genetic gains [7]. Low heritability and accuracy were expected for mass of grains and quality traits, since they are controlled by a larger number of genes and are more affected by environmental factors [5,6]. The values of accuracy found in this research are considered high according to [35]. These results indicate good experimental quality; thus, the data set and data analysis can be used in trustworthy inferences by researchers.
According a classification proposed by [36] the residual coefficients of variation (CVe%) were classified as low which indicates good experimental quality. The greatest value of CVe% was found for number of branches. According to [37] soybean plants have the ability to adjust the number of branches to compensate for the distance from other plants and intercept the maximum amount of light. The authors named this phenotypic plasticity. Despite equal spacing between pots, randomizing the treatments inside the blocks induced different levels of competition to the same genotype in different blocks. This effect was accounted for the environmental effects, therefore, decreases the coefficient of variance.

PLOS ONE
Overcoming collinearity in path analysis of soybean oil content According to [38] an statistic usually employed on genetic evaluations is the ratio between the genetic coefficient of variation and the experimental coefficient of variation which is called relative coefficient of variation (CVr) and is related to experimental accuracy. Estimates of relative CVr higher than 1 indicate more accurate and precise inferences about the rank of the genotypes, which was found for 10 traits with the exception of NB and OC.

Genetic correlation among traits and multicollinearity diagnosis
The traits days to flowering, days to maturing, number of nodes and number of pods were found strongly correlated to mass of grains. These results corroborate with results for genetic correlation found by [39]. The strong correlation between oil content and grain yield found is reinforced by [40]. The high correlation of number of nodes and number of pods with grains yield are reported in the literature as well as found in this research [24,39]. This is due to the grains yield comes from pods which are developed in axillary buds located on plant stems or branches originated from the nodes. Besides this, the high correlation between number of nodes and days to maturing shows that late maturing genotypes tend to present higher grain yield per plant. These results agree with those of [15].
A strong and negative correlation between oil and protein content were also found by [3,41,42] in molecular markers studies. [43] suggested that both traits are controlled by the same gene or group of genes, which were later reported as a major QTL in the crhomossome 20 by [44].
The elimination of traits that are strongly correlated and have biological interpretations such as DF and DM or HF and HM was shown to be effective to overcome multicollinearity. Thus, matrix R 1 or R 2 each contains just one of these traits from these pairs. This discussion is supported by the results of simulated data presented in Table 4 that showed very similar results for 100 populations. However the matrices comprised of the same set of variables presented the same level of collinearity problems.

Path analysis results and comparison among procedures
The poor adjustment of the ridge path analysis onto R and Rs's matrices and the greater effects of the residual variable in comparison with the direct effects, leads us to affirm that the estimates are not reliable (Tables 6 and 8). [15] obtained direct effect estimates higher than 1 for two traits in a traditional path analysis on a correlation matrix with strong collinearity problems for soybean. The authors also ran a ridge analysis, which gave estimates between 0 and 1, and these two traits did not show relevant direct effects. The K value used by the authors was of 0.05, whereas the K value needed to overcome collinearity in our original and simulated dataset were much greater. [12] affirm that high K values lead to higher bias in the regression and may provide results without biological mean.
High values for effects of residual variable (ERV) may be found when the number of explanatory variables is not enough to explain variation of the main trait. However, the estimates for matrix R 1 showed that even after reducing the number of explanatory variables, it was possible to explain a large part of the total variance. This result was also observed by [14] when they found greater regression adjustment after trait culling of the explanatory variables. [10] affirm that the exclusion of variables allows estimating path coefficients with biological meaning, what is an advantage of trait culling. A weak association of HM with OC was observed by the low direct and indirect effects, although HM showed significant correlation with OC. Likewise, a high correlation between PC and OC and weak direct effect was noticed. These results proves that the correlation coefficient does not necessarily give the best idea of association between variables and researches need to be careful at choose traits for indirect selection based just on correlation coefficient. On the other hand, NP, NN and MG are recommended to be used in indirect selection for OC, since the magnitude of direct effects were close to the correlation, which is the sum of direct and indirect effects involving each trait. In addition, these traits are easier, cheaper, and quicker to measure than laboratory procedures to measure OC. Besides this, NP and NN may be evaluated before complete maturation, which allows earlier selection, and MG is routinely evaluated in soybean experiments. In order to achieve genetic gains on OC we also recommend the use of a selection index including the traits that showed high direct effects on OC.
The direct effect of MG on OC from path analysis on R 2 matrix was 1.05, although estimates greater than 1 were not expected, once the matrix R 2 showed weak collinearity problems. The path analysis on R 2 s's matrices also presented means direct effects of MG on OC higher than 1. The strong and positive direct effects of NN and MG on OC showed in R 1 analysis were confirmed by path analysis on R 2 and R 2 s's matrices and encourage researches to use them for indirect selection. The same discussion mentioned above regarding PC in matrix R 1 can be applied to matrix R 2 .
Although the matrix R 1 showed moderate to strong problems with collinearity according to [45], the path analysis onto this set of variables was the most suitable for this study, since it provided the greatest determination coefficient of the regression, the lowest effect of the residual variable, and all effects were estimated inside the parameter space, which ranges from -1 to 1 for direct, indirect and residual variable effects. These results corroborate with those found by [14] when the authors tested three methods of path analysis (traditional, ridge path analysis and trait culling) and concluded that the trait culling provided a fit statistics more accurate. The discussions about the quality of the path analysis onto R 1 matrix were validated by path analysis onto R 1 s's matrices according to the great regression adjustment (0.844) and the high effect of the residual variable. Furthermore, the choice of the R 1 matrix allows considering one more variable in the study (NP).
The ridge path analysis was not an appropriated approach for overcome the collinearity problem, since a low R 2 (0.657) and high ERV (0.585) were found. Besides of this, no direct effects greater than ERV were found for analysis onto matrices R and R s's (Tables 6 and 8). Path analysis after trait culling for matrices R 1 and R 2 showed to be the best strategy for overcome collinearity problem since it presented a well-fit of the regression denoted by the values of R 2 (0.856 and, 0.832 respectively) and direct effects greater than the ERV (0.379 and 0.410, respectively). The results found onto simulated matrices supported these findings (Table 9). In addition, trait culling is a simple technique and allows to the researcher exclude the variable that contributes more to the collinearity considering its biological relationship with the main trait.

Conclusions
There is a strong relation of causation of mass of grains per plant, number of nodes, and number of pods with the oil content in grains and they could be used for indirect selection. Trait culling was more suitable and simpler strategy than ridge path analysis to overcome collinearity problems.
Eliminate one between two similar traits biologically was efficient to reduce condition number and, consequently, overcome multicollinearity.
Supporting information S1 Appendix. Maintainer company, cultivar name and to National Cultivar Registration (RNC) register number of 15 commercial soybean cultivars used for the research. (DOCX)