Path and Ridge Regression Analysis of Seed Yield and Seed Yield Components of Russian Wildrye (Psathyrostachys juncea Nevski) under Field Conditions

The correlations among seed yield components, and their direct and indirect effects on the seed yield (Z) of Russian wildrye (Psathyrostachys juncea Nevski), were investigated. The seed yield components, fertile tillers m-2 (Y1), spikelets per fertile tiller (Y2), florets per spikelet (Y3), seed number per spikelet (Y4) and seed weight (Y5), were counted and Z was determined in field experiments from 2003 to 2006 using a large sample size. Y1 was the most important seed yield component describing Z and Y2 was the least. The total direct effects of Y1, Y3 and Y5 on Z were positive, while those of Y4 and Y2 were weakly negative. In the path analyses, the total effects (direct plus indirect) of all components on Z were positive. Over the 4 years combined, the seed yield components Y1, Y2, Y4 and Y5 were significantly (P<0.001) correlated with Z, while in the individual years Y2 was not significantly correlated with Y3, Y4 and Y5 in Pearson correlation analyses of the five components. Therefore, direct selection for large Y1, Y2 and Y3 would be an effective way to select for high seed yield in grass breeding programs. Most importantly, ridge regression yielded a steady algorithm model relating Z to the five yield components, which can closely estimate seed yield from the components.


Introduction
Forages are the backbone of sustainable agriculture and environmental regeneration in arid land [1]. Perennial forage crops play a major role in providing high quality feed for the economical production of meat, milk and fiber products [2]. Perennial forage crops are also important in soil conservation and environmental protection [3], as they add organic matter to the soil and serve as a permanent ground cover preventing soil erosion [4]. In addition, perennial grasses are potentially useful for crop improvement as they possess important germplasm or genes for tolerance to harsh environments (field conditions) [5,6].
Russian wildrye (Psathyrostachys juncea Nevski) is a perennial grass that grows rapidly, is highly tolerant of drought and CaCO3, and has a low fertility requirement [7,8,9,10]. It is a cool-season forage species well adapted to semi-arid climates [3,11]. It is a perennial bunchgrass characterized by dense basal leaves that retain their nutritive value better during late summer and autumn than those of many other grasses [12].
Established stands of Russian wildrye provide excellent grazing for livestock and wildlife on semi-arid rangelands of the Intermountain West and the Northern Great Plains in North America [3,13,14]. It is also very competitive and high-yielding, and is an excellent source of forage for livestock and wildlife on semi-arid rangelands [12] in Eurasia and northwest China [4,9,10,11,15,16]; it is, moreover, an important forage crop for revegetating rangeland in North America [17] and northwest China [1,9]. In addition, Russian wildrye is cross-pollinated and relatively self-sterile [14]. It is the only agriculturally important species in the genus Psathyrostachys, which is a member of the Triticeae tribe [16,18], and it is also considered to be an important germplasm for crop improvement as it possesses resistance to barley yellow dwarf virus (BYDV) [1,3,10,19].
Use of Russian wildrye is limited by its unstable seed production [1]. The most probable reason is that breeding programs have focused on developing Russian wildrye cultivars with a high biomass yield, while improvement of seed yield has been neglected. Seed yield is a quantitative character that is largely influenced by the environment and hence has a low heritability [20]. Therefore, the response to direct selection for seed yield may be unpredictable unless environmental variation is well controlled. To select for higher seed yield, it is necessary to examine the mathematical relationships among various characters, especially between seed yield and key seed yield components, which show a certain amount of interdependence [21]; e.g. seed yield components affect seed yield not only directly, but also indirectly by affecting other yield components in negative or positive ways [22]. In such situations, knowledge of the nature of genetic variability and of the interrelationships among seed yield and key yield components would facilitate breeding improvement of these traits [23]. This knowledge would also benefit future breeding programs in Russian wildrye. To our knowledge, no information is available on the mathematical relationship between seed yield and seed yield components in Russian wildrye.
Path analysis provides a method of separating direct and indirect effects and measuring the relative importance of the causal factors involved. Several researchers have used this method to assess the importance of the components of yield [20,23,24,25]. The advantage of path analysis is that it permits the partitioning of the correlation coefficient into its components, one component being the path coefficient that measures the direct effect of a predictor variable upon its response variable; the second component being the indirect effect(s) of a predictor variable on the response variable through another predictor variable [26]. In agriculture, path analysis has been used by plant breeders to assist in identifying traits that are useful as selection criteria to improve crop yield [26,27].
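The partitioning described above can be sketched numerically. The following minimal Python example assumes the standard formulation of path analysis on standardized variables, in which the direct effects p solve R p = r (R being the correlation matrix among the predictors and r their correlations with the response) and the indirect effect of predictor i via predictor j is r_ij × p_j. The function name and the correlation values are hypothetical and are not data from this study.

```python
import numpy as np

def path_coefficients(R_xx, r_xy):
    """Split each predictor-response correlation into direct and indirect effects."""
    R_xx = np.asarray(R_xx, dtype=float)
    r_xy = np.asarray(r_xy, dtype=float)
    # Direct effects (path coefficients) solve R_xx @ p = r_xy.
    direct = np.linalg.solve(R_xx, r_xy)
    # indirect[i, j] = r_ij * p_j: effect of predictor i on the response via predictor j.
    indirect = R_xx * direct[np.newaxis, :]
    np.fill_diagonal(indirect, 0.0)
    # Direct plus all indirect effects reproduces the original correlation.
    total = direct + indirect.sum(axis=1)
    return direct, indirect, total

# Hypothetical correlations among three components and with yield (illustration only).
R_xx = np.array([[1.0, 0.5, -0.2],
                 [0.5, 1.0, 0.1],
                 [-0.2, 0.1, 1.0]])
r_xy = np.array([0.8, 0.4, 0.1])

direct, indirect, total = path_coefficients(R_xx, r_xy)
print("direct effects:", direct)
print("indirect effects (row i via column j):")
print(indirect)
```

Under this formulation, the direct effect of a predictor and the sum of its indirect effects add up to its simple correlation with the response, which is why a component can show a negative direct effect and still be positively correlated with yield.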
For grass crops, the correlation of economic yield components with seed yield and the partitioning of the correlation coefficient into its components of direct and indirect effects have been extensively reported: e.g. highly significant associations of grain yield were observed with 1000-grain weight and tiller number per plant [28,29], and with the number of filled grains per panicle and harvest index [30]. Grain yield has been influenced by high direct effects of total tillers and days to flowering [31]; the number of panicles per plant, the number of filled grains per panicle and 1000-grain weight; the number of filled grains per panicle and plant height; productive tillers, panicle length and flowering time [21,32]; plant height and tiller number; panicle number per plant; spikelet number per panicle; the number of effective tillers per plant, grains per panicle and 1000-grain weight; grains per panicle and productive tillers [33]; the number of filled grains per panicle and 1000-grain weight [34]; and biological yield, harvest index and 1000-grain weight, etc. However, few reports address the seed yield components of forage grasses. Such detailed cause and effect mathematical relationships have not been examined in Psathyrostachys juncea Nevski.
However, morphological characters influencing yield are often highly inter-correlated, leading to multi-collinearity when the inter-correlated variables are regressed against seed yield in a multiple-regression equation. For such situations estimation of regression coefficients through ridge-regression was developed by Hoerl and Kennard [35] to ameliorate problems like inflation in absolute value of the regression coefficients and wrong sign of the regression coefficients resulting from these inter-correlated variables.
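As an illustration of the problem and the remedy described above, the following minimal Python sketch compares ordinary least squares (k = 0) with a ridge estimate of the form (X'X + kI)^-1 X'y on standardized variables. The data are simulated, and the function names and the value k = 0.1 are assumptions made for the example only; they are not taken from this study.

```python
import numpy as np

def standardize(A):
    """Center each column and scale it to unit length, so A.T @ A is a correlation matrix."""
    A = np.asarray(A, dtype=float)
    A = A - A.mean(axis=0)
    return A / np.sqrt((A ** 2).sum(axis=0))

def ridge_coefficients(X, y, k):
    """Ridge estimate (X'X + kI)^-1 X'y on standardized data; k = 0 is ordinary least squares."""
    Xs = standardize(X)
    ys = standardize(np.asarray(y, dtype=float).reshape(-1, 1)).ravel()
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + k * np.eye(p), Xs.T @ ys)

# Simulated predictors that are almost perfectly inter-correlated (illustration only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
y = x1 + x2 + rng.normal(size=200)
X = np.column_stack([x1, x2])

b_ols = ridge_coefficients(X, y, k=0.0)    # can be inflated or carry a wrong sign
b_ridge = ridge_coefficients(X, y, k=0.1)  # shrunk towards a stable solution
print("OLS coefficients:  ", b_ols)
print("ridge coefficients:", b_ridge)
```

Adding k to the diagonal of the correlation matrix trades a small bias for a large reduction in the variance of the coefficients, which is the rationale for using ridge regression when the predictors are strongly inter-correlated.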
Based on multi-factor orthogonal designs of various field experimental management treatments, with a large sample size, the main objective of this study was to examine the mathematical relationships between seed yield (Z) and the key seed yield components: fertile tillers m-2 (Y1), spikelets per fertile tiller (Y2), florets per spikelet (Y3), seed number per spikelet (Y4) and seed weight (mg) (Y5) in Russian wildrye. Theoretically, seed yield and, if one floret equals one seed embryo in grasses, seed yield potential can each be expressed as a formula of these components. The mathematical relationship was examined using path coefficient and ridge regression analysis. Our hypotheses were that 1) all five seed yield components and seed yield are inter-correlated, and all five seed yield components contribute positively to seed yield, and 2) the relationship between seed yield and the five seed yield components should be a steady algorithm model that can closely estimate seed yield from the components.

Results
Pearson correlation coefficients for the four years combined show that seed yield components Y1, Y2 and Y4 were significantly (P<0.0001) positively correlated with Z, while Y5 was significantly (P<0.01) negatively correlated with Z (Table 1). There were significant negative correlations between Y1 and Y3 and between Y1 and Y5, while the correlation between Y2 and Y5 was non-significant (Table 2).
Direct and indirect effects of Y1 to Y5 on seed yield are presented in Table 3. In the individual years from 2003 to 2006, all five seed yield components were significantly correlated with Z in at least one year (Table 2); however, path analysis showed that only Y1 had a strong direct effect (highlighted in bold in Table 3; Table 4). As for the contributions of Y1 to Y5 to Z, taking the results of the 4 years together, the strongest indirect effect on Z was that of Y2 via Y1 (the coefficients are 0.2317, 0.4805, 0.2117 and 0.4015), followed in order by Y1 via Y2 (0.0604, 0.2260, 0.1681 and 0.2595) and Y3 via Y4 (0.1025, 0.2212, 0.0187 and 0.1202). Y5 via Y2 had a slightly negative indirect effect on Z (-0.0042, -0.0739, -0.0502 and -0.0289). The direct effects (highlighted in bold) of Y2 on Z were negative in 3 years (2003, 2004 and 2006) and positive in 1 year (2005); thus Y2 contributed least to Z. Y3 had positive effects on Z in all four years, whereas Y4 and Y5 each had a negative effect in one year. In addition, comparison of their coefficients in Table 3 shows that Y5 contributed more to Z than Y4.
Thus, the contributions of the five seed yield components to seed yield rank as Y1 > Y3 > Y5 > Y4 > Y2. This order is the same as that of the total direct effects (2.9994, -0.2089, 0.8717, -0.0279 and 0.5881 for Y1 to Y5, listed in Table 3), with Y4 and Y2 having negative effects, but the order of the total effects is Y1 > Y3 > Y4 > Y5 > Y2 (3.9808, 0.2489, 1.3569, 0.6346 and 0.6266 for Y1 to Y5, listed in Table 3).
Duncan's multiple range test for seed yield (Z) and its components (Y1 to Y5) showed that Z was significantly highest in 2004, followed by 2003, which was significantly higher than 2005 and 2006 (Table 4). Y1 was highest in 2004, the year that produced the highest Z. Except in 2003, Y3 did not differ significantly (P<0.05) among the remaining three years. Ridge regression and multiple regression were applied to cope with the high inter-correlation and multi-collinearity among Y1 to Y5 in relation to Z [35,36,37,38,39].
Several procedures have been proposed for selecting k in ridge regression analysis, although the optimal value of k cannot be determined with certainty [36,37,39,40]; it has been suggested that k be determined from the ridge trace, selecting k such that a stable set of regression coefficients is obtained [38]. In this study, the ridge regression results are presented in Figure 1 and Table 4. Partly due to sample size, the ridge models for 2005 and 2006 were significant at Pr<0.05.
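The idea of reading k from a ridge trace can be sketched as follows. This minimal Python example computes the standardized ridge coefficients over a grid of k values and then applies a simple numerical stand-in for visual inspection of the trace. The stability rule (no coefficient moving by more than a tolerance per grid step), the tolerance, and the correlation values are all assumptions made for illustration; the study itself selected k from the plotted trace.

```python
import numpy as np

def ridge_trace(R_xx, r_xy, ks):
    """Return one row of standardized ridge coefficients for each value of k."""
    p = R_xx.shape[0]
    return np.array([np.linalg.solve(R_xx + k * np.eye(p), r_xy) for k in ks])

def first_stable_k(ks, trace, tol=0.01):
    """Hypothetical rule: smallest k from which no coefficient moves more than tol per grid step."""
    steps = np.abs(np.diff(trace, axis=0)).max(axis=1)
    for i in range(len(steps)):
        if np.all(steps[i:] < tol):
            return ks[i]
    return ks[-1]

# Hypothetical, strongly collinear pair of predictors (illustration only).
R_xx = np.array([[1.0, 0.98],
                 [0.98, 1.0]])
r_xy = np.array([0.70, 0.60])

ks = np.linspace(0.0, 1.0, 101)
trace = ridge_trace(R_xx, r_xy, ks)
k_star = first_stable_k(ks, trace)
print("coefficients at k = 0:", trace[0])
print("selected k:", k_star)
```

In this constructed case the second coefficient is negative at k = 0 although its predictor is positively correlated with the response; the sign is corrected once k is large enough for the trace to flatten.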
The Z and Y1 to Y5 values of all 315 samples in the 4-year database were transformed to natural logarithms, denoted S and C1 to C5, and S and C1 to C5 were then entered into the ridge regression analysis to obtain the ridge regression model, formula (2). Formula (2) was used to estimate the seed yield of all 315 samples, denoted Z estimated; the actual seed yields were denoted Z actual. A general linear regression model was then used to assess Z actual against Z estimated. The analysis of variance for the dependent variable Z actual and the parameter estimates for Z estimated are shown in Tables 5 and 6. The fitted line and its regression model are presented in Figure 2. By the variance test, the parameter estimates of the intercept and of Z estimated were 0.00153 and 0.99999, respectively (Table 7), and the fitted line, presented in Figure 3, coincided with the 1:1 line.

Discussion
The results suggest that our first hypothesis, that Y1 to Y5 and Z are all inter-correlated and that all five key seed yield components contribute positively to Z, could not be validated. However, our second hypothesis was supported: a steady algorithm model that can estimate seed yield from the components was found.

Seed yield components and seed yield
Results show that the total direct effects of Y1, Y3 and Y5 on Z were positive, whereas those of Y4 and Y2 were negative; the total effects (indirect + direct) of Y1 to Y5 on Z were all positive. The negative effects of Y2 and Y4 were mainly canceled out by the effects of Y1 via Y2 (Y1→Y2) and Y3 via Y4 (Y3→Y4), respectively. No previous results are available on negative effects of Y2 and Y4 in Russian wildrye. Firstly, Y2 is mostly under genetic control [41,42] (Table 4). Y4 showed the same trend as Y2 with aging, from 2.14 in 2003 to 1.75 in 2006. The weak negative effect of a large seed number (Y4) on seed yield may result from limited soil nutrition at higher density [43]. Secondly, it may be a true mathematical relationship [44,45], in fescues [46,47], in zoysiagrass [48], in smooth brome [49], in perennial ryegrass [50] and in grasses [2,51] and legumes [51,52]. In addition, the inference that path analysis can uncover the relationships between the components and the yield agrees with parallel results [53,54,55,56]. As a seed yield component (Y1 to Y5) can affect other components positively or negatively, it is clear that measuring simple linear relationships between two components by correlation analysis does not predict the success of selection. With standardized variables, however, path analysis effectively determined the relative importance of direct and indirect effects on Z.

Steady algorithm model to estimate Z via Y1 to Y5
An exponential model was established for estimating Z from Y1 to Y5. Firstly, it was derived from the data of 315 samples under various growing management regimes over 4 successive years. Secondly, the order of the exponent values in the model was the same as that of the contributions of the five components to Z; this means that the path coefficient analysis and the ridge regressions corresponded closely. Thirdly, all four ridge regression models of the individual years were significant (2003 and 2004; Table 4). In addition, with multi-factor orthogonal experimental designs and large-sample statistical analysis in the field experiment, the significant (at P = 0.0001 and 0.01) coefficients of the correlation, path analyses and ridge regressions show that the models are reliable and that ridge regression effectively overcame the problem of highly inter-correlated predictor variables (Y1 to Y5) [35,36]. This research method may be an efficient and effective one for field crop experiments [39,57,58]. Unfortunately, the coefficients of the ridge regression models varied among individual years, ranging from 0.651 to 510.83 (Table 4), possibly mainly due to aging of the plants, the designed field management and varying climate.

Not all the five components and Z are inter-correlated
Although the experiment was set under various conditions with a large sample size, the results of the correlation analyses appear consistent with biological theory. Except for Y1 with Y2 and Y1 with Z, the significant correlations varied among years. This was probably a consequence of the climate in the individual years, as the field management was repeated each year.

The relationships of Z and Ys are highly associated with the climate
Owing to the various designed field management treatments (experimental factors X1 to X10), there was a very wide range of seed yield and its yield components (Table S2); besides aging of the plants, this was mainly an effect of the weather conditions of the 4 years (Figure S1). For example, rainfall in June, the seed growing period, was higher in 2003 and 2004 than in 2005 and 2006, which partly resulted in higher seed yields because it favored pollination and grain filling. The highest rainfall occurred in March 2005, which also had lower air temperature; this facilitated vegetative growth and decreased Y1 (Table 4) and consequently resulted in a lower Z. In comparison, the highest Z coincided with higher temperatures in March and April 2004 than in other years. However, Y2 and Y3 decreased slightly with aging of the plants from 2003 to 2006; they might be controlled to some degree by genotype at this experimental site.

Conclusions
Via ridge regression analysis with a large sample size in Psathyrostachys juncea Nevski, a model of seed yield in terms of its five components was established. The total direct effects of Y1, Y3 and Y5 on seed yield were positive, whereas those of Y4 and Y2 were weakly negative; the total effects (direct plus indirect) of all components on seed yield were positive in the path analyses. Except for Y3, the components (Y1, Y2, Y4 and Y5) were significantly (P<0.001) correlated with seed yield, whereas Y2 was not significantly correlated with Y3, Y4 and Y5 in Pearson correlation analyses. Y1 was the major component, having the most important effect among the five components on seed production. Therefore, direct selection for large Y1, Y2 and Y3 would be an effective way to select for high seed yield in grass breeding programs.
Future studies may consider climate, e.g. rainfall and temperature during the seed growing stage, and different site locations when determining and testing algorithm models of seed yield with the seed yield components in grasses.

Research Location and field conditions
Field experiments were conducted at the China Agricultural University Grassland Research Station, located in the Hexi Corridor, in Jiuquan, Gansu province, northwestern China (Table S1).

Experimental design
To simulate various growing conditions, the experiment used six groups (Group A to F) of plots laid out in multi-factor orthogonal field experimental designs [57,59,60,61] (Table S1). In total, 143 experimental plots with different treatment combinations were arranged. Each individual plot had an area of 28 m2 (i.e. 4 m × 7 m), with 1.5 m spacing between adjacent plots. Weather data for the experimental site were provided by the Meteorological Working Station in Jiuquan, Gansu province, P. R. China (Figure S1).

Data collection
Ten samples of 1 m row length were randomly selected in each plot for measuring the five seed yield components from anthesis to seed harvest in each year from 2003 to 2006. To avoid edge effects, the outer 1 m of each plot was left out, so that all samples were taken from the middle of the plot. The seed yield components and seed yield of each plot were collected as follows. The 1 m row samples were randomly selected for measuring fertile tillers m-2 (Y1). Respectively, 30 to 36 fertile tillers and 27 to 54 spikelets were randomly selected for measuring spikelets per fertile tiller (Y2), florets per spikelet (Y3) and seed number per spikelet (Y4). When the seed heads were ripe, four samples of 1 m row length were threshed separately by hand; the clean seed of each sample was weighed at a seed water content of 7 to 10% and converted into seed yield (kg hm-2) (Z), and 10 lots of 100 grains were randomly taken from the samples to determine seed weight (mg) (Y5). Over the 4 years, the total numbers of samples (n) of Y1 to Y5 and Z were 3150, 10080, 9135, 11970, 3150 and 1260, respectively (Table 8). Sample sizes for the individual years are listed in Table 8 (see also Table S3).
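The sample-size totals reported above follow from multiplying the number of plots by the number of samples taken within each plot. The short Python sketch below reproduces that arithmetic from figures given in this paper (315 plot-level observations, decomposed as 105+134+60+16 in the Statistics section). Assigning those four counts to 2003 to 2006 in that order is an assumption of the example, as are the variable and function names.

```python
# Plot-level sample counts; the year assignment is assumed to follow the order in the text.
plots_per_year = {2003: 105, 2004: 134, 2005: 60, 2006: 16}

def total_samples(n_plots, per_plot):
    """Total sample size = number of plots x samples taken within each plot."""
    return n_plots * per_plot

total_plots = sum(plots_per_year.values())

n_y1 = total_samples(total_plots, 10)  # ten 1 m row samples per plot
n_z = total_samples(total_plots, 4)    # four threshed 1 m row samples per plot
n_y5 = total_samples(total_plots, 10)  # ten 100-seed lots per plot
n_y2_2003 = total_samples(plots_per_year[2003], 36)  # 36 fertile tillers per plot in 2003

print(total_plots, n_y1, n_z, n_y5, n_y2_2003)
```

The totals for Y2, Y3 and Y4 over all four years cannot be reproduced in the same way without the per-year within-plot sample sizes, which varied between 30 and 36 tillers and 27 and 54 spikelets.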

Statistics and Analytical Method
Analyses of variance and Pearson correlation analyses were performed using SAS Version 8.2 [62]. The general linear model (PROC GLM) was used to assess the ridge model. A QBasic program was written for the path coefficient analysis, and Duncan's multiple range tests for Z and Y1 to Y5 were performed. Data were transformed when necessary, using logarithmic and power transformations, to mitigate the multi-collinearity caused by the high inter-correlation among Y1 to Y5 in relation to Z.
To establish a reliable model, the data for Z and Y1 to Y5 were combined in Visual FoxPro, totaling 315 samples of Z (105+134+60+16 = 315) with their corresponding components (Y1 to Y5) over the four years studied, and were transformed to natural logarithms because, mathematically, this does not influence the essential relations among the variables [37,39,63].
If S = ln Z and Ci = ln Yi (i = 1 to 5), then S and C1 to C5 were used for the ridge regression analyses [39]. The ridge regression model is $S = C\beta + u$, where S is an n × 1 vector of observations on a response variable, C is an n × p matrix of observations on p explanatory variables, $\beta$ is the p × 1 vector of regression coefficients and u is an n × 1 vector of residuals satisfying $E(u) = 0$, $E(uu') = \sigma^2 I$. It is assumed that C and S have been scaled so that C'C and S'S are matrices of correlation coefficients [39]. Here n = 315, p = 5. Thus, the logarithmic model (7) is $S = a + \sum_{i=1}^{5} \beta_i C_i$. The above logarithmic model (7) was transformed to an exponential function, formula (8), $Z = e^{a} \prod_{i=1}^{5} Y_i^{\beta_i}$, where a and $\beta_i$ are constants. Formula (8) was used to estimate the Z of all 315 samples, denoted Z estimated; the actual seed yields were denoted Z actual.
Notes to Table 8. a: 100 seeds were taken as one sample at a seed water content of 7 to 10%, and the 10 samples of 100 seeds in each plot were averaged to obtain one seed weight (Y5) value for the plot; the total sample size (n) of Y5 = 10 × 105 = 1050 in 2003. b: Total sample size (n) = number of plots (N) × sample size within each plot (n); e.g., the number of spikelets per fertile tiller was counted on 36 fertile tillers in each plot in 2003 and then averaged as spikelets per fertile tiller (Y2) of the plot, so the total sample size (n) of Y2 = 105 × 36 = 3780. doi:10.1371/journal.pone.0018245.t008
A general linear regression model was used to assess Z actual against Z estimated, and an analysis of variance was used to assess the dependent variable Z actual and the parameter estimates for Z estimated. The linear regression model is formula (9), $Z_{actual} = b + k \cdot Z_{estimated}$, where b is the intercept and k is the slope. So, via formula (9), the model was adjusted to $Z = b + k \cdot e^{a} \cdot \prod_{i=1}^{5} Y_i^{\beta_i}$, with a and $\beta_i$ the constants of the exponential model (8). The separate analyses for the four years provided useful information. Simple statistics (PROC MEANS) were computed on the results and ridge plots were drawn.
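The two steps described above, back-transforming the logarithmic model to an exponential one and then regressing actual on estimated yield, can be sketched in Python as follows. The sketch assumes the log-linear form ln Z = a + sum(b_i × ln Y_i), so that Z = exp(a) × prod(Y_i ^ b_i); the coefficient values, the simulated component data and the function names are hypothetical and are not the estimates of this study.

```python
import numpy as np

def predict_yield(a, b, Y):
    """Exponential form Z = exp(a) * prod(Y_i ** b_i) of ln Z = a + sum(b_i * ln Y_i)."""
    Y = np.asarray(Y, dtype=float)
    return np.exp(a + np.log(Y) @ np.asarray(b, dtype=float))

def one_to_one_check(z_actual, z_estimated):
    """Least-squares fit of z_actual = intercept + slope * z_estimated."""
    slope, intercept = np.polyfit(z_estimated, z_actual, deg=1)
    return intercept, slope

# Hypothetical constants and simulated component values (illustration only).
a, b = -2.0, [1.0, 0.5, 0.3, 0.2, 0.4]
rng = np.random.default_rng(1)
Y = rng.uniform(1.0, 10.0, size=(50, 5))

z_est = predict_yield(a, b, Y)
# Simulated "actual" yields: the estimates perturbed by multiplicative noise.
z_act = z_est * np.exp(rng.normal(scale=0.05, size=50))

intercept, slope = one_to_one_check(z_act, z_est)
print("intercept:", intercept, "slope:", slope)
```

An intercept near 0 and a slope near 1 indicate that the fitted line lies close to the 1:1 line, which is the criterion used in this paper to judge agreement between estimated and actual seed yield.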