Predicting yield of individual field-grown rapeseed plants from rosette-stage leaf gene expression

doi:10.1371/journal.pcbi.1011161

Fig 1.

Field trial layout and PCA plots for gene expression and phenotypes.

A. Plants were sown on a 10 × 10 equispaced grid with 0.5 m between rows and columns. Plant identifiers combine a number indicating the row (01–08) and a letter indicating the column (A–H) in which the plant was sown. Only plants with leaf 8 gene expression and phenotype profiles are labeled; border plants and grid positions at which no plants emerged are indicated by grey squares. B. Plot of the first two principal components of the leaf 8 gene expression dataset after normalization and RNA-seq batch correction (see Methods). C. Plot of the first two principal components of the phenotype dataset. Individual plants in B and C are colored according to the color gradient in A, so that similar colors indicate spatial proximity in the field.

Table 1.

Numbers of significant gene expression-phenotype associations.

Table 2.

Best-performing multi-gene and single-gene models for each phenotype.

Fig 2.

Predictions versus observations for the best-scoring leaf and yield phenotypes.

A. Predicted versus measured values for leaf 8 width (76 DAS), using the all-genes model with the best median test R² score (enet + median feature selection, Table 2). B. Predicted versus measured values for seed weight stem 1, using the all-genes model with the best median test R² score (enet + Spearman feature selection, Table 2). Vertical grey lines range from the minimum to the maximum predicted value for a given plant across all model repeats, and colored dots represent predictions for the repeat with the median pooled R² score (i.e. the R² score of the pooled test set predictions in the repeat concerned). Different marker colors indicate the 10 different test sets in this repeat. Perfect predictions are located on the dashed diagonal line in each panel. Similar plots for other phenotypes are presented in S6 Fig.
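
The pooled R² score, the choice of the median repeat, and the per-plant prediction range described in this caption can be sketched as follows. This is a minimal illustration and not the authors' code: the function names (`r2`, `pooled_r2`, `median_repeat_index`, `prediction_range`) and the data layout are assumptions, and for an even number of repeats the sketch picks the lower of the two middle repeats, a rule the caption does not specify.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def pooled_r2(test_sets):
    """R² over the concatenated test-set predictions of one repeat.

    test_sets: list of (y_true, y_pred) pairs, one pair per test set.
    """
    y_true = np.concatenate([np.asarray(t, dtype=float) for t, _ in test_sets])
    y_pred = np.concatenate([np.asarray(p, dtype=float) for _, p in test_sets])
    return r2(y_true, y_pred)

def median_repeat_index(pooled_scores):
    """Index of the repeat with the median pooled R² (lower median if even)."""
    order = np.argsort(pooled_scores)
    return int(order[(len(order) - 1) // 2])

def prediction_range(preds_by_repeat):
    """Per-plant (min, max) prediction across repeats.

    preds_by_repeat: array of shape (n_repeats, n_plants).
    """
    preds = np.asarray(preds_by_repeat, dtype=float)
    return preds.min(axis=0), preds.max(axis=0)
```

Under these assumptions, the vertical grey lines correspond to `prediction_range`, and the colored dots to the predictions of the repeat returned by `median_repeat_index`.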

Fig 3.

Top predictor genes in RF models of leaf phenotypes.

A clustered heatmap of the z-scored gene expression profiles of the top genes for predicting leaf phenotypes is shown centrally (blue-red color scale, Ward.D2 hierarchical clustering). The leaf phenotypes concerned and their z-scored profiles across plants are shown at the bottom (dark blue-yellow heatmap with plant identifiers at the bottom). For each of these phenotypes, the top-10 most important genes (highest median Gini importance across all 90 cross-validation splits) of the RF model with the highest median test R² score are included on the figure (gene identifiers are shown at right). The mostly dark blue score panel to the left of the expression heatmap shows the median Gini importance scores of the selected genes in each of the selected phenotype models, normalized to the maximum importance score per model to make the color scales of the different models (columns) comparable. The mostly yellow frequency panel to the left of the score panel shows the frequencies at which genes were selected as features across all 90 cross-validation splits of a given model. Grey squares in the score and frequency panels indicate that a given gene was not selected as a feature in a given model. The phenotypes in the score and frequency panels are identified by numbers (1–8) on top of the panels, corresponding to the numbers associated with the phenotypes in the bottom phenotype panel. On top of the score panel, the feature selection techniques used in the best-scoring RF models for each phenotype are shown (median = selection of features with median rlog gene expression > 0, spearman = Spearman correlation, hsic-5000 = HSIC lasso, see Methods), as well as the corresponding test and pooled R² scores rounded to the nearest 0.1 and then multiplied by ten (e.g. a test R² score of 0.38 would be denoted as 4).
Genes that are also found in the top-10 enet predictor lists for leaf phenotypes (S9 Fig) are highlighted in red, while genes that are also found in the top-10 enet or RF predictor lists for yield phenotypes (Figs 4 and S10) are highlighted in blue. Genes found in both the top-10 enet predictor lists for leaf phenotypes and the top-10 enet or RF predictor lists for yield phenotypes are highlighted in magenta.
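
The per-model normalization of the score panel and the selection frequencies of the frequency panel can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names and array layouts are assumptions, and `NaN` is used here as a stand-in for the grey squares (gene not selected in a model).

```python
import numpy as np

def normalize_per_model(median_scores):
    """Scale each model (column) by its own maximum score.

    median_scores: array of shape (n_genes, n_models); NaN marks a gene
    that was not selected in that model. Each column is assumed to hold
    at least one non-NaN value.
    """
    scores = np.asarray(median_scores, dtype=float)
    col_max = np.nanmax(scores, axis=0)  # per-model maximum, ignoring NaN
    return scores / col_max              # NaN entries stay NaN

def selection_frequency(selected):
    """Fraction of cross-validation splits in which each gene was selected.

    selected: boolean array of shape (n_splits, n_genes).
    """
    return np.asarray(selected, dtype=bool).mean(axis=0)
```

After normalization the top gene of every model has a score of 1, which is what makes the columns comparable on a single color scale.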

Fig 4.

Top predictor genes in enet models of yield phenotypes.

A clustered heatmap of the z-scored gene expression profiles of the top genes for predicting yield phenotypes is shown centrally (blue-red color scale, Ward.D2 hierarchical clustering). The yield phenotypes concerned and their z-scored profiles across plants are shown at the bottom (dark blue-yellow heatmap with plant identifiers at the bottom). For each of these phenotypes, the top-10 most important genes (highest median elastic net coefficients across all 90 cross-validation splits) of the enet model with the highest median test R² score are included on the figure (gene identifiers are shown at right). The mostly green-blue score panel to the left of the expression heatmap shows the median elastic net coefficients of the selected genes in each of the selected phenotype models, normalized to the maximum coefficient per model to make the color scales of the different models (columns) comparable. The mostly yellow frequency panel to the left of the score panel shows the frequencies at which genes were selected as features across all 90 cross-validation splits of a given model. Grey squares in the score and frequency panels indicate that a given gene was not selected as a feature in a given model. The phenotypes in the score and frequency panels are identified by numbers (1–8) on top of the panels, corresponding to the numbers associated with the phenotypes in the bottom phenotype panel. On top of the score panel, the feature selection techniques used in the best-scoring enet models for each phenotype are shown (median = selection of features with median rlog gene expression > 0, spearman = Spearman correlation, hsic-5000 = HSIC lasso, see Methods), as well as the corresponding test and pooled R² scores rounded to the nearest 0.1 and then multiplied by ten (e.g. a test R² score of 0.38 would be denoted as 4). 
Genes that are also found in the top-10 RF predictor lists for yield phenotypes (S10 Fig) are highlighted in red, while genes that are also found in the top-10 enet or RF predictor lists for leaf phenotypes (Figs 3 and S9) are highlighted in blue. Genes found in both the top-10 RF predictor lists for yield phenotypes and the top-10 enet or RF predictor lists for leaf phenotypes are highlighted in magenta.
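
The compact R² labels shown on top of the score panel follow a simple rule that can be written out as below. The helper name `encode_r2` is hypothetical, and the tie-breaking behavior is an assumption: Python's `round` resolves exact ties to the nearest even integer, whereas the caption does not state a tie rule.

```python
def encode_r2(r2_score):
    """Express an R² score in tenths, rounded to the nearest integer.

    Equivalent to rounding to the nearest 0.1 and multiplying by ten,
    apart from floating-point tie cases.
    """
    return int(round(r2_score * 10))

# Example from the caption: a test R² score of 0.38 is denoted as 4.
label = encode_r2(0.38)
```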

Table 3.

Best-performing multi-phenotype and single-phenotype models for mature plant phenotypes.

Fig 5.

Predictive power of early rosette areas for yield phenotypes.

In each subplot, median test R² values are plotted for lme models predicting the given phenotype from early rosette areas v2 (14–42 DAS, x-axis). Only mature phenotypes that can be predicted from rosette area (42 DAS) with a median test R² > 0.1 are shown. Blue lines are ordinary least-squares linear regressions, with shaded areas indicating 95% confidence intervals on the trendline. Most phenotypes exhibit a rather dichotomous median test R² profile with rosette areas v2 from 14 to 28 DAS yielding substantially lower median test R² values than rosette areas v2 from 32 to 42 DAS. Accordingly, linear model fits at 28 and 32 DAS are often poor.
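
Why a step-like R² profile leads to poor linear fits around 28 and 32 DAS can be illustrated with a small sketch. The R² values below are synthetic placeholders chosen only to mimic a dichotomous profile, not values from the study, and only the time points named in the caption are used. `np.polyfit` with `deg=1` performs an ordinary least-squares line fit.

```python
import numpy as np

# Time points named in the caption (days after sowing).
das = np.array([14.0, 28.0, 32.0, 42.0])
# Synthetic, illustrative R² profile: low up to 28 DAS, higher from 32 DAS.
median_r2 = np.array([0.0, 0.0, 0.3, 0.3])

# Ordinary least-squares straight-line fit.
slope, intercept = np.polyfit(das, median_r2, deg=1)
fitted = slope * das + intercept
residuals = median_r2 - fitted

# Time points with the two largest absolute residuals.
worst_two = sorted(das[np.argsort(np.abs(residuals))[-2:]].tolist())
```

In this toy profile the line overshoots at 28 DAS and undershoots at 32 DAS, so the largest residuals sit on either side of the step.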
