It is a common occurrence in plant breeding programs to observe missing values in three-way three-mode multi-environment trial (MET) data. We proposed modifications of models for estimating missing observations for these data arrays, and developed a novel approach in terms of hierarchical clustering. Multiple imputation (MI) was used in four ways: multiple agglomerative hierarchical clustering, the normal distribution model, the normal regression model, and predictive mean matching. The latter three models used both Bayesian analysis and non-Bayesian analysis, while the first approach used a clustering procedure with randomly selected attributes and assigned real values from the nearest neighbour to the one with missing observations. Different proportions of data entries in six complete datasets were randomly selected to be missing and the MI methods were compared based on the efficiency and accuracy of estimating those values. The results indicated that the models using Bayesian analysis estimated the missing values slightly more accurately than those using non-Bayesian analysis, but they were more time-consuming. However, the novel approach of multiple agglomerative hierarchical clustering demonstrated the best overall performance.
Citation: Tian T, McLachlan GJ, Dieters MJ, Basford KE (2015) Application of Multiple Imputation for Missing Values in Three-Way Three-Mode Multi-Environment Trial Data. PLoS ONE 10(12): e0144370. https://doi.org/10.1371/journal.pone.0144370
Editor: Alan Hubbard, University of California, Berkeley, UNITED STATES
Received: March 3, 2015; Accepted: November 17, 2015; Published: December 21, 2015
Copyright: © 2015 Tian et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: The principal supervisor (KB) of TT's research higher degree (PhD) study has a research collaboration with Monsanto Company for unrelated activities. TT receives a living allowance from these funds. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The funding for the PhD scholarship for Ms Ting Tian came from Monsanto Company, but there is no employment, patents, products in development, marketed products or any restrictions whatsoever on the research work reported in this manuscript. Unrestricted funding for a scholarship was the requirement for independent consultancy work by Professor Kaye Basford for Monsanto Company. She has no patents, products in development, or marketed products associated with that company. There are no competing interests in relation to the research being reported in this manuscript. This competing interests statement does not alter or impact on the authors' adherence to PLOS ONE policies on sharing data and materials.
Multi-way data analysis has become common in many areas of research involving multivariate data. Three-way three-mode pattern analysis refers to the combined use of clustering and ordination procedures. Its application to multivariate multi-environment trial (MET) data has provided a comprehensive summary of the patterns of variation and the interactions among the three modes (genotypes, environments and attributes) for plant breeders and other scientists interested in plant improvement [1, 2]. However, many multivariate MET datasets are incomplete and the presence of missing values causes complications because most analytical methods developed for multivariate data assume complete data arrays [3, 4]. This is the case for (iterative) clustering and ordination procedures, where the inability to routinely apply them to incomplete datasets has been an obstacle to their wider usage (as a full data array is needed to provide starting values for any necessary iteration). Thus, it is important to obtain the best possible estimates of missing values to form a complete multi-way MET data array which can then be subjected to multi-way pattern analysis.
There are some statistical methods and mathematical algorithms specifically designed to handle incomplete two-way two-mode data matrices. In one of them, multiple imputation (MI) [5, 6] is used to generate different imputed values for each missing value to form different complete datasets. The different complete two-way datasets are then analysed in order to obtain estimates of the parameters of the corresponding models, because these parameters were the main interest for some authors. These different complete datasets are defined as the “estimated data arrays”, as they are the complete data arrays containing the missing values estimated using MI approaches.
While we wanted to use multiple imputation to generate different imputed values for each missing cell (and eventually obtain one estimated data array for each incomplete multivariate MET dataset), the estimation of the (different) parameters in the various models used in the imputation process was not of concern to us. Thus, we focused on using different MI approaches to obtain “good” estimates of the missing values to form a complete “estimated data array” which could then be analysed by three-way three-mode pattern analysis, rather than for parameter estimation. The MI methods mentioned above (for two-way two-mode data matrices) were modified to take into account the three-way structure of multivariate MET data. We also introduced one novel MI approach which does not have an underlying model that can be written in a similar format to the others.
To demonstrate the use of MI for estimating missing values in multivariate MET data, two real complete MET datasets and four simulated complete MET datasets were considered. Missing values were generated by randomly deleting values in the full datasets. The methods were assessed by comparing the original complete data arrays with the “estimated data arrays”, i.e., the complete data arrays containing estimated missing values. This enabled us to compare all of our methods for imputing missing values. Again, we stress that this was more important to us than the relative efficiency of the various estimators for the parameters in the models used in some of the imputation methods.
Brief notation for the three-way three-mode data structure is given in the Materials and Methods. The basic algorithms for the various MI approaches, and their modifications for multivariate MET datasets, are also described there. We then present the six multivariate MET datasets and the random generation of missing values, followed by the results of comparing the original complete data arrays with the complete data arrays containing estimated missing values. We end by discussing the implications of our findings.
Materials and Methods
Three-way three-mode MET data
A MET data array generally consists of I genotypes, J environments and K attributes. It can be written as a collection of frontal slices Xk (I×J matrix, k = 1,…,K), where rows are genotypes and columns are environments, and each cell xijk is the value measured on the ith genotype in the jth environment for the kth attribute [8, 9] (Fig 1).
Each frontal slice Xk corresponds to I genotype responses across J environments for a particular attribute, such as yield, moisture or test weight. Each attribute has its own measurement units. There are two types of vectors for each frontal slice: a row vector (i.e. measurements on the ith genotype for the kth attribute for all J environments) and a column vector xjk (i.e. all I genotype responses measured in the jth environment for the kth attribute). We conducted column standardization in order to remove the environmental main effects, but retain the correlation among attributes (over genotypes) for each environment [1, 10, 11], i.e. we standardized each column vector xjk prior to the analysis. It can be defined as:

$$x^{*}_{ijk} = \frac{x_{ijk} - \bar{x}_{\cdot jk}}{s_{jk}},$$

where $\bar{x}_{\cdot jk}$ and $s_{jk}$ are the mean and standard deviation of the column vector xjk over the I genotypes. The environment main effect is removed as

$$\frac{1}{I}\sum_{i=1}^{I} x^{*}_{ijk} = 0,$$

and the correlation among attributes for each environment is retained as:

$$r_{kk'(j)} = \operatorname{corr}\left(\mathbf{x}^{*}_{jk}, \mathbf{x}^{*}_{jk'}\right) = \operatorname{corr}\left(\mathbf{x}_{jk}, \mathbf{x}_{jk'}\right).$$
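As an illustration of the column standardization described above, the following is a minimal sketch rather than the authors' implementation. It assumes the data are held in a NumPy array of shape (I, J, K), that missing cells are coded as NaN, and that the sample standard deviation (divisor I−1) is used; the function name `column_standardize` is hypothetical.

```python
import numpy as np

def column_standardize(X):
    """Standardize each environment-by-attribute column over genotypes.

    X has shape (I, J, K): genotypes x environments x attributes.
    Missing cells (np.nan) are ignored when computing column statistics.
    """
    mean = np.nanmean(X, axis=0, keepdims=True)        # shape (1, J, K)
    sd = np.nanstd(X, axis=0, ddof=1, keepdims=True)   # assumed: sample sd
    return (X - mean) / sd

# Toy complete array: I = 6 genotypes, J = 3 environments, K = 2 attributes
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(6, 3, 2))
Z = column_standardize(X)

# Environment main effects are removed (column means are zero), while the
# correlation between attributes within an environment is unchanged.
r_before = np.corrcoef(X[:, 0, 0], X[:, 0, 1])[0, 1]
r_after = np.corrcoef(Z[:, 0, 0], Z[:, 0, 1])[0, 1]
```

Because standardization is a positive linear rescaling of each column, `r_before` and `r_after` agree up to rounding.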
Kroonenberg discussed the various types of missing values in three-way three-mode data, and they are described here in terms of multivariate MET data:
- Single observations missing, e.g., individual genotypes in particular environments for a specific attribute are missing. These are missing cells in the three-way three-mode array (Fig 2A).
- Column missing, e.g., a particular attribute is not measured on any genotype in a particular environment. This would correspond to a missing column (xjk) in our three-way array (Fig 2B) and is quite common. A missing row, where a particular attribute is not measured in any environment for a particular genotype, is extremely rare in practice and will not be considered here.
Fig 2. Types of missing values in the three-way three-mode array: (a) missing cells, (b) missing columns.
Initially, we rearranged the standardized multivariate MET data array by writing it as a two-way wide I×JK matrix, where the I rows are the I genotypes and the JK columns are the J environments nested within each of the K attributes. In this matrix, NA indicates that an observation is not available (missing). Note that there are only missing cells and a missing column (corresponding to a particular attribute not being measured on any genotype in a particular environment); we consider one missing column as an example of missing columns.
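The rearrangement into a two-way wide matrix can be sketched as follows. This is an illustrative sketch only: the (I, J, K) NumPy layout, the NaN coding of NA, and the helper names `to_wide` and `missing_pattern` are assumptions, not part of the original analysis.

```python
import numpy as np

def to_wide(X):
    """Unfold an (I, J, K) array into an I x JK matrix.

    Environments are nested within attributes, so column k*J + j holds
    attribute k measured in environment j.
    """
    I, J, K = X.shape
    return X.transpose(0, 2, 1).reshape(I, K * J)

def missing_pattern(W):
    """Return the indices of fully missing columns and the number of
    missing cells that lie outside those columns."""
    cell = np.isnan(W)
    missing_cols = np.flatnonzero(cell.all(axis=0))
    n_cells = int(cell.sum() - len(missing_cols) * W.shape[0])
    return missing_cols, n_cells

I, J, K = 4, 3, 2
X = np.arange(I * J * K, dtype=float).reshape(I, J, K)
X[1, 0, 0] = np.nan      # a single missing cell
X[:, 2, 1] = np.nan      # attribute 1 not measured in environment 2

W = to_wide(X)
missing_cols, n_cells = missing_pattern(W)
```

Here the missing column appears at wide-matrix index 1*J + 2 = 5, and one isolated missing cell remains.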
Multiple imputation approaches
We are interested in two patterns of missing values in the two-way wide matrix of the multivariate MET data, i.e. missing cells (e.g. a data value missing for a particular attribute for a particular genotype in a particular environment) and missing columns (e.g. data values missing for a particular attribute for all genotypes in a particular environment). We consider the different MI approaches in terms of how they take into account these missing patterns when: (1) the estimation task is the same for missing cells and columns; (2) the estimation task is different for missing cells and columns. Under (1), we study imputation approaches based on multiple agglomerative hierarchical clustering (MAHC) and the normal distribution model (NORM). Under (2), we study imputation approaches based on the normal regression model (NRM) and predictive mean matching (PMM) [12–14].
The latter three common MI approaches (NORM, NRM and PMM) were modified in terms of the multivariate MET data structure. In each of these, Bayesian analysis under Gibbs sampling and non-Bayesian analysis were used in implementing the estimation task.
Multiple Agglomerative Hierarchical Clustering (MAHC)
When investigating the genotype response pattern in multivariate MET data, it is most common to cluster the genotypes in terms of the measurements on the attributes in each of the environments. If there were only missing cells, then we could use an agglomerative hierarchical clustering procedure on the two-way wide standardized matrix to provide an estimate of a missing value for a particular genotype (by using the attribute value within the same environment of the genotype (or the genotype group) with which the genotype with the missing value first merged in the agglomerative process). However, if an attribute is not measured on any genotype in a particular environment, we cannot do this. We were not able to devise any procedure for estimating missing values by clustering genotypes when a whole column of the matrix was missing.
It was decided to rearrange the I×JK matrix as a J×IK matrix and cluster the environments. Then even when an attribute was not measured on any genotype within a particular environment, there would be no missing columns in the matrix. Hence we could use agglomerative hierarchical clustering of the environments to replace missing values for a particular attribute for all genotypes in a particular environment by the values of the same attribute for each genotype within the environment (or environment group) with which the environment with missing values first merged. This process of clustering environments to estimate missing values can also be used to estimate missing cells, i.e., when there was a missing value for a particular attribute for a particular genotype for a particular environment.
The process of estimating missing values using agglomerative hierarchical clustering of the environments, as described in the previous paragraph, only provided a single estimate for each missing value. We wanted a multiple imputation approach. To achieve this, we chose different subsets (or combinations) of attributes for each imputation of hierarchical agglomerative clustering of the environments, as that would result in (potentially) different hierarchies of the environments and subsequent estimates of missing values.
We initially tried to implement this MI procedure by choosing Kh attributes (by a uniform sample from the K measured attributes) for imputation h, h = 1,…,H, (H being the total number of imputations) with the value of Kh being any value from 1 to K. We conducted an agglomerative hierarchical clustering of the environments using the measurements of these Kh attributes on the I genotypes in the J environments expressed in the form of a J×IKh two-way wide array. The environment response for a missing genotype-attribute value was estimated by the non-missing genotype-attribute value in the environment (or environment group) to which the environment with the missing value first joined. However, it was possible that particular missing values could not be estimated, because they occurred on an attribute that was not among the Kh attributes chosen for that imputation.
Hence we modified our MI procedure to ensure that all attributes were chosen in each imputation h. Thus for each h, we decided to choose all attributes (as in agglomerative hierarchical clustering) as well as Kh attributes with the value of Kh being any of the values from 1 to K-1. We conducted one agglomerative hierarchical clustering of the environments using all of the K attributes (with data in the form of a J×IK two-way wide matrix) and another agglomerative hierarchical clustering of the environments using the measurements of these Kh attributes (with data in the form of a J×IKh two-way wide array). In each case, the environment response for a missing genotype-attribute value was estimated by the non-missing genotype-attribute value in the environment (or environment group) to which the environment with the missing value first joined. If two estimates were obtained for a particular missing value, they were averaged. This process was repeated H times, and the final estimates of missing values were the averages of the corresponding estimated values from each of the H imputations.
This multiple agglomerative hierarchical clustering used a dissimilarity measure between environments (and environment groups) based on squared Euclidean distance and a grouping strategy which minimized incremental sum of squares. It takes into account different combinations of attributes at each imputation, leading to average values (over the imputations) of the estimated missing values for each cell.
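The MAHC procedure described above can be sketched as follows. This is a simplified illustration under several assumptions that are not stated in the text: distances between environments are computed over jointly observed entries and rescaled, the incremental sum of squares strategy is taken to be Ward's method as provided by SciPy's `linkage`, and when the first-merged partner is a group the donor value is the mean of its non-missing values. All function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def env_distances(E):
    """Euclidean distances between rows of E over jointly observed columns
    (assumption: rescaled to the full number of columns)."""
    J = E.shape[0]
    D = np.zeros((J, J))
    for a in range(J):
        for b in range(a + 1, J):
            ok = ~np.isnan(E[a]) & ~np.isnan(E[b])
            d2 = np.sum((E[a, ok] - E[b, ok]) ** 2) * E.shape[1] / max(ok.sum(), 1)
            D[a, b] = D[b, a] = np.sqrt(d2)
    return D

def first_merge_partners(Z, J):
    """For each environment, the members of the cluster it first merges with."""
    members = {j: [j] for j in range(J)}
    partners = {}
    for step, (a, b, _, _) in enumerate(Z):
        a, b = int(a), int(b)
        if a < J:
            partners[a] = members[b]
        if b < J:
            partners[b] = members[a]
        members[J + step] = members[a] + members[b]
    return partners

def impute_once(X, attrs):
    """Cluster environments on the attributes in attrs; estimate NaNs on them."""
    I, J, K = X.shape
    E = X[:, :, attrs].transpose(1, 2, 0).reshape(J, -1)   # J x (I * len(attrs))
    Z = linkage(squareform(env_distances(E), checks=False), method="ward")
    partners = first_merge_partners(Z, J)
    est = np.full(X.shape, np.nan)
    for i, j, k in zip(*np.where(np.isnan(X))):
        if k in attrs:
            donors = X[i, partners[j], k]
            if not np.all(np.isnan(donors)):
                est[i, j, k] = np.nanmean(donors)
    return est

def nan_average(stack):
    """Average over axis 0, ignoring NaN; NaN where nothing is available."""
    have = ~np.isnan(stack)
    n = have.sum(axis=0)
    s = np.where(have, stack, 0.0).sum(axis=0)
    return np.divide(s, n, out=np.full(s.shape, np.nan), where=n > 0)

def mahc_impute(X, H=20, seed=0):
    rng = np.random.default_rng(seed)
    K = X.shape[2]
    per_imputation = []
    for _ in range(H):
        Kh = rng.integers(1, K)                              # Kh in 1..K-1
        subset = sorted(rng.choice(K, size=Kh, replace=False).tolist())
        both = np.stack([impute_once(X, list(range(K))), impute_once(X, subset)])
        per_imputation.append(nan_average(both))             # average two estimates
    filled = X.copy()
    miss = np.isnan(X)
    filled[miss] = nan_average(np.stack(per_imputation))[miss]
    return filled

rng = np.random.default_rng(42)
X = rng.normal(size=(8, 5, 3))      # toy standardized array (I=8, J=5, K=3)
X[0, 0, 0] = np.nan                 # missing cells
X[3, 4, 2] = np.nan
X[:, 2, 1] = np.nan                 # a missing column
filled = mahc_impute(X, H=10)
```

Each imputation combines a clustering on all K attributes with one on a random subset, so every missing value receives at least one estimate per imputation, which mirrors the modification described above.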
Normal Distribution Model (NORM)
Let xik be the J×1 vector corresponding to the (standardized) responses for the ith genotype grown in all J environments for attribute k. The values are taken from the two-way wide standardized matrix and would have been part of the ith row in that matrix. It is assumed that this vector is multivariate normally distributed with mean vector μik (a J×1 vector) and covariance matrix Σik = σ²ik I (a J×J matrix), with I denoting an identity matrix of size J×J. The mean vector μik and covariance matrix Σik are unknown. This could be written as:

$$\mathbf{x}_{ik} \sim N_J\left(\boldsymbol{\mu}_{ik}, \Sigma_{ik}\right), \qquad \Sigma_{ik} = \sigma^{2}_{ik}\mathbf{I}.$$
We obtain the estimated values of unknown parameters μik and Σik using Bayesian analysis and non-Bayesian analysis as follows.
To conduct the Bayesian analysis, we needed to construct the prior distributions of parameters μik and Σik. Here, we assumed that there was no strong prior information, so the Jeffreys prior distribution would be as follows:

$$p(\boldsymbol{\mu}_{ik}) \propto 1, \qquad p(\sigma^{2}_{ik}) \propto \frac{1}{\sigma^{2}_{ik}}.$$

As a constant value is not a very realistic probability density function, it is an improper prior distribution for μik. However, when applying Bayes’ rule, this constant value prior distribution leads to a proper posterior probability density function, introducing some information about μik and σ²ik. The posterior distribution was then derived by Bayes’ rule:

$$p(\boldsymbol{\mu}_{ik}, \sigma^{2}_{ik} \mid \mathbf{x}_{ik}) \propto p(\mathbf{x}_{ik} \mid \boldsymbol{\mu}_{ik}, \sigma^{2}_{ik})\, p(\boldsymbol{\mu}_{ik})\, p(\sigma^{2}_{ik}).$$

The full conditional of σ²ik depends on the sample covariance matrix for known μik, and the full conditional of μik depends on the sample mean vector. The posterior distributions of parameters μik and σ²ik were the normal distribution and inverse-gamma distribution, respectively.
Using Gibbs sampling, the estimates of the parameters μik and σ²ik were obtained. Gibbs sampling is a Markov chain Monte Carlo (MCMC) method. It generates Z = (Z1,Z2,…,Zn) from a target probability density function (pdf) f(z) by drawing each component in turn from its conditional pdf f(zi | z1,z2,…,zi−1,zi+1,…,zn). During the process of Gibbs sampling, the Markov chain is generated from this sequence of conditional distributions.
From the above derivation, the conditional posterior distributions of parameters μik and σ²ik were the normal and inverse-gamma distributions, respectively. After the full distributions of parameters μik and σ²ik were obtained, the estimates of μik and σ²ik were obtained by generating a large number of samples (here 5500) from those distributions and setting the estimates equal to their expectation values, i.e.

$$\hat{\boldsymbol{\mu}}_{ik} = E\left(\boldsymbol{\mu}_{ik} \mid \mathbf{x}_{ik}\right), \qquad \hat{\sigma}^{2}_{ik} = E\left(\sigma^{2}_{ik} \mid \mathbf{x}_{ik}\right).$$
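Therefore, the missing values for the ith genotype measured on J environments for each attribute k could be drawn from the following equation:

$$\mathbf{x}^{\text{mis}}_{ik} = \hat{\boldsymbol{\mu}}^{\text{mis}}_{ik} + \hat{\sigma}_{ik}\,\mathbf{z}, \tag{1}$$

where the superscript "mis" denotes the entries corresponding to the missing values and z is a random standard normal vector. As a result, H imputed values for each missing value could be obtained by implementing the above process H times.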
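To make the Gibbs sampling step concrete, here is a deliberately simplified sketch and not the model fitted in the paper. It assumes a single scalar mean and variance for the observed entries of one vector, a flat prior on the mean and a prior proportional to 1/σ² on the variance, for which the full conditionals are the standard normal and inverse-gamma distributions. The function name `gibbs_normal` and the toy data are hypothetical.

```python
import numpy as np

def gibbs_normal(x_obs, n_draws=5500, seed=0):
    """Gibbs sampler for (mu, sigma2) of a normal sample.

    Assumed full conditionals (flat prior on mu, 1/sigma2 prior on sigma2):
      mu     | sigma2, x ~ Normal(mean(x), sigma2 / n)
      sigma2 | mu, x     ~ Inverse-Gamma(n / 2, sum((x - mu)^2) / 2)
    """
    rng = np.random.default_rng(seed)
    n = x_obs.size
    xbar = x_obs.mean()
    sigma2 = x_obs.var(ddof=1)               # starting value
    draws = np.empty((n_draws, 2))
    for t in range(n_draws):
        mu = rng.normal(xbar, np.sqrt(sigma2 / n))
        scale = 0.5 * np.sum((x_obs - mu) ** 2)
        sigma2 = scale / rng.gamma(n / 2.0)  # inverse-gamma draw via 1/gamma
        draws[t] = mu, sigma2
    return draws

# Toy standardized responses with two missing entries (coded as NaN)
x = np.array([0.3, np.nan, -1.1, 0.8, np.nan, 0.1, -0.4, 1.2])
obs = ~np.isnan(x)

draws = gibbs_normal(x[obs])
mu_hat, sigma2_hat = draws.mean(axis=0)      # estimates = means of the draws

# H imputations of each missing value, in the spirit of Eq (1)
H = 5
rng = np.random.default_rng(1)
imputations = mu_hat + np.sqrt(sigma2_hat) * rng.standard_normal((H, (~obs).sum()))
```

The chain alternates between the two conditional distributions, and the parameter estimates are taken as the averages of the sampled values before the missing entries are drawn.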
Normal Regression Model (NRM)
Let Im be the number of missing genotype measurements in the jth environment for the kth attribute. Under the NRM, the particular standardised column vector xjk (an I×1 vector) containing missing values can be expressed as a partition into two vectors, one containing non-missing values (an (I−Im)×1 vector), and the other containing missing values (an Im×1 vector). The response (or measurement) for one attribute k measured on a particular genotype i in a particular environment j is not independent of the responses for the other attributes measured on that genotype in that environment. Hence, the column vectors (genotype responses) for attribute k in environment j are not independent of the column vectors (genotype responses) for the other attributes k′ (k′ = 1,…,K, k′≠k) in that environment.
We assumed a standardized column vector satisfied the regression model xjk ∼ N(X*βjk, σ²jk I), where βjk (an N×1 vector) is the usual regression coefficient, I is an I×I identity matrix, and X* is the I×N design matrix containing elements from the I genotype responses for the N columns in the two-way wide matrix.
In order to modify the models used with two-way data arrays to take account of the multivariate nature of the measurements on each genotype in each environment, we employed the design matrix X* which could be expressed in four ways (i.e. there were four options for the N columns). In these options, “zero” elements in the matrix X* (or in its partitions, defined shortly) were substituted for missing values. Thus, the missing values do not contribute to the normal regression model. However, there were a few “true” zero values in the data array because of the column standardization conducted prior to the analysis. These zero values do not contribute to the model either, as their original values corresponded to the average environment effect.
The four ways to express the design matrix X* were as follows:
- The first option contained elements from the I genotype responses over the (JK-1) columns, i.e. an I×(JK-1) matrix, where the column corresponding to the particular environment j for the particular attribute k was discarded.
- The second option contained elements from the I genotype responses over the (J-1) environments for each attribute, i.e. an I×K(J-1) matrix, where the columns corresponding to the particular environment j for all attributes k (k = 1,…,K) were discarded.
- The third option was a combination of the independent elements in the K(J-1) columns of the second option and adjusted dependent elements in a further (K−1) columns. The column corresponding to the particular environment j for the particular attribute k was discarded, but the other columns which were discarded in the second option were replaced by those columns multiplied by their corresponding correlation coefficient rkk′(j) between attributes k and k′ (k′ = 1,…,K, k′≠k) for the same environment j, i.e. there were (KJ-1) columns in the design matrix. The correlation coefficients rkk′(j) are described in the correlation matrix R (shown later).
- The fourth option contained elements from the I genotype responses for the same environment over the (K-1) attributes, i.e. an I×(K-1) matrix, where the columns corresponding to the particular environment j with different attributes k′ (k′ = 1,…,K, k′≠k) were retained and all other columns were discarded.
The above design matrices could also be expressed as a partition into two matrices, one containing the (I−Im) rows corresponding to the non-missing values in xjk, and the other containing the Im rows corresponding to the missing values in xjk.
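The four options for the design matrix X* can be sketched from the two-way wide matrix as follows. This is an illustrative sketch: it assumes the wide matrix stores attribute k in environment j at column index k*J + j, codes missing values as NaN before replacing them by zero, and takes the correlations rkk′(j) as a given vector `r`. The function name `design_matrix` and the example values in `r` are hypothetical.

```python
import numpy as np

def design_matrix(W, J, K, j, k, option, r=None):
    """Build X* for the target column (environment j, attribute k).

    W : I x JK wide matrix, column index = attribute * J + environment.
    r : length-K array, r[k2] = correlation between attributes k and k2 in
        environment j (only used by option 3).
    Missing values are replaced by zero so they do not contribute.
    """
    cols = np.arange(J * K)
    env, att = cols % J, cols // J
    target = (env == j) & (att == k)
    same_env = (env == j) & ~target          # other attributes, same environment
    W0 = np.nan_to_num(W, nan=0.0)
    if option == 1:                          # all columns but the target: JK-1
        return W0[:, ~target]
    if option == 2:                          # drop environment j entirely: K(J-1)
        return W0[:, env != j]
    if option == 3:                          # option 2 plus correlation-adjusted columns
        adjusted = W0[:, same_env] * r[att[same_env]]
        return np.hstack([W0[:, env != j], adjusted])
    if option == 4:                          # same environment, other attributes: K-1
        return W0[:, same_env]
    raise ValueError("option must be 1, 2, 3 or 4")

rng = np.random.default_rng(0)
I, J, K = 6, 3, 2
W = rng.normal(size=(I, J * K))
W[2, 4] = np.nan                             # a missing cell in another column
r = np.array([1.0, 0.4])                     # hypothetical correlations for attribute 0

shapes = {opt: design_matrix(W, J, K, j=1, k=0, option=opt, r=r).shape
          for opt in (1, 2, 3, 4)}
```

In this sketch the correlation-adjusted columns of option 3 are appended after the columns of option 2; the column order does not affect the fitted values of the regression.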
Estimation of missing cells.
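When the parameters βjk and σ²jk were determined (explained below), the missing cells in one environment j for one attribute k could be drawn from Eq (2):

$$\mathbf{x}^{\text{mis}}_{jk} = \mathbf{X}^{*}_{\text{mis}}\,\hat{\boldsymbol{\beta}}_{jk} + \hat{\sigma}_{jk}\,\mathbf{z}, \tag{2}$$

where z is a random standard normal vector, and $\mathbf{X}^{*}_{\text{mis}}$ is the corresponding design matrix described above (the rows corresponding to the missing values). For the third way of determining the design matrix X*, the correlation matrix Corj among attributes for any environment j, j = 1,…,J is:

$$\mathrm{Cor}_j = \begin{pmatrix} 1 & r_{12(j)} & \cdots & r_{1K(j)} \\ r_{21(j)} & 1 & \cdots & r_{2K(j)} \\ \vdots & \vdots & \ddots & \vdots \\ r_{K1(j)} & r_{K2(j)} & \cdots & 1 \end{pmatrix}.$$

We took the distinct upper off-diagonal elements and wrote them as a K(K-1)/2 row vector $\mathbf{r}_j = (r_{12(j)}, r_{13(j)}, \ldots, r_{(K-1)K(j)})$. Then we combined all such vectors for each of the J environments to obtain a J×K(K-1)/2 full correlation matrix R, whose jth row is $\mathbf{r}_j$. In the row for environment j′, the entries involving attribute k′ are NA, which indicates that the correlation coefficients between attribute k′ and the other attributes for environment j′ are not available (i.e. the genotype responses for environment j′ for attribute k′ are missing).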
Estimation of missing columns.
When there is a missing column (xj′k′, an I×1 vector), there are no observations for that particular attribute in that particular environment. It is impossible to estimate parameters βj′k′ and σ²j′k′ from the non-missing values in this column (as there are no non-missing values). Therefore, we propose using the correlation coefficients to estimate the missing column.
However, as shown in the correlation matrix R, the correlation coefficients between attribute k′ and the other attributes for environment j′ were not available. Thus, these correlation coefficients were replaced by the average correlation (taken over the environments j≠j′) between attribute k′ and the other attributes over the (J-1) environments. The average correlation between attribute k′ and the other attributes then acted as the linear regression coefficient for this particular environment j′. Therefore, the particular missing column was estimated as follows:

$$\mathbf{x}_{j'k'} = \sum_{k \neq k'} \bar{r}_{k'k}\,\mathbf{x}_{j'k} + \bar{\sigma}\,\mathbf{z}, \tag{3}$$

where $\bar{r}_{k'k}$ is the average correlation between attributes k′ and k, z is a random standard normal vector, and $\bar{\sigma}$ is the average value of the σjk from the other non-missing environments for each attribute (i.e. K(J-1)).
Alternatively, the missing correlation coefficients in the matrix R could be considered as some of the elements in a response vector (the correlation coefficients for environment j′), which is assumed to be linearly related to the respective correlation coefficients for the different environments j (j = 1,…,J, j≠j′) in R, with regression coefficients βjj′. These coefficients could be obtained by maximum likelihood estimation (MLE), and then the (K-1) missing correlation coefficients (rk′1(j′),…,rk′K(j′), k′ = 1,…,K, k′≠k) were estimated by linear regression. The particular missing column was then estimated by Eq (4).
Using the above process, the estimation of values in the missing columns (corresponding to a particular environment-attribute combination) is based on measurements of the observed attributes in the same environment and they are not independent of those other attributes measured on the genotypes in that environment. On the other hand, the estimation of values for single missing cells in particular columns is based on one of four ways of combining other observations, and the optimum combination is determined by the accuracy of estimation performance.
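To obtain estimates of parameters βjk and σ²jk in the above, both Bayesian analysis and non-Bayesian analysis were used.
To conduct Bayesian analysis, we assumed that the prior distribution of βjk was normal, βjk ∼ N(β0, Vβ), and that the prior distribution of σ²jk was inverse-gamma, σ²jk ∼ IG(ν0/2, S0/2). We followed Gelman in assuming that Vβ = τ²I, and the hyper-parameters β0, ν0, S0 and τ² were fixed and known. As the sample variance of each observed column vector is 1, ν0 and S0 were set to 4 and 2, respectively (giving shape parameter ν0/2 = 2 and scale parameter S0/2 = 1). Thus the mean of σ²jk under the inverse-gamma prior distribution is 1 (= (S0/2)/(ν0/2 − 1)). In addition, the sample mean of each observed column vector is zero; hence, the mean of βjk under the normal prior distribution is 0 (= β0).
Therefore, writing $\mathbf{x}^{\text{obs}}_{jk}$ for the (I−Im) non-missing values of xjk and $\mathbf{X}^{*}_{\text{obs}}$ for the corresponding rows of X*, the full conditional distributions are

$$\boldsymbol{\beta}_{jk} \mid \sigma^{2}_{jk}, \mathbf{x}^{\text{obs}}_{jk} \sim N\left(\mathbf{m}, \mathbf{V}\right), \quad \mathbf{V} = \left(\frac{\mathbf{X}^{*\top}_{\text{obs}}\mathbf{X}^{*}_{\text{obs}}}{\sigma^{2}_{jk}} + \frac{\mathbf{I}_N}{\tau^{2}}\right)^{-1}, \quad \mathbf{m} = \mathbf{V}\left(\frac{\mathbf{X}^{*\top}_{\text{obs}}\mathbf{x}^{\text{obs}}_{jk}}{\sigma^{2}_{jk}} + \frac{\boldsymbol{\beta}_0}{\tau^{2}}\right),$$

and

$$\sigma^{2}_{jk} \mid \boldsymbol{\beta}_{jk}, \mathbf{x}^{\text{obs}}_{jk} \sim IG\left(\frac{\nu_0 + I - I_m}{2}, \frac{S_0 + (\mathbf{x}^{\text{obs}}_{jk} - \mathbf{X}^{*}_{\text{obs}}\boldsymbol{\beta}_{jk})^{\top}(\mathbf{x}^{\text{obs}}_{jk} - \mathbf{X}^{*}_{\text{obs}}\boldsymbol{\beta}_{jk})}{2}\right).$$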
Thus, the full distribution can be obtained using Gibbs sampling.
From the above equations, when the prior distributions of parameters βjk and σ²jk were set up as the normal and inverse-gamma distributions, respectively, their corresponding conditional posterior distributions were also normal and inverse-gamma, respectively. After the full distributions of parameters βjk and σ²jk were obtained, the estimates of βjk and σ²jk were obtained by generating a large number of samples (here 5500) from those distributions and setting the estimates equal to their mean values, i.e.:

$$\hat{\boldsymbol{\beta}}_{jk} = E\left(\boldsymbol{\beta}_{jk} \mid \mathbf{x}_{jk}\right), \qquad \hat{\sigma}^{2}_{jk} = E\left(\sigma^{2}_{jk} \mid \mathbf{x}_{jk}\right).$$
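A Gibbs sampler for the normal regression model with the normal and inverse-gamma priors described above can be sketched as follows. This is a minimal sketch under assumptions: the full conditionals are the standard conjugate forms for these priors, τ² = 1 is an arbitrary choice because its value is not stated, and the function name `gibbs_nrm` and the toy data are hypothetical.

```python
import numpy as np

def gibbs_nrm(X_obs, x_obs, beta0=0.0, tau2=1.0, nu0=4.0, S0=2.0,
              n_draws=5500, seed=0):
    """Gibbs sampler for x_obs ~ N(X_obs @ beta, sigma2 * I).

    Priors: beta ~ N(beta0, tau2 * I), sigma2 ~ Inverse-Gamma(nu0/2, S0/2).
    Returns the means of the sampled beta and sigma2 values.
    """
    rng = np.random.default_rng(seed)
    n, p = X_obs.shape
    XtX, Xtx = X_obs.T @ X_obs, X_obs.T @ x_obs
    sigma2 = 1.0                                   # starting value
    beta_sum, sigma2_sum = np.zeros(p), 0.0
    for _ in range(n_draws):
        # beta | sigma2, x : normal with covariance V and mean m
        V = np.linalg.inv(XtX / sigma2 + np.eye(p) / tau2)
        m = V @ (Xtx / sigma2 + beta0 / tau2)
        beta = m + np.linalg.cholesky(V) @ rng.standard_normal(p)
        # sigma2 | beta, x : inverse-gamma, drawn as scale / gamma(shape)
        resid = x_obs - X_obs @ beta
        sigma2 = 0.5 * (S0 + resid @ resid) / rng.gamma(0.5 * (nu0 + n))
        beta_sum += beta
        sigma2_sum += sigma2
    return beta_sum / n_draws, sigma2_sum / n_draws

# Toy example: 40 genotypes, 2 design columns, first 5 responses missing
rng = np.random.default_rng(1)
X_star = rng.normal(size=(40, 2))
x = X_star @ np.array([1.0, -0.5]) + 0.3 * rng.standard_normal(40)
miss = np.zeros(40, dtype=bool)
miss[:5] = True

beta_hat, sigma2_hat = gibbs_nrm(X_star[~miss], x[~miss])

# Draw the missing cells in the spirit of Eq (2)
x_imputed = X_star[miss] @ beta_hat + np.sqrt(sigma2_hat) * rng.standard_normal(miss.sum())
```

Only the rows with non-missing responses enter the sampler; the rows for the missing responses are used afterwards to draw the imputed values.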
Predictive Mean Matching (PMM)
Rubin proposed a statistical matching method for univariate nonresponse data while Little developed and modified the method for multivariate nonresponse data, calling it predictive mean matching, where the respondent genotype vector (an I0×1 vector, I0<I) satisfied a regression model. The parameters in the model (i.e. the regression coefficients β and residual variance σ²) were determined by non-missing values. The predicted values of respondent genotype vectors, including non-missing and missing values, were obtained from the regression model. All predicted values for non-missing values were compared with the predicted values for the missing values by a distance function. The C (C≪I0, e.g. 1, 2 or 3; say 3) predicted values closest to a predicted missing observation identified the C actual non-missing values that could be used to estimate the missing value. One of these C values was chosen at random for each such missing value.
Estimation of missing cells.
The regression model we used was again the normal regression model (NRM), xjk ∼ N(X*βjk, σ²jk I). Here, the design matrix employed one of the four options described above, i.e. the one determined to have the most accurate estimation performance. For each imputation h, we drew a bootstrap sample of I0 observations (I0≤(I-Im)) from the (I-Im) non-missing values in the vector xjk and put these values into a new vector, which also contained the missing values, so it was of size (I0+Im). It contained two components: the bootstrapped non-missing values, of size I0×1, and the missing values, of size Im×1. The estimators of βjk and σ²jk were obtained using the same procedure as we described in the NRM imputation above but with a design matrix which has (I0+Im) rows.
Each of the predicted values for the missing component was compared with each of the predicted values for the non-missing component using the following distance function:

$$\delta(i, i') = \left|\hat{x}_{ijk} - \hat{x}_{i'jk}\right|,$$

where $\hat{x}_{ijk}$ (i = 1,…,I0) are the predicted values for the non-missing component and $\hat{x}_{i'jk}$ are the predicted values for the missing component. For each missing cell i′ (∀i′, j, k), δ(i, i′) takes I0 different values (i = 1,…,I0), and we wanted to obtain the C smallest values from them. Then the estimate of each missing cell was randomly selected as one of the C corresponding actual values within the non-missing component for each imputation.
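Predictive mean matching for the missing cells of one column can be sketched as follows. This is a simplified sketch under assumptions: the regression is fitted by ordinary least squares (standing in for either estimation route), the bootstrap sample size I0 is taken equal to the number of non-missing values, and the distance between predicted values is the absolute difference. The function name `pmm_impute` and the toy data are hypothetical.

```python
import numpy as np

def pmm_impute(X_star, x, C=3, rng=None):
    """One PMM imputation of the NaN entries of x given design matrix X_star."""
    rng = np.random.default_rng() if rng is None else rng
    obs = ~np.isnan(x)
    # Bootstrap sample of the non-missing observations (assumed I0 = I - Im)
    idx = rng.choice(np.flatnonzero(obs), size=obs.sum(), replace=True)
    beta, *_ = np.linalg.lstsq(X_star[idx], x[idx], rcond=None)
    pred = X_star @ beta                      # predicted values for all rows
    x_imp = x.copy()
    for i in np.flatnonzero(~obs):
        dist = np.abs(pred[idx] - pred[i])    # distance between predicted values
        nearest = idx[np.argsort(dist)[:C]]   # the C closest donors
        x_imp[i] = x[rng.choice(nearest)]     # actual value of a random donor
    return x_imp

rng = np.random.default_rng(0)
X_star = rng.normal(size=(30, 2))
x = X_star @ np.array([0.8, -0.3]) + 0.2 * rng.standard_normal(30)
missing_rows = [2, 11, 17]
x[missing_rows] = np.nan

x_imp = pmm_impute(X_star, x, C=3, rng=rng)
```

Because the imputed value is always an actual observed value from a donor, repeated calls give H imputations that stay within the range of the observed data.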
Estimation of missing columns.
For a missing column (environment j′ in which attribute k′ was not measured), the predicted values of the observations in this missing column are related to the observations within the same environment j′ measured for the other attributes k (k = 1,…,K, k≠k′). The prediction function for these observations uses the attributes in Nk, where Nk is a uniform random sample from the K attributes, and the corresponding regression coefficient is an Nk(J-1)×1 vector. The estimators of the regression coefficients and the residual variance were computed in the same way as for the NRM imputation above, using Gibbs sampling and MLEs.
As the predicted values of the observations within such a whole missing column were obtained from this prediction function, the distance was calculated between the elements of the predicted missing column and the predicted values of the observed columns. Thus the estimate of each of the missing values in the missing column was randomly drawn from one of the C closest corresponding actual values in the observed columns xj′k (k∈Nk, k≠k′), where these C estimates were determined from the C smallest values of the distance.
Comparison of methods
For each MI method discussed above, there were H imputed complete datasets, called “estimated data arrays”, containing the observed values in the incomplete data array and the imputed estimates of the missing values in that array. The overall or final “estimated data array” was obtained by averaging the cells in the H imputed data arrays. It could also be determined by averaging the H estimates of the individual missing values and putting them into the incomplete data array to form a complete array. Comparisons between the original data arrays and the “estimated data arrays” were conducted using the normalized root mean square error (NRMSE), computed from the standardized elements of the original MET data array and the standardized elements of the “estimated data array”. All elements were included because the standardized non-missing elements in the “estimated data array” are different from the corresponding standardized non-missing elements in the original MET data array. By considering each element of the data array in the NRMSE computation, we investigated both the influence of column standardisation prior to the analysis and the estimation performance. The missing values could also be estimated using the EM algorithm [9, 20, 21] in three-mode ordination, referred to as the Tucker3 model [9, 22]. This method of estimating the missing values is defined as single imputation. We therefore needed to compare the techniques we are proposing for estimating missing values (estimates for missing cells and estimates for a missing column) with those generated by the Tucker3 model. The NRMSE values for each MI method and the EM algorithm were compared to determine which method was more accurate for estimating missing values.
In the above, we considered both missing cells and missing columns simultaneously. However, we could consider the two patterns of missing values separately, i.e. missing cells alone and a missing column alone. We did that by comparing the estimation performance in each case using the following criterion to test the efficiency of multiple imputation.
There were H different imputed datasets for missing values in the full “estimated data arrays”. The corresponding imputed data values for the missing values, $\hat{Q}_h$, and the variance of all missing values for each imputation h, Uh, were obtained. For MI analysis, there are two types of variance [5, 6, 9, 23]. One is called the within-imputation variance and defined by

$$\bar{U} = \frac{1}{H}\sum_{h=1}^{H} U_h.$$

The other is referred to as the between-imputation variance and defined by

$$B = \frac{1}{H-1}\sum_{h=1}^{H}\left(\hat{Q}_h - \bar{Q}\right)^{2}, \qquad \text{where } \bar{Q} = \frac{1}{H}\sum_{h=1}^{H}\hat{Q}_h.$$

Then the total variance associated with the overall estimate $\bar{Q}$ is

$$T = \bar{U} + \left(1 + \frac{1}{H}\right)B.$$

Because we assessed these MI methods using complete data arrays from which we discarded values (those designated as “missing”), we know the original “true” or “actual” values of the missing values. Each element of the difference between the “true” values of the missing values and the overall estimate, divided by its overall standard deviation (the square root of the total variance), has an approximate t distribution with the degrees of freedom given in [5, 24]. For such a t distribution, we could calculate the 95% confidence interval (CI) for the “actual” values, and then determine the percentage of these CIs which contained their corresponding “actual” values (of the total number of estimated values). This is referred to as Coverage CI. The higher the value of Coverage CI, the more efficient the MI method is judged to be.
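The pooling of the H imputations and the Coverage CI criterion can be sketched as follows. This is an illustrative sketch: it assumes Rubin's classical combining rules, including the classical degrees-of-freedom formula (H−1)[1 + Ū/((1+1/H)B)]², which may differ from the exact variant used in the cited references. The function names `pool` and `coverage_ci` and the simulated values are hypothetical.

```python
import numpy as np
from scipy import stats

def pool(Q_hat, U):
    """Combine H imputations.

    Q_hat : (H, M) imputed values for M missing entries.
    U     : (H,) variance of the imputed values within each imputation.
    Returns the overall estimate, total variance and degrees of freedom.
    """
    H = Q_hat.shape[0]
    Q_bar = Q_hat.mean(axis=0)                    # overall estimate
    U_bar = np.mean(U, axis=0)                    # within-imputation variance
    B = Q_hat.var(axis=0, ddof=1)                 # between-imputation variance
    T = U_bar + (1 + 1 / H) * B                   # total variance
    df = (H - 1) * (1 + U_bar / ((1 + 1 / H) * B)) ** 2   # assumes B > 0
    return Q_bar, T, df

def coverage_ci(Q_hat, U, truth, level=0.95):
    """Percentage of CIs around the overall estimate that contain the truth."""
    Q_bar, T, df = pool(Q_hat, U)
    half_width = stats.t.ppf(0.5 + level / 2, df) * np.sqrt(T)
    return 100.0 * np.mean(np.abs(truth - Q_bar) <= half_width)

# Toy check: M = 200 "true" deleted values and H = 5 noisy imputations of each
rng = np.random.default_rng(0)
truth = rng.normal(size=200)
Q_hat = truth + 0.3 * rng.normal(size=(5, 200))
U = Q_hat.var(axis=1, ddof=1)

cov = coverage_ci(Q_hat, U, truth)
```

A small hand-checkable case: with H = 2, imputed values 1 and 3, and U = (2, 2), the overall estimate is 2, B = 2, T = 2 + 1.5 × 2 = 5 and the degrees of freedom are (1 + 2/3)² = 25/9.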
It was useful to repeat the MI process, as a different selection of cells would be designated as missing each time. We arbitrarily chose 10 repetitions and subsequently calculated a mean Coverage CI and standard error of that mean for each MI method.
MI approaches were considered in two ways, i.e. the same estimation procedure for missing cells and a missing column, and different estimation procedures for missing cells and a missing column. Thus, we divided our investigation of missing values for each repetition into missing cells alone, missing column alone, and combined (equivalent to all missing values). For missing cells alone and all missing values, we considered 5%, 10%, 15%, 20% and 25% missing values. This enabled us to evaluate the estimation performance of the MI approaches for each of these situations. For a missing column alone, we either deleted each environment for each attribute or deleted 10 distinct random environments for each attribute.
Firstly, the estimation procedure for missing cells alone was conducted for each MI approach, to give 100 imputed values for each missing cell. These 100 values were randomly allocated into 5 sets of 20 values, and averaging within each set of 20 imputed values (for each percentage of missing cells and each repetition) gave the final 5 (H) imputed values for each missing cell. Based on those 5 imputed values for each missing cell, we calculated the 95% confidence interval for the “true” value. Then we obtained the percentage of all estimated missing cells where the 95% CI contained the “actual” value of the missing cell (Coverage CI). As the MI process was repeated 10 times, we calculated the mean Coverage CI and its standard error. Note that the percentage of missing cells for each repetition was slightly less than the percentage of missing values being quoted, as there was a designated missing column (in the incomplete two-way wide array) whose cells were not included here.
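The reduction from 100 imputed values to H = 5 averaged values per missing cell might look like the sketch below. The function name `reduce_to_h_values`, the seed handling, and the toy input are assumptions made for illustration.

```python
import numpy as np

def reduce_to_h_values(imputed, n_sets=5, seed=0):
    """Randomly split the draws (rows) into n_sets equal groups and average each group.

    imputed has shape (n_draws, n_cells), e.g. (100, number of missing cells).
    Returns an array of shape (n_sets, n_cells).
    """
    rng = np.random.default_rng(seed)
    imputed = np.asarray(imputed, dtype=float)
    order = rng.permutation(imputed.shape[0])   # random allocation of draws to sets
    groups = np.array_split(order, n_sets)      # 5 sets of 20 when n_draws = 100
    return np.stack([imputed[g].mean(axis=0) for g in groups])

# 100 imputed values for a single missing cell.
imputed = np.arange(100, dtype=float).reshape(100, 1)
h_values = reduce_to_h_values(imputed)
print(h_values.shape)  # (5, 1)
```

Because the five sets have equal size, the mean of the five averaged values equals the mean of all 100 imputed values.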
Secondly, a missing column (corresponding to an attribute not measured on any genotype in a particular environment) was included in each percentage of missing values for each repetition. As it was randomly specified, it could be different across repetitions. We decided to investigate the variability of missing columns in the estimation procedure by carefully considering the choice of columns across replicates. For the small datasets (i.e. when the number of environments was 10 or less) each column was sequentially selected as the missing column (environment) for each attribute for each imputation in turn. The 100 imputed values of each missing data value in the missing column were obtained for each replicate. Again, 5 average imputed values (obtained by dividing the 100 values randomly into 5 sets of 20) of each missing value in the missing column were used to calculate the 95% Coverage CI. The repetitions gave J Coverage CIs (for each attribute) from which a mean and standard error were obtained. For the larger datasets (i.e. when the number of environments was greater than 10), we did not sequentially select each column (environment) to be missing in the replicates. Instead, we randomly selected 10 environments without replacement to be the missing column (environment) for each attribute for each imputation in turn. Then the same calculation was applied to obtain the 95% Coverage CI for each replicate. This gave 10 Coverage CIs (for each attribute) from which a mean and standard error were obtained.
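The rule for choosing which environments act as the missing column can be sketched as below. The function name `choose_missing_environments`, the zero-based indexing, and the seed are illustrative assumptions.

```python
import random

def choose_missing_environments(n_env, max_reps=10, seed=0):
    """Return the environment indices to delete in turn for one attribute.

    Small datasets (n_env <= max_reps): every environment, in sequence.
    Larger datasets: max_reps environments drawn at random without replacement.
    """
    if n_env <= max_reps:
        return list(range(n_env))
    return random.Random(seed).sample(range(n_env), max_reps)

small = choose_missing_environments(8)    # e.g. a dataset with 8 environments
large = choose_missing_environments(31)   # e.g. a dataset with 31 environments
print(small)
print(sorted(large))
```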
Finally, we considered all missing values (missing cells and a missing column simultaneously). We obtained 100 imputed values of each missing value (whether it corresponded to a missing cell or to a cell in a missing column) for each percentage of missing values for each repetition of the MI process. To compare with the EM algorithm for estimating missing values, we took the average over the 100 imputed values of each missing value to form the final “estimated data array” for each percentage of missing values for each of 10 repetitions. We already had these final “estimated data arrays” for each of 10 replicates for the EM algorithm. Then the estimation performances of the different MI procedures and the EM algorithm were compared using the normalized root mean square error (NRMSE) criterion for each of the 10 replicates. These were presented as box plots for each method for each percentage of missing values.
MET data arrays
We employed two real complete multivariate MET datasets and four simulated multivariate MET datasets. The first real multivariate MET dataset, described by Basford and Tukey [26], consisted of 58 soybean lines evaluated in 4 sites in Queensland, Australia, in each of 2 years (denoted as 8 environments) for 6 attributes, i.e. a 58×8×6 data array. The second was from the 2014 CIMMYT wheat breeding program. It contained 50 wheat lines grown in 31 environments with measurements on 4 attributes (i.e. a 50×31×4 data array). These were denoted as Datasets 1 and 2, respectively. The other four datasets were simulated to be of various sizes. They were based on other trials in the CIMMYT wheat breeding program, the first of size (60×10×6) and the second of size (80×15×6), denoted as Datasets 3 and 4, respectively, and maize trials from a commercial company, the third of size (100×20×5) and the fourth of size (120×60×4), denoted as Datasets 5 and 6, respectively.
For simulated MET datasets, we computed the variance components for each random effect within each attribute from an analysis using the mixed linear model for three-way three-mode real MET datasets. The values of the variance components were estimated using restricted maximum likelihood [27, 28] as implemented in the ASReml software [29]. Once the estimated variance components for genotype, environment, genotype×environment, and residuals were obtained, the data values were generated as random draws from a multivariate normal distribution with mean zero and the corresponding estimated variance-covariance matrix from the real multivariate MET data (with only genotypes having non-zero off-diagonal terms). These values were then used to form the full simulated datasets.
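One way to read this simulation step is sketched below. It assumes an additive model in which genotype effects are correlated across attributes while the environment, genotype×environment, and residual effects are independent, which is our interpretation of "only genotypes having non-zero off-diagonal terms". The function name `simulate_met` and all numeric values are hypothetical and are not the variance components estimated in the study.

```python
import numpy as np

def simulate_met(n_gen, n_env, g_cov, var_env, var_ge, var_res, seed=0):
    """Simulate a genotype x environment x attribute array with mean zero.

    g_cov: (n_att, n_att) genotype covariance matrix across attributes.
    var_env, var_ge, var_res: per-attribute variance components, length n_att.
    """
    rng = np.random.default_rng(seed)
    g_cov = np.asarray(g_cov, dtype=float)
    n_att = g_cov.shape[0]
    # Genotype effects: multivariate normal, correlated across attributes.
    g = rng.multivariate_normal(np.zeros(n_att), g_cov, size=n_gen)
    # Remaining effects: independent normals with attribute-specific variances.
    e = rng.normal(0.0, np.sqrt(var_env), size=(n_env, n_att))
    ge = rng.normal(0.0, np.sqrt(var_ge), size=(n_gen, n_env, n_att))
    res = rng.normal(0.0, np.sqrt(var_res), size=(n_gen, n_env, n_att))
    return g[:, None, :] + e[None, :, :] + ge + res

# Toy example with 6 genotypes, 4 environments, and 2 attributes.
data = simulate_met(
    n_gen=6, n_env=4,
    g_cov=[[1.0, 0.3], [0.3, 2.0]],
    var_env=[0.5, 0.8], var_ge=[0.2, 0.4], var_res=[0.1, 0.1],
)
print(data.shape)  # (6, 4, 2)
```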
The results for missing cells, a missing column, and overall missing values will be discussed in turn.
Firstly, Table 1 contains the mean and standard error of Coverage CI (calculated over the 10 repetitions) for each MI estimation method for each percentage of randomly generated missing cells. Note that the percentage of missing cells is less than the percentage of missing values being quoted: only missing cells were considered here, not the missing column, whereas the percentages in the table count all missing values, including those in the missing column.
As expected, the MAHC imputation was efficient, as the average coverage rates and their standard errors were good (means 74% or higher, standard errors less than 3.7%) for the real datasets (Datasets 1 and 2). For the simulated datasets (Datasets 3 to 6), the average coverage rates of MAHC imputation were similar to or slightly smaller than those for Datasets 1 and 2 (means over 71%), but the corresponding standard errors were somewhat higher (standard errors less than 4%) than those for the real datasets.
For the other imputation methods (NORM, NRM, and PMM) based on both analyses (Bayesian and non-Bayesian), the average coverage rates were lower and their standard errors were higher than those for MAHC at every percentage of missing cells for each dataset. Overall, both forms of NORM imputation had relatively higher coverage rates than those for the other two imputations, especially for larger datasets. The next best was PMM imputation.
On average, the results of Bayesian analysis of NORM imputations had slightly larger coverage rates than those for non-Bayesian analysis, but there was not much difference in coverage rates between Bayesian and non-Bayesian analysis for NRM and PMM imputations. The standard errors of the means for Bayesian and non-Bayesian analysis for the three imputation methods were quite similar at every percentage of missing cells. Also, the values of the standard errors of the coverage rates did not differ across the percentages of missing cells for most datasets.
Secondly, each environment was sequentially designated as the missing column for each attribute for the smaller datasets, while each of 10 random environments (chosen without replacement) was designated as the missing column for each attribute for the larger datasets. Table 2 presents the mean and standard error of Coverage CI for each attribute and each estimation method, calculated over each environment missing in turn for the smaller datasets and over the 10 randomly chosen environments for the larger datasets. The number of repetitions used to calculate average CI coverage was 8 for Dataset 1 and 10 for Dataset 3 (every environment in turn), and 10 for Datasets 2, 4, 5 and 6 (randomly chosen environments). Therefore, Table 2 shows the average estimation performance for a missing column for each attribute for each dataset. The estimation procedures to estimate a missing column (environment) were the same as those to estimate missing cells for MAHC and both analyses of NORM imputation. For NRM and PMM imputations, we applied two relationships, “average” correlation coefficients and “linear” correlation coefficients, to estimate a missing column. The average Coverage CI of a missing column was expected to be similar across attributes for each dataset.
The MAHC imputation was very efficient as the average CI coverage rates were good (above 80%) for each dataset (Table 2). This was especially so for Dataset 1 where the average CI coverage rates were close to 85% for some attributes. In general, the average CI coverage rates for NORM imputation for both analyses were the next largest, followed by the other two imputations (NRM and PMM). The results of Bayesian analysis of NORM imputations had slightly larger average CI coverage values than those for non-Bayesian analysis. The standard errors of average CI coverage rates for MAHC imputation were smallest for a missing column for each attribute compared with other imputation methods. The standard errors for NORM imputation for Bayesian analysis were higher than those for non-Bayesian analysis for some attributes and lower for others. For NRM and PMM imputations, average correlation analysis had larger average CI coverage values and lower standard errors than those from linear correlation analysis.
In general, the average coverage rates for both analyses of NORM imputation were relatively larger than those for NRM and PMM imputations. Again, the results of Bayesian analysis of NORM imputations had slightly larger average coverage rates than those from non-Bayesian analysis. The results of average correlation analysis of NRM and PMM imputations also had slightly larger average coverage rates and lower standard errors than those from linear correlation analysis.
Using the linear correlation analysis to calculate the missing column gave slightly lower coverage rates than using the average correlation analysis (Table 2). Therefore, we subsequently used the average correlation analysis to compute the missing column for NRM and PMM imputation methods, in conjunction with Bayesian and non-Bayesian analysis to estimate missing cells for these two methods.
Consequently, for missing values overall, we compared MAHC imputation, NORM imputation with Bayesian and non-Bayesian analysis, and NRM and PMM imputation with Bayesian and non-Bayesian analysis, where the estimation of a missing column was via the average correlation analysis. For each MI approach and the EM algorithm, and for each percentage of missing values (including missing cells and a missing column), we obtained final estimated data values (by averaging over the 100 imputed estimates for each missing value) for each of 10 repetitions. The NRMSE values, which assess the accuracy of the MI approaches and the EM method from the differences between the original “true” values and the estimated values, were computed for each of the 10 replications, each percentage of missing values, and each of Datasets 1 to 6, and were displayed as boxplots with the addition of the mean value (Figs 3–5). Low values of NRMSE correspond to better performance.
Fig 3. Comparison of the estimated missing values (including missing cells and a missing column) using MAHC, NORM-BA, NORM-NBA, NRM-BA, NRM-NBA, PMM-BA, PMM-NBA and EM methods with the “true” values for Datasets 1 and 2 for each percentage (5%, 10%, 15%, 20% and 25%) of missing data.
Fig 4. Comparison of the estimated missing values (including missing cells and a missing column) using MAHC, NORM-BA, NORM-NBA, NRM-BA, NRM-NBA, PMM-BA, PMM-NBA and EM methods with the “true” values for Datasets 3 and 4 for each percentage (5%, 10%, 15%, 20% and 25%) of missing data.
Fig 5. Comparison of the estimated missing values (including missing cells and a missing column) using MAHC, NORM-BA, NORM-NBA, NRM-BA, NRM-NBA, PMM-BA, PMM-NBA and EM methods with the “true” values for Datasets 5 and 6 for each percentage (5%, 10%, 15%, 20% and 25%) of missing data.
For Datasets 1 and 2 (the real datasets), the NRMSE values for EM, NRM-NBA, and NRM-BA imputation were comparatively larger than those for the other imputations at every percentage of missing values (Fig 3). Overall, MAHC, NORM-BA and NORM-NBA imputations had relatively lower values of NRMSE, and so performed better.
For Datasets 3 and 4 (of size 60×10×6 and 80×15×6, respectively), the NRMSE values for EM, NRM-NBA, and NRM-BA imputations were relatively higher than those for the other methods (Fig 4). Generally, MAHC imputation had much better performance than the other imputation methods for these two datasets. The variability of NRMSE values for EM imputation for Dataset 3 was much larger than for the other imputation methods. Overall, the variability of NRMSE values for Dataset 4 was smaller than for Dataset 3.
For Datasets 5 and 6 (of size 100×20×5 and 120×60×4, respectively), the NRMSE values for EM, NRM-NBA, and NRM-BA imputations were relatively higher than those for the other imputations (Fig 5). MAHC imputation had much better performance than the other imputation methods. The variability of NRMSE for these two datasets was much smaller than for the other datasets.
The overall best performance for estimating randomly generated missing values (including both missing cells and a missing column) for these six datasets was MAHC imputation, followed by NORM imputation for both Bayesian and non-Bayesian analysis. The number of missing values in relation to the size of the data array had an impact on estimation accuracy, with larger values of NRMSE corresponding to larger percentages of missing data. However, the variability of the NRMSE values was smaller when the size of the dataset increased.
We investigated one multiple hierarchical clustering method (MAHC imputation) and three multiple imputation approaches (NORM, NRM and PMM) with and without Bayesian analysis for estimating missing values in three-way three-mode MET datasets.
To conduct Bayesian analysis using Gibbs sampling, we needed to assume some prior distributions for the parameters in the normal distribution model (NORM) and normal regression model (NRM). Also, we needed to set up some hyper-parameters in the prior distribution for NRM imputation. Thus, compared with NORM imputation, NRM imputation introduced more uncertainty and produced lower CI coverage rates and higher average values of NRMSE.
For the PMM imputation with and without Bayesian analysis, the estimated missing values were actual non-missing values from another cell plus a random standard normal term multiplied by an appropriate variance, where these actual non-missing values were those for which their predicted values were closest to the predicted missing values. The predicted values of non-missing values were obtained using the NRM model. The closest distance between predicted values of each cell based on the NRM model may differ from the closest distance between actual values of each cell. When that was the case, PMM imputation had lower coverage rates and higher average values of NRMSE than NORM imputation.
During the implementation of MAHC imputation, we employed different combinations of attributes for each imputation. Thus, the final estimated values using MAHC imputation were based on the average of the corresponding values from the same genotype in the environment (or environment group) deemed most like that particular environment when the “similarity” assessment used different combinations of attributes. It took advantage of multivariate measurements by combining the attributes in various ways in the estimation procedure. Another advantage was that there was no other uncertainty in MAHC imputation. The three other imputation approaches required the estimation of parameters in their underlying models. As a result, it is probably not surprising that MAHC imputation performed best.
For the three imputation models (NORM, NRM, and PMM), imputations with Bayesian analysis had slightly higher accuracy than the corresponding imputation models without Bayesian analysis, but they took more computer time to implement. In all comparisons with these imputations, MAHC imputation had the smallest implementation time and the highest accuracy.
For estimating a missing column, we made two different assumptions for the correlation among attributes for each environment. The correlation coefficients for the missing environment for a particular attribute were not available. Therefore, we decided that the missing correlation coefficients could be the average value of non-missing correlation coefficients or a linear combination of non-missing correlation coefficients. However, according to the results of our investigation, the first assumption (based on the average correlation coefficients) had better estimation performance than the second assumption (based on a linear combination of correlation coefficients). The linear combination of correlation coefficients was calculated using all other non-missing correlation coefficients among the attributes, while the average value of correlation coefficients was calculated from other non-missing correlation coefficients with this particular attribute. This is likely to be the reason that the average value introduced less uncertainty and produced higher CI coverage.
Multivariate multi-environment trial datasets are the focus of our research. As many of these are incomplete, it is important to estimate the missing values to form a complete array that can then be analysed by three-way pattern analysis methodology which has proven valuable for that situation. However, the above imputation methods could apply to other types of datasets where various measurements are made on the same entities under different conditions.
S1 Datasets. Datasets used in this study.
This compressed file contains six directories for each of six datasets. A description of each text file is included in the corresponding directory.
S1 File. R source code for multiple imputation approaches.
Conceived and designed the experiments: TT KB. Performed the experiments: TT. Analyzed the data: TT. Contributed reagents/materials/analysis tools: TT GM MD KB. Wrote the paper: TT KB.
- 1. Kroonenberg PM, Basford KE. An investigation of multi-attribute genotype response across environments using three-mode principal component analysis. Euphytica. 1989;44(1–2):109–23.
- 2. Kroonenberg PM, Basford KE, Ebskamp AGM. Three-way cluster and component analysis of maize variety trials. Euphytica. 1995;84(1):31–42.
- 3. Little RJ, Rubin DB. Statistical analysis with missing data. New York: Wiley; 1987.
- 4. Allison PD. Missing data. Sage Publications; 2001.
- 5. Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.
- 6. Schafer JL. Analysis of incomplete multivariate data. London: Chapman & Hall; 1997.
- 7. Kroonenberg PM, van Ginkel JR. Combination rules for multiple imputation in three-way analysis illustrated with chromatography data. Curr Anal Chem. 2012;8(2):224–35.
- 8. Kiers HAL. Towards a standardized notation and terminology in multiway analysis. Journal of Chemometrics. 2000;14(3):105–22.
- 9. Kroonenberg PM. Applied multiway data analysis. Hoboken, New Jersey: John Wiley and Sons, Inc.; 2008.
- 10. Basford KE, Kroonenberg PM, Delacy IH, Lawrence PK. Multiattribute evaluation of regional cotton variety trials. Theoretical and Applied Genetics. 1990;79(2):225–34.
- 11. de la Vega AJ, Hall AJ, Kroonenberg PM. Investigating the physiological bases of predictable and unpredictable genotype by environment interactions using three-mode pattern analysis. Field Crops Research. 2002;78(2–3):165–83.
- 12. Rubin DB. Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics. 1986;4(1):87–94.
- 13. Little RJA. Missing-data adjustments in large surveys. Journal of Business and Economic Statistics. 1988;6(3):287–96.
- 14. Van Buuren S. Flexible imputation of missing data. CRC Press; 2012.
- 15. Kim C-J, Nelson CR. State-space models with regime switching: classical and Gibbs-sampling approaches with applications. Cambridge: MIT Press; 1999.
- 16. Ward JH. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association. 1963;58(301):236–44.
- 17. Box GE, Tiao GC. Bayesian inference in statistical analysis. John Wiley & Sons; 2011.
- 18. Gelman A. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis. 2006;1(3):515–34.
- 19. Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S. A Bayesian missing value estimation method for gene expression profile data. Bioinformatics. 2003;19(16):2088–96.
- 20. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B (Methodological). 1977;39(1):1–38.
- 21. McLachlan G, Krishnan T. The EM Algorithm and Extensions. Statistics in Medicine. 1998;17:1187–.
- 22. Tucker LR. Some mathematical notes on three-mode factor analysis. Psychometrika. 1966;31(3):279–.
- 23. Steele RJ, Wang N, Raftery AE. Inference from multiple imputation for missing data using mixtures of normals. Statistical Methodology. 2010;7(3):351–64. pmid:20454634; PubMed Central PMCID: PMC2862970.
- 24. Schafer JL, Graham JW. Missing data: Our view of the state of the art. Psychol Methods. 2002;7(2):147–77.
- 25. Soullier N, de La Rochebrochard E, Bouyer J. Multiple imputation for estimation of an occurrence rate in cohorts with attrition and discrete follow-up time points: A simulation study. BMC Medical Research Methodology. 2010;10(1):79.
- 26. Basford KE, Tukey JW. Graphical profiles as an aid to understanding plant breeding experiments. Journal of Statistical Planning and Inference. 1997;57(1):93–107.
- 27. Patterson HD, Thompson R. Recovery of inter-block information when block sizes are unequal. Biometrika. 1971;58(3):545–54.
- 28. Wood SN. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2011;73(1):3–36.
- 29. Gilmour AR, Gogel BJ, Cullis BR, Welham SJ, Thompson R. ASReml User Guide Release 2.0. Hemel Hempstead, UK: VSN International Ltd.; 2006. 267 p.