A theoretical framework for Landsat data modeling based on the matrix variate mean-mixture of normal model

This paper introduces a new family of matrix variate distributions based on the mean-mixture of normal (MMN) models. The properties of the new matrix variate family, namely the stochastic representation, moments and characteristic function, linear and quadratic forms, as well as marginal and conditional distributions, are investigated. Three special cases, including the restricted skew-normal, exponentiated MMN and mixed-Weibull MMN matrix variate distributions, are presented and studied. Based on the specific representation of the proposed model, an EM-type algorithm can be directly implemented to obtain maximum likelihood estimates of the parameters. The usefulness and practical utility of the proposed methodology are illustrated through two simulation studies and an analysis of the Landsat satellite dataset.


Introduction
The skew-normal (SN) distribution, initially introduced by Azzalini [1], has received considerable attention in both theoretical and applied statistics over the past two decades. Various extensions, forms and properties of the SN distribution in the multivariate case were derived in [2][3][4][5] and the references therein. An interesting form of the SN distribution was presented by Pyne et al. [3], who named it the restricted multivariate SN (rSN) model. Generally, the rSN distribution can be expressed as a linear combination of a multivariate normally distributed random vector and a univariate truncated normal random variable. Although the rSN model, like the original SN one, can describe the skewness of data, it is still not robust against outlying observations. To overcome this drawback, Negarestani et al. [6] used the rSN transformation to introduce the family of multivariate mean-mixture of normal (MMN) models. Specifically, a p-dimensional random vector X belongs to the family of MMN distributions if

X =_d μ + λW + Z,   (1)

where '=_d' stands for equality in distribution, Z follows the multivariate normal distribution with zero mean and covariance matrix Σ, and W is an arbitrary random variable independent of Z. It is clear that the rSN distribution is a special case of (1) in which the mixing variable W follows the standard normal distribution truncated to the interval (0, ∞), denoted by W ~ TN(0, 1; (0, ∞)). Negarestani et al. [6] showed that the MMN family can provide models with a wider range of skewness and kurtosis than the rSN, skew-t [4] and skew Student-t-normal [7] distributions. From (1), the probability density function (pdf) of the random vector X can be written as

f_MMN_p(x; μ, λ, Σ, ν) = ∫ φ_p(x; μ + λw, Σ) h(w; ν) dw,   (2)

where φ_p(·;·) denotes the pdf of the multivariate normal distribution and h(·; ν) is the pdf of W, parameterized by the vector ν. The notation X ~ MMN_p(μ, λ, Σ, W) will be used to indicate that X has pdf (2).
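As a quick numerical illustration of the representation (1), the following Python sketch (not part of the paper; the function name `rmmn` and all parameter values are our own choices) draws from the rSN special case, where W is a standard normal truncated to (0, ∞), i.e. half-normal, and checks the implied mean μ + E(W)λ empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

def rmmn(n, mu, lam, Sigma, rW):
    """Draw n samples from the MMN representation X = mu + W*lam + Z,
    with Z ~ N_p(0, Sigma) and W ~ rW, independent of Z."""
    p = len(mu)
    W = rng_draws = rW(n)                      # mixing variable W
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return mu + W[:, None] * lam + Z

mu    = np.array([1.0, -2.0])
lam   = np.array([2.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

# rSN special case: W is standard normal truncated to (0, inf), i.e. half-normal
X = rmmn(200_000, mu, lam, Sigma, lambda n: np.abs(rng.standard_normal(n)))

# E[X] = mu + E[W]*lam, with E[W] = sqrt(2/pi) for the half-normal
print(np.round(X.mean(axis=0), 2))
print(np.round(mu + np.sqrt(2 / np.pi) * lam, 2))
```

The two printed vectors agree up to Monte Carlo error, matching the moment formula implied by (1).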
Depending on the random variable W, which can take values on the whole real line, the pdf (2) can be either symmetric or asymmetric. A more flexible and skewed version of the MMN model is obtained if W has an asymmetric distribution or a distribution with positive support, such as the truncated normal, exponential or gamma distributions. Moreover, the pdf (2) can include skew-elliptical models, such as the rSN distribution, or can yield skew non-elliptically contoured models if, for example, W follows an exponential, Weibull or gamma distribution. From Fig 2 in Appendix A, it is observed that the family of MMN distributions offers a different orientation compared with the family of mean-variance mixture of normal (MVMN) distributions [8].
The matrix variate distribution finds its genesis in modeling dependent multivariate observations in the normal case [9]. Recent uses of the matrix variate normal (MVN) distribution can be found in modeling a wide variety of three-way data arising in control theory, stochastic systems, image recognition, repeated vector measurements, multivariate time series and spatial data, among others [10,11]. Although the MVN distribution inherits appealing properties, features and widespread applications from the multivariate normal model, it is not robust against non-normal features such as asymmetry and heavy tails. To deal with heavy-tailed data, Kshirsagar and Bartlett [12] proposed the matrix variate t distribution by showing that the estimator of the parameter matrix of regression coefficients unconditionally follows a matrix variate t model. Bulut and Arslan [13] proposed the matrix variate slash distribution as a scale mixture of the matrix variate normal and uniform distributions. Moreover, in accommodating skewness and kurtosis, interest in skew distributions provides a platform for robust extensions of matrix variate distributions. For instance, matrix variate versions of the SN distribution can be found in [14][15][16][17]. Even though the matrix variate SN distribution has many attractive properties, it lacks robustness in dealing with heavy-tailed data and poses difficulties in parameter estimation. Regarding these drawbacks of the matrix variate SN model, and considering the aforementioned properties of the MMN family, the objective of this paper is to propose a family of matrix variate mean-mixture of normal (MVMMN) distributions. Properties and features of the introduced family, such as moments, the characteristic function, and marginal and conditional distributions, are studied.
The maximum likelihood (ML) estimates of the model parameters are computed by applying an expectation-maximization (EM) type algorithm [18].
The remainder of this work is organized as follows. The model formulation of the MVMMN distribution is given in Section 2. Properties and characteristics of the MVMMN distribution are studied in Section 3. The parameter estimation procedure using the EM-type algorithm, together with some computational strategies for its implementation, is given in Section 4. To examine the performance of the methodology in practice, simulation and real-world data analyses are presented in Sections 5 and 6. Finally, Section 7 gives some concluding remarks and future extensions.

Proposed family
To start, we introduce some notation and definitions.

Definition 1 A random matrix X ∈ R^{p×n} is said to follow the matrix variate mean-mixture of normal distribution, denoted by X ~ MVMMN_{p,n}(M, Λ, Σ, Ψ, W), if it admits the stochastic representation

X =_d M + WΛ + Z,   (4)

where M, Λ ∈ R^{p×n} are the location and skewness matrices, Z ~ N_{p,n}(0, Σ, Ψ) is matrix variate normal with scale matrices Σ ∈ R^{p×p} and Ψ ∈ R^{n×n}, and W is a scalar mixing random variable with pdf h(w; ν), independent of Z. Equivalently, the model admits the hierarchical representation

X | (W = w) ~ N_{p,n}(M + wΛ, Σ, Ψ),  W ~ h(w; ν).   (5)

The family of mean-variance mixtures of matrix variate normal distributions [19,20] assumes that both the mean and the variance of the population member are not fixed. Therefore, an interesting extension of the MVMMN distribution can be introduced by considering the family of scale mixtures of MVMMN distributions.

Special cases
which completes the proof after some matrix factorizations.
• Convolution with the exponential model: the proof proceeds in a similar manner as that of Proposition 1.

Theorem 1 The MVMMN distribution is log-concave if W has log-concave pdf.
Proof. Based on [21], if f(x) and g(y) are log-concave functions, then their convolution, (f ∗ g)(x) = ∫ f(x − y) g(y) dy, is also log-concave. Hence, the vectorization property (7) of the MVMMN distribution, together with the fact that the MVN density is log-concave, completes the proof whenever W has a log-concave pdf.

Corollary 1 The RMVSN, MVMMNE and MVMMNW distributions are log-concave.
Proof. Since the truncated normal, exponential and Weibull (with shape parameter ≥ 1) distributions are log-concave, their associated matrix variate models are log-concave by Theorem 1.
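Log-concavity can be checked numerically in a univariate slice of the model. The sketch below (our own illustration; parameter values are arbitrary) integrates the mixture density (2) with Exp(1) mixing on a grid and verifies that the discrete second differences of the log-density are non-positive, as Theorem 1 asserts for a log-concave mixing pdf:

```python
import numpy as np

# Univariate MMN density with Exp(1) mixing: f(x) = int_0^inf phi(x; mu + lam*w, sig^2) e^{-w} dw
mu, lam, sig = 0.0, 1.5, 1.0
w  = np.linspace(0.0, 40.0, 4001)      # quadrature grid for the mixing variable W
dw = w[1] - w[0]

def f(x):
    g = np.exp(-0.5 * ((x - mu - lam * w) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
    g = g * np.exp(-w)
    return (g[0] / 2 + g[1:-1].sum() + g[-1] / 2) * dw   # trapezoid rule

x    = np.linspace(-5.0, 12.0, 341)
logf = np.log(np.array([f(xi) for xi in x]))
d2   = np.diff(logf, 2)                # discrete second differences of log f
print(bool(d2.max() < 1e-6))           # concave along the grid, up to quadrature error
```

The same check fails for mixing densities that are not log-concave, which is consistent with the hypothesis of Theorem 1 being needed.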

Characteristics
This section provides some substantial statistical properties of the MVMMN distribution.

Theorem 2
If Y ~ MVMMN_{p,n}(M, Λ, Σ, Ψ, W), then the mean and the characteristic function of Y are, respectively,

E(Y) = M + E(W)Λ  and  φ_Y(T) = exp{i tr(T^⊤M) − (1/2) tr(ΨT^⊤ΣT)} φ_W(tr(T^⊤Λ)),

where φ_W(·) denotes the characteristic function of W.

Proof. The proof is completed by using the representations given in Definition 1. Taking expectations on both sides of the stochastic representation (4) proves the first part. For the second part, recall that the characteristic function of the matrix variate X ~ N_{p,n}(M, Σ, Ψ) is

φ_X(T) = exp{i tr(T^⊤M) − (1/2) tr(ΨT^⊤ΣT)}.

Hence, through the hierarchical representation (5), the characteristic function of Y is obtained by conditioning on W and averaging with respect to its distribution.

Theorem 3 Let Y ~ MVMMN_{p,n}(M, Λ, Σ, Ψ, W). Then (i) vec(Y^⊤) ~ MMN_{pn}(vec(M^⊤), vec(Λ^⊤), Σ ⊗ Ψ, W), and (ii) Y^⊤ ~ MVMMN_{n,p}(M^⊤, Λ^⊤, Ψ, Σ, W).

Proof. Part (i) follows by using the hierarchical representation (5) and applying Theorem 2.3.3 of [22]. For M = 0, part (ii) is clear from part (i); applying the transpose property of the MVN distribution to the hierarchical representation (5) then yields the general case.

Theorem 5 Let Y ~ MVMMN_{p,n}(M, Λ, Σ, Ψ, W), and let B ∈ R^{q×p} and D ∈ R^{n×m}. Then BYD ~ MVMMN_{q,m}(BMD, BΛD, BΣB^⊤, D^⊤ΨD, W).

Proof. The proof is completed by obtaining the characteristic function of BYD, φ_{BYD}(T) = φ_Y(T_1), where T_1 = DT^⊤B. Applying Theorem 2 then yields the characteristic function of MVMMN_{q,m}(BMD, BΛD, BΣB^⊤, D^⊤ΨD, W).

Theorem 6 Let Y ~ MVMMN_{p,n}(M, Λ, Σ, Ψ, W), and partition Y, M, Λ, Σ and Ψ into 2 × 2 blocks, where Y_{11}, M_{11}, Λ_{11} ∈ R^{q×m}, Σ_{11} ∈ R^{q×q} and Ψ_{11} ∈ R^{m×m}. Then (i) Y_{11} ~ MVMMN_{q,m}(M_{11}, Λ_{11}, Σ_{11}, Ψ_{11}, W), and (ii) an analogous result holds for the conditional distributions of the blocks.

Proof. The proof of (i) is completed by choosing appropriate matrices B and D in Theorem 5. Using the hierarchical representation (5) and applying Theorem 2.3.12 of [22], the second part of the theorem is proven.
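The mean formula of Theorem 2 is easy to verify by Monte Carlo. The sketch below (our own illustration; dimensions and matrices are arbitrary) uses the rSN special case with half-normal W, so that E(W) = sqrt(2/π), and Σ = I_p, Ψ = I_n:

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo check of E(Y) = M + E(W)*Lambda for W = |N(0,1)|, E[W] = sqrt(2/pi)
p, n, N = 2, 3, 200_000
M   = np.arange(6, dtype=float).reshape(p, n)
Lam = np.array([[1.0, -1.0, 0.5],
                [0.0,  2.0, -0.5]])

W = np.abs(rng.standard_normal(N))
Z = rng.standard_normal((N, p, n))            # Sigma = I_p, Psi = I_n
Y = M + W[:, None, None] * Lam + Z            # stochastic representation (4)

# entrywise difference between the sample mean and the theoretical mean
print(np.round(Y.mean(axis=0) - (M + np.sqrt(2 / np.pi) * Lam), 2))
```

The printed matrix is zero up to Monte Carlo error, in agreement with Theorem 2.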
Corollary 2 If Y ~ RMVSN(M, Λ, Σ, Ψ), then under the partition of Theorem 6 the marginal Y_{11} again follows an RMVSN distribution. Corollary 3 If Y ~ MVMMNE(M, Λ, Σ, Ψ), then under the partition of Theorem 6 the marginal Y_{11} again follows an MVMMNE distribution. The characterization of the distribution of a matrix quadratic form given by [23] can also be used in the context of the MVMMN family of distributions. Referring to Theorem 2.2 of [23], the distribution of the quadratic form Q = XAX^⊤ is denoted by Q_p(A, M, Σ), where A is an n × n symmetric real matrix of rank r and X ~ N_{p,n}(M, Σ, I_n).
Theorem 8 Let Y ~ MVMMN_{p,n}(M, Λ, Σ, Ψ, W) with W ~ h(w; ν), and let A be an n × n symmetric matrix of rank r. Then, conditionally on W = w, the quadratic form Q = YAY^⊤ and the sum B = Σ_{j=1}^r δ_j B_j are identically distributed, where the δ_j are the non-zero eigenvalues of Ψ^{−1/2}AΨ^{−1/2} and the B_j are independent non-central Wishart matrices, B_j | W = w ~ W_p(1, Σ, m_j m_j^⊤) for j = 1, ..., r, with m_j = Ma_j and a_j the corresponding orthonormal eigenvectors (a_j^⊤ a_j = 1). Proof. Using the hierarchical representation (5), conditionally on W = w the matrix Y is matrix variate normal. On the other hand, through Theorem 2.2 of [23], the stated eigen-decomposition holds for the matrix normal case. Therefore, the random matrices Q and B have identical distributions.
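The ingredients of Theorem 8 are easy to compute for a toy case. The sketch below (our own example; Ψ and A are arbitrary) extracts the r non-zero eigenvalues δ_j and orthonormal eigenvectors a_j of Ψ^{−1/2}AΨ^{−1/2}:

```python
import numpy as np

Psi = np.diag([1.0, 2.0, 4.0])
A   = np.array([[2.0, 1.0, 0.0],
                [1.0, 2.0, 0.0],
                [0.0, 0.0, 0.0]])          # symmetric, rank r = 2

Psi_mh = np.diag(1 / np.sqrt(np.diag(Psi)))  # Psi^{-1/2} (diagonal case)
delta, a = np.linalg.eigh(Psi_mh @ A @ Psi_mh)

nz = np.abs(delta) > 1e-12                 # keep the r non-zero eigenvalues
print(np.round(delta[nz], 3))              # [0.634 2.366]
print(np.allclose(a.T @ a, np.eye(3)))     # eigenvectors are orthonormal: True
```

With these δ_j and a_j (and the implied m_j = Ma_j), the conditional distribution of Q is the weighted sum of rank-one non-central Wishart matrices stated in the theorem.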
To obtain the ML estimate of Θ, an EM-type algorithm is implemented as a powerful estimation approach for dealing with unobserved (missing and/or censored) data and latent variables [18]. The computations of the EM algorithm consist of two iterative steps. In the E-step, the expected value of the complete-data log-likelihood function, i.e. the likelihood of the observed data together with the latent variables, is computed, while in the M-step, the parameter estimates are updated by maximizing this expected value.
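As a generic illustration of these two steps (deliberately not the MVMMN updates, which are derived below), the following Python sketch runs EM on a toy two-component Gaussian mixture with known unit variances, alternating the E-step (posterior responsibilities) and the M-step (weighted means):

```python
import numpy as np

rng = np.random.default_rng(3)

# toy data: two well-separated normal components with unit variance
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

pi, mu = 0.5, np.array([-1.0, 1.0])        # initial guesses
for _ in range(100):
    # E-step: posterior responsibility of component 2 for each observation
    d1 = (1 - pi) * np.exp(-0.5 * (x - mu[0]) ** 2)
    d2 = pi * np.exp(-0.5 * (x - mu[1]) ** 2)
    r  = d2 / (d1 + d2)
    # M-step: maximize the expected complete-data log-likelihood
    pi = r.mean()
    mu = np.array([np.sum((1 - r) * x) / np.sum(1 - r),
                   np.sum(r * x) / np.sum(r)])

print(np.round(mu, 1))   # close to the true component means (-2, 3)
```

The MVMMN case follows the same pattern, with the latent variables w_i playing the role of the component labels.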
Through the hierarchical representation (5), the complete-data log-likelihood function of Θ, obtained by introducing the latent variables W = (w_1, ..., w_N) and omitting additive constants, is

ℓ_c(Θ) = −(Nn/2) log|Σ| − (Np/2) log|Ψ| − (1/2) Σ_{i=1}^N tr{Σ^{−1}(Y_i − M − w_iΛ)Ψ^{−1}(Y_i − M − w_iΛ)^⊤} + Σ_{i=1}^N log h(w_i; ν).   (10)

ML estimation of Θ is performed using the expectation-conditional maximization (ECM; [24]) algorithm as follows.
• Initialization: Set the iteration counter to k = 0 and choose a reasonable starting point Θ^(0) = (M^(0), Λ^(0), Σ^(0), Ψ^(0), ν^(0)). In our data examples the parameters are initialized by M^(0) = Σ_{i=1}^N Y_i / N, Λ^(0) = 1_{p×n}, Σ^(0) = c_1 I_p and Ψ^(0) = c_2 I_n, where 1_{p×n} is a p × n matrix of unit elements and the constants c_1 and c_2 are computed from the observed data.
• E-step: Compute the expected value of the complete-data log-likelihood function (10), called the Q-function, given the observed data and the current estimate Θ^(k).
• First CM-step: Update M and Λ by maximizing the Q-function with respect to M and Λ.
• Second CM-step: Update Σ and Ψ by maximizing the Q-function with respect to Σ and Ψ, respectively.
• Third CM-step: Update the additional parameter ν, whose form depends on the distribution of W_i, to obtain ν^(k+1).

The conditional expectations required in (11) can be obtained from Lemma 1 and Propositions 1, 2 and 3 for our three considered models. Furthermore, we note that in all special cases considered in Section 4, the distribution of the mixing random variable W is parameter free; therefore, the last CM-step of the ECM algorithm is not necessary.

Computational aspects
4.1.1 Convergence. The EM algorithm is iterated until a suitable convergence rule, such as max‖Θ^(k+1) − Θ^(k)‖ ≤ ε or |ℓ(Θ^(k+1)) − ℓ(Θ^(k))| ≤ ε, is satisfied, where ε is a user-specified tolerance and ℓ(·) is defined in (9). An alternative approach to determining convergence of the EM algorithm is the Aitken acceleration method [25]. In this approach, following [26], the asymptotic estimate of the log-likelihood at iteration k + 1 is obtained as

ℓ_∞(Θ^(k+1)) = ℓ(Θ^(k)) + (ℓ(Θ^(k+1)) − ℓ(Θ^(k))) / (1 − a^(k)),

where a^(k) = (ℓ(Θ^(k+1)) − ℓ(Θ^(k))) / (ℓ(Θ^(k)) − ℓ(Θ^(k−1))) is the Aitken acceleration factor. The algorithm can then be considered to have converged at iteration k + 1 when ℓ_∞(Θ^(k+1)) − ℓ(Θ^(k+1)) < ε [27]. In our study, the tolerance ε is set to 10^{−5}.
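The Aitken stopping rule above can be sketched in a few lines (our own illustration; the synthetic log-likelihood trace is made up to mimic the geometric convergence EM typically exhibits):

```python
def aitken_converged(ll_prev2, ll_prev, ll_curr, eps=1e-5):
    """Aitken stopping rule from three successive log-likelihood values."""
    a = (ll_curr - ll_prev) / (ll_prev - ll_prev2)     # acceleration factor a^(k)
    ll_inf = ll_prev + (ll_curr - ll_prev) / (1 - a)   # asymptotic estimate
    return (ll_inf - ll_curr) < eps

# early in a run the projected limit is still far away: not converged
print(aitken_converged(-110.0, -101.0, -100.1))   # False

# synthetic geometric trace ll_k = -100 - 10 * 0.5^k, limit -100
ll = lambda k: -100.0 - 10.0 * 0.5 ** k
k = 2
while not aitken_converged(ll(k - 2), ll(k - 1), ll(k)):
    k += 1
print(k)   # stops once the asymptotic estimate is within 1e-5: k = 20
```

Because the rule compares the current log-likelihood with its projected limit rather than with the previous iterate, it avoids stopping too early when EM is in its slow, nearly linear phase.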

Model selection.
The competing models in our data analysis are compared using the two most commonly used measures, the Akaike information criterion (AIC; [28]) and the Bayesian information criterion (BIC; [29]), defined as

AIC = 2m − 2ℓ_max  and  BIC = m log N − 2ℓ_max,

where m is the number of free parameters, N is the sample size and ℓ_max is the maximized log-likelihood value. Models with lower values of AIC or BIC are preferred.
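The two criteria are one-liners; the sketch below computes them for three hypothetical fits (the model names follow the paper, but the log-likelihoods and parameter counts are made-up illustration values, not results from the paper):

```python
import numpy as np

def aic(llmax, m):
    return 2 * m - 2 * llmax          # AIC = 2m - 2*llmax

def bic(llmax, m, N):
    return m * np.log(N) - 2 * llmax  # BIC = m*log(N) - 2*llmax

N = 461   # e.g. the red-soil class size
for name, llmax, m in [("RMVSN",  -1000.0, 50),
                       ("MVMMNE",  -990.0, 50),
                       ("MVMMNW",  -985.0, 51)]:
    print(name, round(aic(llmax, m), 1), round(bic(llmax, m, N), 1))
```

Note that BIC penalizes each extra parameter by log N > 2 for N > 7, so it favors more parsimonious models than AIC as the sample grows.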

Simulation studies
In this section, the performance of the proposed model and its computational method is illustrated through two simulation studies. The first study compares the special cases of the MVMMN model in dealing with skewed and leptokurtic simulated data. The second study examines whether the proposed ECM algorithm exhibits good asymptotic properties.

Example 1 Model performance
In this experiment, simulated data are generated from a matrix variate normal inverse Gaussian (MVNIG; [20]) distribution with sample sizes N = 50, 100, 500, 1000 and 2000, in order to compare the performance of the three special cases of the MVMMN model. The MVNIG distribution belongs to the family of matrix variate mean-variance mixture of normal models, where the mixing random variable follows GIG(−0.5, χ, ψ), with GIG(κ, χ, ψ) denoting the generalized inverse Gaussian distribution [30]. We consider this matrix variate distribution for generating non-normal data as it offers the desired levels of asymmetry and leptokurtosis. We let χ = ψ = 3. Table 2 shows the average Frobenius norms of (M − M̂_i), (Λ − Λ̂_i), (Σ − Σ̂_i) and (Ψ − Ψ̂_i), where M̂_i, Λ̂_i, Σ̂_i and Ψ̂_i are the ML estimates of the fitted model in the ith replication. It is observed that the Frobenius norms decrease as the sample size increases. We can also see that the Frobenius norms for Σ and Ψ are very close to each other across all models, while the MVMMNE model yields the least accurate estimates of M and Λ.

Example 2 Performance of the model under AR(1) dependent structure
In order to investigate the effect of an autoregressive (AR(1)) dependence structure in Σ and Λ on the parameter estimates, we conduct another Monte Carlo simulation. In this experiment, we set M = 0 and Ψ^{−1} = I_4, with λ = 0.5, 2 and ρ = 0.5, 0.8. For generating a random sample from the MVMMN model, the value 0.001 is added to the diagonal elements of Σ to ensure that it is a positive definite matrix.
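The exact AR(1) parameterization used in the experiment is not reproduced in the surrounding text; the sketch below assumes the common form Σ[i, j] = λρ^{|i−j|} (an assumption on our part) and checks positive definiteness:

```python
import numpy as np

def ar1_scale(p, lam, rho):
    """AR(1)-structured scale matrix: Sigma[i, j] = lam * rho**|i - j|."""
    idx = np.arange(p)
    return lam * rho ** np.abs(idx[:, None] - idx[None, :])

Sigma = ar1_scale(4, lam=2.0, rho=0.5)
print(Sigma)
print(np.linalg.eigvalsh(Sigma).min() > 0)   # positive definite for |rho| < 1: True
```

Under this parameterization the matrix is already positive definite for |ρ| < 1, so the small diagonal perturbation mentioned above serves only as a numerical safeguard.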
In each of 200 replications, we generate data from the MVNIG distribution with the true parameter values displayed above and χ = ψ = 3, for sample sizes N = 100 and 1000. By fitting the RMVSN, MVMMNE and MVMMNW distributions to the generated data, the Frobenius norms of (M − M̂), (Λ − Λ̂), (Σ − Σ̂) and (Ψ − Ψ̂) are obtained. Table 3 summarizes the average Frobenius norms.

Analysis of Landsat data
To investigate the performance of the developed model in real-world data analysis, we consider the Landsat satellite data (LSD), originally obtained by NASA and available at the UC Irvine machine learning repository (http://archive.ics.uci.edu/ml). Each line of the LSD contains four spectral values for each of the nine pixels in a neighborhood of a satellite image. In other words, each line of the LSD corresponds to a matrix of observations of dimension 4 × 9. Moreover, each LSD matrix of observations belongs to one of six classes, namely red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble, and very damp grey soil. In our analysis, we focus on two classes, red soil and cotton crop, with sizes 461 and 224, respectively, for illustrative purposes. We fitted the RMVSN, MVMMNE and MVMMNW distributions by implementing the ECM algorithm. Table 5 shows a summary of the ML fitting results, including the parameter estimates, maximized log-likelihood values, AIC and BIC of the three fitted models. It is observed that the MVMMNW and MVMMNE distributions outperform the others for the red soil and cotton crop data, respectively. Based on the values of the shape matrix Λ, it is clear that the estimated skewness parameters are moderately to highly significant, showing that the distribution of the matrix observations is skewed. Moreover, the estimated scale matrices Σ and Ψ highlight the covariance structure in the data.

Conclusion
This paper has introduced a new family of matrix variate distributions whose component pdfs arise from the mean-mixture of the matrix variate normal model. Some properties and characteristics, as well as three special cases of the new model, are derived. We have developed a computationally feasible EM-based algorithm for calibrating the matrix parameters to the data. It is shown that the MVMMN distribution is closed under the formation of marginal and conditional distributions and under affine transformations, which makes it flexible for use in various fields of three-way data analysis, such as multivariate time series, image processing and longitudinal data analysis. Simulation results show that the ML estimates obtained via the ECM algorithm are empirically consistent. Moreover, numerical results from the application to a real dataset reveal that the proposed model is well suited to skewed matrix variate experimental data.
The utility of our current approach can be extended to accommodate censored data, building on recent work in the multivariate case by [31,32]. It may also be interesting to propose a family of scale mixtures of MVMMN distributions to deal with heavy-tailed three-way data. Another possible extension of the work herein is to consider finite mixture models based on the MVMMN distribution as a promising tool for classification and clustering of heterogeneous, asymmetric matrix-valued data [19,33]. It would also be of interest to derive the distributions of the eigenvalues associated with the quadratic form (Theorem 8; in the complex case) in order to compute the channel capacity in wireless communication systems, since experimental data do not necessarily follow a normal distribution (see [34,35]). All computations were carried out in the R language, and the computer program is available from the first author upon request.