## Figures

## Abstract

The connections between the error function used in multilinear regression and the expected, or assumed, properties of the data are investigated. It is shown that two of the most basic properties often required in data analysis, scale and rotational invariance, are incompatible. With this, it is established that multilinear regression using an error function derived from a geometric mean is both scale and reflectively invariant. The resulting error function is also shown to have the property that its minimizer, under certain conditions, is well approximated using the centroid of the error simplex. It is then applied to several multidimensional real world data sets, and compared to other regression methods.

**Citation: **Holmes MH, Caiola M (2018) Invariance properties for the error function used for multilinear regression. PLoS ONE 13(12):
e0208793.
https://doi.org/10.1371/journal.pone.0208793

**Editor: **J. Malo,
Universitat de Valencia, SPAIN

**Received: **February 27, 2018; **Accepted: **November 26, 2018; **Published: ** December 26, 2018

**Copyright: ** © 2018 Holmes, Caiola. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability: **The data sets used are all publicly available: EMAP Surface Waters Lake Database: 1991-1994 Northeast Lakes Water Chemistry Data Summarized by Lake https://archive.epa.gov/emap/archive-emap/web/html/nelakes.html MAIA-Estuaries Summary Database: 1997 and 1998 Stations https://archive.epa.gov/emap/archive-emap/web/html/index-70.html National Data Buoy Center: Historical Meteorological Data - Station 46006 www.ndbc.noaa.gov/histsearch.php?station=46006 U.S. Census Bureau Statistical Abstract of the United States: 2012 Modeling wine preferences by data mining from physicochemical properties http://www.sciencedirect.com/science/article/pii/S0167923609001377.

**Funding: **This work was supported by National Science Foundation (MH and MC received funding) (https://www.nsf.gov/div/index.jsp?div=DMS). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

The problem considered here concerns the modeling assumptions made in multilinear regression, and their role in determining the error function. To provide a simple example, a principal component analysis (PCA) is used to find low dimensional subspace approximations of a data set. It has the property that if the data set is rotated, and a PCA is used, the rotated versions of the same low dimensional subspace approximations are obtained. This means that a PCA is rotationally invariant. This is one of the reasons that it is often used in face recognition [1, 2], visual tracking [3], and other pattern recognition problems.

In contrast, linear least squares is not rotationally invariant. However, unlike a PCA, it is scale invariant. What this means is that if you scale the variables in the data set, the resulting minimizer is the scaled version of what is obtained for the unscaled case (this is explained more precisely in the next section). Scale invariance is important, for example, if you want your minimizer to be independent of what units are used (e.g., inches versus centimeters). The fact that a PCA is scale dependent, and that it is possible to be fairly sensitive to the scaling, is well-known [4, 5].

A third type of invariance, which will play a central role in this paper, concerns the order the variables are listed or labeled. Specifically, if the minimizer is unaffected by the reordering of the variables, the result is said to be reflectively invariant. A simple example of where this is important is edge detection, where the minimizer should be unaffected by which axis is labeled *x*, *y* or *z*. It is not hard to show that a PCA is reflectively invariant, but that linear least squares is not. It needs to be pointed out that there are different forms of reflective invariance depending on the hyperplane used for the reflection. As an example, in computer vision it is desirable to be able to recognize an object regardless of whether you are looking at it, or at its horizontal reflection as seen when looking in a mirror [6]. In this case the reflection is through the vertical plane. There is also some variation in how to refer to reflective invariance as used here. As a case in point, it has been referred to as neutral data fitting because, for this form of invariance, the variables must be treated symmetrically [7, 8].

Obtaining a regression result with particular invariance properties has been considered earlier, although often framed in somewhat different language. One of the earliest studies concerned reflective invariance for a bivariate problem using area as a measure of error [9]. This is the *E*_{A} example illustrated in Fig 1. This was the impetus for Samuelson [10] to propose certain invariant properties one might expect, or require, of the error function. Since then numerous attempts have been made to find error functions that have one or more of these properties. An example of a straightforward approach for two variable linear regression is to concentrate on the slope of the regression line. It has been proposed that, after determining the lines for vertical and horizontal least squares, *E*_{y} and *E*_{x} in Fig 1, to simply use the line whose slope is determined using an averaging method involving the two slopes. A review and comparison of these, and similar methods, can be found in [11, 12].

A more fundamental approach is to concentrate on the error function. One possibility is to use true distance, *E*_{D} in Fig 1, and this is easily generalized to multilinear regression (a PCA is an example). Another possibility is to use area, *E*_{A} in Fig 1, which is what Woolley used, and this gives rise to what is called least product regression, or least area regression. Work has been done on how to generalize Woolley’s idea, and use symmetry methods to obtain invariant error functions in two and three dimensions [7, 8]. However, this is not easily generalized to multilinear regression. An alternative, which is most relevant to the present study, is to use the geometric mean for each data point and then find the least squares value for this function [13]. This differs from the geometric mean considered here, which involves the geometric mean of the ordinary least squares error functions. The exception to this statement occurs when a hyperplane approximation is used, in which case the two formulations are equivalent. In the current study a hyperplane approximation is considered, but so are the other lower dimensional approximations that are possible (similar to what is done using a PCA).

In the next section it is shown that scale and rotational invariance are incompatible. This is done, for the case of two variables, by first characterizing mathematically what is needed to obtain particular invariant properties, and then to use similarity methods to demonstrate the incompatibility. In addition, in this two-dimensional setting, an error function that has several important invariance properties is formulated and discussed. In the subsequent two sections, the extension of this function to multilinear linear regression problems is considered, which includes showing that the minimizer is well-approximated using the centroid of an error simplex. With this, the error function is used to analyze real world data sets and compared to other regression methods.

## Two variables

To begin, we start with the case of two variables. Assume that the data are (*x*_{1}, *y*_{1}), (*x*_{2}, *y*_{2}), ⋯, (*x*_{n}, *y*_{n}), and they are centered. This means that ∑*x*_{i} = ∑*y*_{i} = 0. It is assumed in what follows that the data vectors **x** = (*x*_{1}, *x*_{2}, ⋯, *x*_{n})^{T} and **y** = (*y*_{1}, *y*_{2}, ⋯, *y*_{n})^{T} are not orthogonal.

For the model function *y* = *αx*, there are numerous ways to measure the error and four possibilities are shown in Fig 1. For a PCA the true distance is used, and this leads to the error function

Because of the denominator, for this expression to be defined, *α* must be dimensionless. What this means is that, when *x* and *y* have different dimensions, it is first necessary to nondimensionalize the variables before writing down the formula for the error function. So, suppose the data are scaled as *X*_{i} = *x*_{i}/*S*_{x} and *Y*_{i} = *y*_{i}/*S*_{y}, where *S*_{x} and *S*_{y} are positive. The model function is now , and the corresponding error function is

Minimizing this, and then transforming back to dimensional variables, one finds that (1) where (2) for (3)

In the above expression, the + is used if **X** ⋅ **Y** > 0 and the − is used if **X** ⋅ **Y** < 0. What is evident from this calculation is that *α* depends on *S*_{x} and *S*_{y}. Typical choices for these scaling factors include *S*_{x} = ||**x**||_{∞}, , and *S*_{x} = ||**x**||_{1}/*n* (with similar expressions for *S*_{y}). Depending on the scatter in the data, the values of these quantities can be significantly different and this can result in rather dramatic differences in the corresponding value of *α*.

### Scale invariance

There is a simple test for scale invariance that comes from the above analysis. To derive it, in the original *x*, *y*-coordinates and using the model equation *y* = *αx*, whatever regression procedure is used will result in the slope *α* depending on the data. This is written as *α* = *α*(**x**, **y**). Using the scaling *X* = *x*/*S*_{x} and *Y* = *y*/*S*_{y}, the model equation becomes , where . Now, scale invariance requires that . Combining these two results, the requirement for scale invariance is that
(4)
for any positive values of *S*_{x} and *S*_{y}. Using the infinitesimal generators *S*_{x} = 1 + *ε* and *S*_{y} = 1 + *δ*, then the above equation takes the form [14]
where ∇_{x} is the gradient in the **x** variables (and similarly for ∇_{y}). The *O*(*ε*) and *O*(*δ*) requirements are that

These are easily solved using the *n*-dimensional version of spherical coordinates. Specifically, letting
where *r* = ||**x**||_{2} and the *ϕ*_{i}’s are the angular coordinates, then **x** ⋅ ∇_{x} *α* = −*α* reduces to *r*∂_{r} *α* = −*α*. The general solution of this is *α* = *c*/*r*, where *c* can depend on the *ϕ*_{i}’s. Doing something similar for **y**, the conclusion is that, to be scale invariant, the minimizer must depend on the data as
(5)
where *F* can depend on the values of the angular coordinates for **x** and **y**.

### Rotational invariance

To determine the requirement for rotational invariance, assuming the angle of rotation is *θ*, then

The model function now has the form *y*′ = *α*′ *x*′, and invariance requires that

Writing the minimizer of the error function as *α* = *f*(**x**, **y**), then *α*′ = *f*(**x**′, **y**′) results in the requirement that

Using the infinitesimal generator *θ* = *ε*, then the *O*(*ε*) requirement is
(6)
where *f* = *f*(**x**, **y**). As it should, the solution in Eqs (2) and (3) satisfies this nonlinear partial differential equation. What does not satisfy the equation is Eq (5). It is easiest to illustrate this assuming there are only two data points. In this case, the 2-dimensional spherical coordinates used in Eq (5) can be written as *x*_{1} = *r* cos *k*, *x*_{2} = *r* sin *k*, *y*_{1} = *R* cos *K*, and *y*_{2} = *R* sin *K*. Substituting Eq (5) into Eq (6), and reducing gives

Given that *F* is independent of *r* and *R*, the above equation leads to the following two equations
and

These equations are solvable by introducing the change of variables *s* = *k* − *K*, *t* = *k* + *K*, and from this one finds that it is not possible to find a function *F* that satisfies both equations. In other words, there is no function which satisfies both Eqs (5) and (6).

### Scale and reflectively invariant error function

The conclusion from the above analysis is that it is not possible for the minimizer to be both scale and rotationally invariant. It is known that there are rotation and reflectively invariant error functions, and an example is one that uses true distance (*E*_{D} in Fig 1). So, the question considered here is whether there are error functions which satisfy all of the stated conditions except for rotational invariance.

It is relatively easy to find scale invariant error functions. For example, using the usual (vertical) least squares error
(7)
the minimum occurs when *α* = **x** ⋅ **y**/**x** ⋅ **x**. Similarly, using the (horizontal) least squares error
(8)
the minimum occurs when *α* = **y** ⋅ **y**/**x** ⋅ **y**. Both of these are scale invariant. What *E*_{x} and *E*_{y} are not, however, are reflectively invariant. To be reflectively invariant it is required that irrespective of which variables are considered independent or dependent, that an equivalent result is obtained. This means that the minimizer of *E*_{x} is the same as the one obtained for *E*_{y}. Mathematically, the requirement is that
(9)

The error function to be considered here is based on the geometric mean of the ordinary least squares error functions. In the case of two variables, the error function is
(10)
where *E*_{x} and *E*_{y} are given in Eqs (7) and (8), respectively. Minimizing this one finds that
(11)
where the + is used if **x** ⋅ **y** > 0 and the − is used if **x** ⋅ **y** < 0. This satisfies the change of variables condition Eq (4), and this guarantees *α* is scale invariant. It is also reflectively invariant because it satisfies Eq (9). Another way to conclude that it is reflectively invariant is to note that the error function Eq (10) is a symmetric function of *E*_{x} and *E*_{y}.

The minimizer in Eq (11) is well-known and can be obtained in a number of ways. This includes using geometric mean regression [9], using the geometric means of the minimizers for Eqs (7) and (8), and by using simple symmetry arguments [10]. What is not known is a way to generalize it to multilinear regression to obtain scale and reflectively invariant low dimensional approximations of data. What is presented below is one way this might be possible.

For ordinary least squares, one of the standard measures on how well the linear model fits the data is the coefficient of determination *R*^{2}. For the multidimensional generalizations of Eq (10) considered later, a natural measure of fit involves the centroid of the error simplex. To explain what this is for this two dimensional problem, and connect it with *R*, let *α*_{x} and *α*_{y} be the minimizers of *E*_{x} and *E*_{y}, respectively. The centroid in this case is *α*_{c} = (*α*_{x} + *α*_{y})/2. The error measure to be introduced concerns how *α*, the minimizer of *E*, differs from the centroid, relative to the width of the simplex. The resulting formula is

Now, using the law of cosines **x** ⋅ **y** = ||**x**||_{2} ⋅ ||**y**||_{2}cos *θ*, where *θ* is the angle between **x** and **y**, then *α*_{x} = *α*/cos *θ* and *α*_{y} = *α* cos *θ*. Moreover, *R*^{2} = cos^{2} *θ*, which gives a (signed) value of *R* = cos *θ*. Combining these formulas, the result is

Consequently, the better the fit (the closer *R*^{2} is to one), the closer the minimizer of *E* is to the centroid of the error simplex formed from the minimizers of *E*_{x} and *E*_{y}.

## Multilinear regression: Single component approximation

It is now assumed that there are *m* variables, so **p** = (*p*_{1}, *p*_{2}, ⋯, *p*_{m})^{T}. The centered data vectors for each variable are **p**_{1} = (*p*_{11}, *p*_{12}, ⋯, *p*_{1n})^{T}, **p**_{2} = (*p*_{21}, *p*_{22}, ⋯, *p*_{2n})^{T}, ⋯, **p**_{m} = (*p*_{m1}, *p*_{m2}, ⋯, *p*_{mn})^{T}. It is assumed that no two of these vectors are orthogonal.

One of the central components of a PCA is the ability to find low dimensional approximations of the form **p** = *α*_{1}**v**_{1} + ⋯+ *α*_{k}**v**_{k}, where 1 ≤ *k* < *m*. The question is whether something similar can be done for a scale and reflectively invariant approximation. One possibility, which is pursued here, is to use the geometric mean of the ordinary least squares functions.

We begin with a one dimensional subspace, which means that **p** = *α***v**. In what follows this is rewritten as *p*_{2} = *α*_{2}*p*_{1}, *p*_{3} = *α*_{3}*p*_{1}, ⋯, *p*_{m} = *α*_{m}*p*_{1}. The corresponding individual error functions are then

Letting *α*_{1} = 1, the general form of the above can be written as
(12)

After factoring, this can be written in the more compact form (13)

It is worth pointing out that if one of the *E*_{j}’s is zero, then they are all zero.

The resulting error function, which comes from the geometric mean of the *E*_{j}’s, is *E* = (*E*_{1} *E*_{2}⋯*E*_{m})^{1/m}. As stated earlier, this is reflectively invariant because of the symmetric dependence of *E* on the individual error functions.

To make the connection between the minimizer of *E* and the error centroid, the minimizer for each *E*_{j} is needed. Finding them is straightforward. Namely, setting the first partials of *E*_{j} to zero, one finds that
(14)
and, for *i* ≠ *j*,
(15)

Note that these use the stated assumption that the data vectors are not orthogonal. Also, the above expressions hold in the case of when *j* = 1 because *α*_{1} = 1.

Assume now that the data are close to linear, which means that **p**_{i} = *α*_{i0}**p**_{10} + *ε***p**_{i1}, for *i* = 1, 2, ⋯, *m*. Given a value for *j* in Eq (13), the corresponding asymptotic expansions of its minimizing coefficients, for small *ε*, have the form

Note that in the case *i* = 1, the coefficients are *α*_{10} = 1 and *α*_{11} = *α*_{12} = 0. The *α*_{i0}’s are assumed to be known, and the other coefficients are determined by minimizing the error. In preparation for this, note that
where (*i* = *j*, *k*)

From Eq (13) it follows that

Minimizing this determines the *α*_{i1}’s, and this yields
(16)

What is significant about this is that the first two terms in the expansions for the *α*_{i}’s do not depend on *j*. In other words, to *O*(*ε*), the *E*_{j}’s have the same minimizer. Because of this, it immediately follows that through terms of order *ε*, the minimizer of the error function *E* equals the centroid formed from the minimizers of the *E*_{j}’s. Expressed mathematically, if ** α** is the minimizer of

*E*, and

*α*_{j}the minimizer of

*E*

_{j}, then (17)

It also follows from this analysis that, for small *ε*, the error function *E* is strictly convex in the neighborhood of the minimum.

### Example in

As an example, in , for the line with *α*_{2} = 3 and *α*_{3} = 5, 40 randomized points within a distance of 0.2 of the line are used for the data (see Fig 2). For notational simplicity, let *p*_{1} = *x*, *p*_{2} = *y*, *p*_{3} = *z*, *α*_{2} = *α*, and *α*_{3} = *β*. The location of the minimizer of *E* was found using MATLAB’s *fminsearch* command, and the location is shown in Fig 3. Also shown are the locations of the minimizers for *E*_{x}, *E*_{y}, and *E*_{z}, determined by Eqs (14) and (15), as well as the associated triangular region formed by these three points. For comparison, the location of the solution as determined using an unscaled PCA is shown. An important observation coming from this figure is that the minimizer for the scale (and reflectively) invariant error function *E* = (*E*_{x}*E*_{y}*E*_{z})^{1/3} is located near the centroid of the triangle. This is expected because of Eq (17).

The red line is the linear fit determined by minimizing *E* = (*E*_{x}*E*_{y}*E*_{z})^{1/3}.

Vertices of the triangle (simplex) are the locations of the minimizers of *E*_{x}, *E*_{y}, and *E*_{z}.

It is worth having a way to characterize how close the minimizer and centroid are, to provide a measure for the goodness of fit. One possibility is to use the maximum distance relative to the width of the simplex, as given by the formula
(18)
where (*α*_{c}, *β*_{c}) is the centroid, *w*_{α} = max{*α*_{x}, *α*_{y}, *α*_{z}} − min{*α*_{x}, *α*_{y}, *α*_{z}}, *w*_{β} = max{*β*_{x}, *β*_{y}, *β*_{z}} − min{*β*_{x}, *β*_{y}, *β*_{z}}. In these expressions, (*α*_{x}, *β*_{x}), (*α*_{y}, *β*_{y}), and (*α*_{z}, *β*_{z}) are the minimizers of *E*_{x}, *E*_{y}, and *E*_{z}, respectively, and they determine the error simplex in Fig 3. For the solution shown in Fig 3, *C*_{F} ≈ 0.01.

Finally, in finding the minimizer the question of whether or not the error function is convex arises. To verify this, its contour and surface plot are shown in Fig 4. Note that, because of the assumed form of the model function in this example, the error functions *E*_{x}, *E*_{y}, and *E*_{z}, and *E* are undefined when either *α* or *β* are zero. The statement that *E* is convex refers to its dependence on *α* and *β* in the quadrant in which the minimizers are located.

## Multilinear regression: (*m* − 1)-component approximation

The assumptions on the data are the same as for the one-dimensional approximation considered above. As for the model function, it is a hyperplane that is written as *p*_{m} = *α*_{1}*p*_{1} + *α*_{2} *p*_{2}+ ⋯+ *α*_{m−1} *p*_{m−1}. The associated individual error functions are
(19)
where *α*_{m} = 1. With this, the composite error function is
(20)

Letting ** α** = (

*α*

_{1},

*α*

_{2}, ⋯,

*α*

_{m−1}, −1)

^{T}and

**P**= (

**p**

_{1}

**p**

_{2}⋯

**p**

_{m}), then this error function can be written as (21)

To obtain the connection between the minimizer of *E* and the centroid of the error simplex we need the minimizers for the *E*_{j}’s. To determine them, note that for *k* = 1, 2, ⋯, *m* − 1,

Also, from Eq (19), . Consequently,
where *δ*_{kj} is the Kronecker delta. To determine the equation to solve to find the minimizer, consider the case for *E*_{1}. Setting ∇_{α}*E*_{1} = **0**, yields the (*m* − 1) × (*m* − 1) system of equations
(22)
where **A** consists of the first *m* − 1 rows of **P**^{T}**P**. Since **P α** ⋅

**P**=

*α***⋅ (**

*α***P**

^{T}

**P**)

**, then, letting**

*α***r**

^{T}be the

*m*th row of

**P**

^{T}

**P**,

This means that Eq (22) can be replaced with the equation **R α** =

**0**, where

**R**is the matrix obtained by removing the first row from

**P**

^{T}

**P**. In a similar manner, the minimizer for

*E*

_{j}is found by solving the equation obtained by removing the

*j*th row from

**P**

^{T}

**P**. To be more explicit about what equation needs to be solved, for

*E*

_{1}, it is

In general, the coefficient matrix for *E*_{j} is the (*m* − 1) × (*m* − 1) matrix that is obtained by removing the *j*th row and *m*th column from **P**^{T}**P**, and the right hand side is the (*m* − 1)-vector that is obtained by removing the *j*th entry in the *m*th column of **P**^{T}**P**.

To establish the connection between the centroid and the minimizer of *E*, assume that the data are close to planar. Specifically, letting **P** = **P**_{0} + *ε* **P**_{1}, then for small *ε*, ** α** ∼

**α**_{0}+

*ε*

**α**_{1}+

*ε*

^{2}

**α**_{2}+ ⋯, where

**P**

_{0}

**α**_{0}=

**0**. Now, setting ∇

_{α}

*E*=

**0**, the problem to solve is (23)

Also, **A** = **A**_{0} + *ε***A**_{1} + *ε*^{2}**A**_{2} + ⋯, where **A**_{0}, **A**_{1}, **A**_{2} are the first (*m* − 1)-rows of , , and , respectively. With this, the *O*(*ε*) problem that comes from Eq (23) is
(24)
and the *O*(*ε*^{2}) problem is
(25)

In comparison, using the same form for the expansions for *E*_{1}, then one finds from Eq (22) that the *O*(*ε*) equation is the same as the one given in Eq (24). This is also true for the other *E*_{j}’s. Therefore, the **α**_{1} term for the centroid and for minimizer of *E* are equal (as is the **α**_{0} term). As for the *O*(*ε*^{2}) term, for *E*_{1}, one finds from Eq (22) that the problem to solve is
(26)

The equations for the other *E*_{j}’s are the same except for the appropriate modification of the first vector on the right hand side of the equation. Given that *α*_{0} and *α*_{1} are the same for *E* and the *E*_{j}’s, it follows that the *O*(*ε*^{2}) term in the centroid and the minimizer of *E* are equal. Therefore, the conclusion is that the minimizer of *E* and the centroid are equal through terms of order *ε*^{2}. Expressed mathematically, if ** α** is the minimizer of

*E*, and

*α*_{j}the minimizer of

*E*

_{j}, then (27)

### Example in

As before, for notational simplicity, let *p*_{1} = *x*, *p*_{2} = *y*, *p*_{3} = *z*, *α*_{2} = *α*, and *α*_{3} = *β*. The model function can then be written as *z* = *αx* + *βy*. In this case, from Eq (19), the individual error functions are:
(28) (29) (30)

The minimizer *α*_{x} of *E*_{x} is
(31)
the minimizer *α*_{y} of *E*_{y} is
(32)
and the minimizer *α*_{z} of *E*_{z} is
(33)

The resulting error function based on the geometric mean is (34)

Taking the first partials of this, and setting them to zero, one obtains the (nonlinear) system (35) (36)

These equations can be written in somewhat simpler terms by letting

Using the law of cosines, then Eqs (35) and (36) become
(37) (38)
where *θ*_{xy} is the angle between **x** and **y** (with similar definitions for the other angles). An analytical formula for the solution of these equations is not apparent, but it is a simple matter to compute the solution numerically.

As an example, for the plane with *α* = −10 and *β* = 6, 40 randomized points within a distance of 0.15 of the plane were used for the data. The location of the minimizer of *E* was found by solving Eqs (35) and (36) using MATLAB’s *fminsearch* command, and the location is shown in Fig 5. Also shown are the locations of the minimizers for *E*_{x}, *E*_{y}, and *E*_{z}, and the associated triangular region formed by these three points. For comparison, the location of the solution as determined using an unscaled PCA is shown. As expected from Eq (27), the of *E* minimizer is located near the centroid of the triangle. To quantify the difference, using the formula in Eq (18), *C*_{F} ≈ 0.02. Finally, to demonstrate the convexity of the error function in the octant containing the minimizer, the contour and surface are shown in Fig 6.

Vertices of the simplex are minimizers using individual coordinate projections *E*_{x}, *E*_{y}, and *E*_{z}.

## Application to real world data sets

What follows are applications of the hyperplane function to various real world data sets. In the process, comparisons are made with a PCA. Also, three of the data sets have been used in other studies to compare multiple linear regression methods, and this is discussed in the respective application.

### Crime data

As an illustration, and a comparison with a PCA, consider the data for the seven major crime rates and population for the larger cities in the U.S in 2009 [15]. Of the 105 cities reported, 79 were randomly chosen for the training set, and the testing dataset consists of those left out. With this, *n* = 79 and *m* = 8. The minimizer of Eq (20) was found using a modified Polak-Ribière descent procedure, with Armijo’s method used to solve the line search problem. The modification is that if the search direction determined using Polak-Ribière is not a direction of descent, then the steepest descent direction is used instead. A description of the Polak-Ribière and Armijo’s methods can be found in [5, 16]. The starting point for the descent procedure was a convex combination of the minimizers for the *E*_{j}’s, which was computed using the formulas given earlier. Also, the normalization for the PCA uses , for each column *j* of the centered training data matrix.

The resulting testing versus training comparison for each variable is shown in Fig 7. For Eq (20), the *α*_{j}’s are determined by minimizing *E* and then using those same values for each graph. For example, for the graph associated with *p*_{1}, the model function is rewritten as
(39)
and the resulting training-testing values are plotted. For the PCA, a seven component approximation is made. To make a more quantitative comparison between these two approaches, and because both methods are reflectively invariant, the normalized least squares errors for each of these graphs is given in Table 1. The normalization used is *N*||**p**_{j}||_{1}/*n*, where *n* is the number of values in the training set and *N* the number in the testing set. It is seen, at least in this example, that the values using Eq (20) produce a uniformly better result than those obtained using a PCA. To make the point that the fit using a PCA varies with the scaling, the errors using two other scalings are also given in Table 1. The scaling used for PCA_{1} is considerably worse than the PCA values, while the PCA_{2} values in the last column are distinctly better.

Shown are the values obtained using the error function in Eq (20), indicated with the o’s, and using a PCA (x). The dashed line in each graph corresponds when the training and testing data are equal.

The PCA_{1} and PCA_{2} columns are the errors for a PCA using two different column scalings.

### Wine quality data

As a second example, data mining has been used to predict tasting preferences for wine [17]. The dataset for red wine consists of 1599 instances (vectors) containing values for 12 attributes, 11 being physicochemical properties of the wine and one the quality score for taste. The physicochemical properties in this case are: fixed acidity (*α*_{1}), volatile acidity (*α*_{2}), citric acid (*α*_{3}), residual sugar (*α*_{4}), chlorides (*α*_{5}), free sulfur dioxide (*α*_{6}), total sulfur dioxide (*α*_{7}), density (*α*_{8}), pH (*α*_{9}), sulphates (*α*_{10}), and alcohol (*α*_{11}). Following the protocol used in [17], the training set consists of 2/3 of the original (randomly chosen), and the testing set the remaining 1/3. They also used a regression error characteristic (REC) curve, which is defined in [18], to evaluate how well various regression procedures predict the taste score. To compare how Eq (20) does, using the training set, the minimizer is computed with the modified Polak-Ribière method described earlier, treating the data as defining a smooth function. The resulting REC curve determined using Eq (20) is shown in Fig 8. Also shown are the curves obtained using a PCA as well as from using a standard multivariable least squares regression. The dashed curve is what is obtained using Eq (20) if the predicted values are converted back into integer scores as originally used in [17]. Finally, for the two and three dimensional examples considered earlier, it was found that the centroid of the error simplex furnished a fairly accurate approximation for the minimizer. A measure of how close they are is given in Eq (18). Generalizing this formula for the wine example, it is found that *C*_{F} ≈ 0.15, which indicates they are reasonable close.

The dashed red curve is the curve using Eq (20) after converting back to the original integer scale reported in [17].

It is evident from Fig 8 that standard least squares provides better predictive values for the taste score than when using Eq (20). This can be quantified using the mean absolute deviation (MAD), which is defined as ||**y** − **y***||_{1}/*N*, where **y*** are the test values, **y** are the predicted values, and *N* is the number of observations in the testing set (note that the values are unscaled). For Eq (20) the MAD value is 1.5, while using least squares it is 0.5. Consequently, if given a particular bottle of red wine, least squares would provide a better predictor of how it tastes. What Eq (20) provides is a better model for how to modify the physicochemical properties to achieve a particular taste score, and the reason is reflective invariance. To illustrate, with the coefficients computed previously, the resulting MAD values for the other possible training-testing cases are given in Table 2. The fact that the least squares values are so poor is not surprising, and the reason was given earlier. Namely, the values obtained using *p*_{m} as the dependent variable are not equivalent to the values obtained if the model equation is solved for, say, *p*_{1}, and then considering it as the dependent variable.

The last column is the ratio of the least squares value to the one obtained using Eq (20).

### Wave height data

Wave height at Buoy Station 46006, located in the northern Pacific Ocean, is measured hourly, along with the wind direction, wind speed, wind gust, dominant wave period, average wave period, barometric pressure, and water temperature [19]. The specific data considered here come from measurements over approximately 11 months. The dataset consists of 7960 instances containing values for the eight attributes. A reduced version of this dataset was examined in [20], in comparing how various regression procedures do on real ecological data. The reduction was to use the daily average values rather than the hourly data, but all of the values are considered here. Following the protocol used in [20], the training set consists of 3/4 of the original (randomly chosen), and the testing set the remaining 1/4. The comparison was made using the mean squared prediction error (MSPE), which is defined as , where **y*** are the test values for the wave height, **y** are the predicted values, and *n* is the number of observations in the testing set (note that the values are unscaled). Using Eq (20), the MSPE is about 1.2, which compares favorability with the mean MSPE of 12.3 found in [20]. The resulting MAD value is about 0.8, and the REC curve is shown in Fig 9. Finally, for this example, *C*_{F} ≈ 0.16. This indicates that the centroid of the error simplex and the minimizer are close, although not as close as in the earlier examples.

### Chemical reaction data

The chemical oscillator known as the Belousov-Zhabotinskii reaction can be described with the following five reactions [21]

The chemicals here are bromous acid (X), bromide (Y), cerium-4 (Z), bromate (A), and a product P. Given known initial concentrations for each species, measurements are made at later times to follow the evolution of the overall reaction. The complication is that one or more of these reactions are very fast, or occur at very low concentrations, so accurate measurements are difficult. What is demonstrated here is how the hyperplane fit can be used in the case of when four of the five concentrations are measured, and the fifth is determined using regression. What is important is that no matter which four are chosen, that the same (equivalent) value is obtained for the fifth species. To demonstrate, the reactions were run for 20 different initial concentrations, and the values for the five species were recorded at 60 second intervals up to 10 minutes. The resulting dataset consists of 180 instances, and these were randomly split into a training set (3/4) and a testing set (1/4). The values fitted are those for *Z*. Using Eq (20), the MAD value is 0.015 and the MSPE is 4.6 × 10^{−4}. Using a standard least squares fit the values are 0.012 and 3 × 10^{−4}, respectively. However, if you fit the values for, say, *A*, and then transform back to the equation for *Z*, then the MAD and MSPE values using Eq (20) remain unchanged while the least squares values are 0.024 and 10^{−3}. As a final comment, using a reflectively invariant error function with chemical density fits has the distinct advantage of being able to determine possible conservation laws inherent in the system.

### Chlorophyll-*a* data

The link between phytoplankton and water chemistry has been the subject of several recent studies, although the connections with specific chemical species are incompletely understood. Several correlations have been made, and the physicochemical parameters most commonly considered are oxygen, pH, NH_{4}-N (ammonium nitrogen), NO_{3}-N (nitrate nitrogen), and PO_{4}-P (phosphate phosphorus) [20, 22–24]. The latter study used the values for the Chlorophyll-*a* density, along with various chemical properties, of lakes in the Northeast that were measured over a four year period [25]. Altogether, there are 20 useable variables in this study, and 500 observations. After removing incomplete entries and others that are not useable, the data set consists of 348 observations. The model requires specification of which variables to use, and after some analysis of the data it comes down to five: total dissolved aluminum, nitrate, total nitrogen, total phosphorous, total suspended solids, and turbidity. In comparison, in [20], 10 variables were initially used. Using Eq (20), the resulting MSPE is about 1.1, which compares favorability with the mean MSPE of 2.4 found in [20]. The resulting normalized MAD value is about 0.7, and the REC curve is shown in Fig 10. Finally, for this example, *C*_{F} ≈ 0.13. This indicates that the centroid of the error simplex and the minimizer are close, similar to what was found for the wave height example.

## Other dimensional fits

It is possible to write down the formulas for the other cases, but it is more informative to consider a particular case that illustrates how this is done. So, consider the case of when there are four variables, *x*, *y*, *z*, and *w*. There are three subspace approximations possible, corresponding to one, two and three dimensions. The one and three dimensional cases were discussed above, and so only the two dimensional case is considered. Writing the model functions as *z* = *α*_{11}*x* + *α*_{12}*y*, *w* = *α*_{21}*x* + *α*_{22}*y*. This can be written in matrix form as

It is possible to rewrite the above equation in five different, but equivalent, ways. For example, if *z* and *w* are taken to be independent then
while if *y* and *z* are taken to be independent then

The entrees in the above matrices are known expressions involving the original coefficients *α*_{11}, *α*_{12}, *α*_{21}, and *α*_{22}. For example,

The error function corresponding to the variable *x* is

The three terms in the above sum correspond to the case when *x* is taken to be dependent. In a similar manner,

## Concluding remarks

The principal conclusions from this study are:

- Scale and rotational invariance of the error function are incompatible.
- Using the geometric mean of the ordinary least squares error functions, one obtains an error function which is scale and reflectively invariant, and which is easily extendable to low dimensional approximations for multilinear regression. For the two cases worked out, which correspond to a line and hyperplane approximation, the minimizer can be well approximated using the centroid of the error simplex obtained from the minimizers for the ordinary least squares error functions.

Because the error function used here is not quadratic, finding the minimizer requires using a nonlinear optimization procedure. The result is that more computational time is needed than for linear least squares. How much time depends on the size of the data matrix, and the number of variables involved. For the wine example considered earlier (12 variables and 1599 data vectors), linear least squares using MATLAB’s *mldivide* routine takes about 1 msec, while the nonlinear optimization procedure takes about 100 msec (using a 2017 iMac). So, although the relative time is fairly large, the actual time is small. The proposed error function does have the advantage of having a warm start, which is the centroid approximation, and this helps reduce the computing time.

In terms of future work, it remains to determine if the centroid approximation applies to the other lower dimensional approximations. Also, there is the question as to the sensitivity of the minimizer to outliers in the training set. This is a problem for a PCA, and one approach to improve the robustness of a PCA is to switch from a *ℓ*_{2}-norm to a *ℓ*_{1}-norm [26, 27], a *ℓ*_{p}-norm [28], or a norm based on a generalized mean [29]. These are used, in part, because they preserve the rotational invariance of a PCA. It is straightforward to use these norms with the geometric mean function, and this will not affect its invariance properties. What has not been investigated is the sensitivity of the proposed error function to outliers, or whether switching norms might reduce any potential sensitivities.

## References

- 1. Belhumeur P, Hespanha J, Kriegman D. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Trans Pattern Anal Mach Intell. 1997;19(7):711–720.
- 2. Turk M, Pentland A. Eigenfaces for Recognition. J Cogn Neurosci. 1991;3(1):71–86. pmid:23964806
- 3. Ross DA, Lim J, Lin RS, Yang MH. Incremental Learning for Robust Visual Tracking. Int J Comput Vis. 2008;77(1):125–141.
- 4. Ramsey F, Schafer D. The Statistical Sleuth: A Course in Methods of Data Analysis. Cengage Learning; 2012.
- 5.
Holmes MH. Introduction to Scientific Computing and Data Analysis. New York: Springer; 2016.
- 6.
Henderson C, Izquierdo E. Multi-scale reflection invariance. In: 2016 SAI Computing Conference; 2016. p. 420–425.
- 7. Tofallis C. Multiple Neutral Data Fitting. Annals of Operations Research. 2003;124(1):69–79.
- 8.
Goodman T, Tofallis C. Neutral Data Fitting by Lines and Planes. In: Iske A, Levesley J, editors. Algorithms for Approximation. Springer Berlin Heidelberg; 2007. p. 259–268.
- 9. Woolley EB. The Method of Minimized Areas as a Basis for Correlation Analysis. Econometrica. 1941;9(1):38–62.
- 10. Samuelson PA. A Note on Alternative Regressions. Econometrica. 1942;10(1):80–83.
- 11. Babu GJ, Feigelson ED. Analytical and Monte Carlo comparisons of six different linear least squares fits. Commun Stat Simul Comput. 1992;21:533–549.
- 12. Ludbrook J. Linear regression analysis for comparing two measurers or methods of measurement: But which regression? Clin Exp Pharmacol Physiol. 2010;37(7):692–699. pmid:20337658
- 13. Draper NR, Yang Y. Generalization of the geometric mean functional relationship. Comput Statist Data Anal. 1997;23(3):355–372. http://dx.doi.org/10.1016/S0167-9473(96)00037-0
- 14.
Bluman G, Cole J. Similarity Methods for Differential Equations. New York: Springer-Verlag; 1974.
- 15.
U S Census Bureau. Crime Rates by Type: Selected Large Cities. In: Statistical Abstract of the United States: 2012. 131st ed. Washington, DC; 2012.
- 16.
Nocedal J, Wright S. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer; 2006.
- 17. Cortez P, Cerdeira A, Almeida F, Matos T, Reis J. Modeling wine preferences by data mining from physicochemical properties. Decis Support Syst. 2009;47(4):547–553. http://dx.doi.org/10.1016/j.dss.2009.05.016
- 18.
Bi J, Bennett K. Regression error characteristic curves. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003). Washington DC; 2003. p. 43–50.
- 19.
NOAA. National Data Buoy Center: Historical Meteorological Data—Station 46006; 2006.
- 20. Murtaugh PA. Performance of several variable-selection methods applied to real ecological data. Ecol Lett. 2009;12(10):1061–1068. pmid:19702634
- 21. Field RJ, Koros E, Noyes RM. Oscillations in chemical systems. II. Thorough analysis of temporal oscillation in the bromate-cerium-malonic acid system. J Amer Chem Soc. 1972;94:8649–8664.
- 22. Heini A, Puustinen I, Tikka M, Jokiniemi A, Leppäranta M, Arvola L. Strong dependence between phytoplankton and water chemistry in a large temperate lake: spatial and temporal perspective. Hydrobiologia. 2014;731(1):139–150.
- 23. George B, Kumar JIN, Kumar RN. Study on the influence of hydro-chemical parameters on phytoplankton distribution along Tapi estuarine area of Gulf of Khambhat, India. The Egyptian Journal of Aquatic Research. 2012;38(3):157–170. http://dx.doi.org/10.1016/j.ejar.2012.12.010
- 24. Liu Y, Guo H, Yang P. Exploring the influence of lake water chemistry on chlorophyll a: A multivariate statistical model analysis. Ecol Modell. 2010;221(4):681–688. http://dx.doi.org/10.1016/j.ecolmodel.2009.03.010
- 25.
Stoddard J. EMAP Surface Waters Lake Database: 1991-1994 Northeast Lakes Water Chemistry Data Summarized by Lake. U.S. EPA NHEERL Western Ecology Division; 1996.
- 26.
Ding CHQ, Zhou D, He X, Zha H.
*R*_{1}-PCA: rotational invariant*L*_{1}-norm principal component analysis for robust subspace factorization. In: Proceedings of the 23rd International Conference on Machine Learning, ICML’06; 2006. p. 281–299. - 27. Kwak N. Principal Component Analysis Based on L1-Norm Maximization. IEEE Trans Pattern Anal Mach Intell. 2008;30(9):1672–1680. pmid:18617723
- 28.
Kwak N. Principal Component Analysis by
*L*_{p}-Norm Maximization. IEEE Trans Cybern. 2014;44(5):594–609. pmid:23807479 - 29. Oh J, Kwak N. Generalized mean for robust principal component analysis. Pattern Recognit. 2016;54:116–127. https://doi.org/10.1016/j.patcog.2016.01.002