Variance Inflation Factor-Based Forward-Selection 2 Method for Water-Quality Estimation via Combining 3 Landsat TM , ETM + , and OLI / TIRS Images and 4 Ancillary Environmental Data

A simple approach to enable water-management agencies employing free data to achieve 16 the goal of using a single set of predictive equations for water-quality retrievals with satisfactory 17 accuracy is proposed. Multiple regression-derived equations based on surface reflectance, band 18 ratios, and environmental factors as predictor variables for concentrations of Total Suspended 19 Solids (TSS), Total Nitrogen (TN), and Total Phosphorus (TP) were derived using a hybrid 20 forward-selection method that considers Variance Inflation Factor (VIF) in the forward-selection 21 process. Landsat TM, ETM+, and OLI/TIRS images were jointly utilized with environmental 22 factors, such as wind speed and water surface temperature, to derive the single set of equations. 23 The coefficients of determination of the best-fitting resultant equations varied from 0.62 to 0.79. 24 Among all chosen predictor variables, ratio of reflectance of visible red (Band 3 for Landsat TM and 25 ETM+, or Band 4 for Landsat OLI/TIRS) to visible blue (Band 1 for Landsat TM and ETM+, or Band 26 2 for Landsat OLI/TIRS) has a strong influence on the predictive power for TSS retrieval. 27 Environmental factors including wind speed, remote sensing-derived water surface temperature, 28 solar altitude, and time difference (in days) between the image acquisition and water sampling 29 were found important in water-quality parameter estimation. 30


Introduction
Continuous monitoring of water quality is essential for the health and welfare of the people and ecosystems reliant upon them.Urbanization, agriculture, and other anthropogenic factors can alter water quality [1], and waiting to remediate until a change is clearly visible can be much more costly than early prevention.Despite this, the cost of adequate temporal and spatial physical measurements can potentially be prohibitive [2].For example, the United States Geological Survey (USGS) regularly monitors water quality in Lady Bird Lake in Austin, Texas, USA; however, the frequency is only approximately twice per year at a single point near the outlet over the past decade [3].Additionally, in situ measurements from year to year do not occur in the same months.As a result, it is difficult to distinguish whether a change in the water quality measured at a point is truly a long-term change or the result of a seasonal difference or recent event (e.g., a large precipitation event) [4].Additionally, it is impossible to evaluate the spatial variation in water quality from single-point measurements.
In recent decades, remote sensing has provided an alternative method for monitoring water quality in a spatially synoptic manner at a lower cost compared with extensive in situ measurement.
Each water-column constituent exhibits a specific spectral response that can be observed by satelliteand aircraft-mounted remote sensors [5].Suspended sediment usually exhibits strong backscattering of incident light [5], where the actual color depends on the terrestrial origin [6].Colored dissolved organic matter (CDOM) is composed of algae, yellow substances, and organic plumes [5], and entails a broad-band solar-induced fluorescence over 490-530 nm [6].Phytoplankton exhibits a volume reflectance (and water-leaving radiance) peak due to chlorophyll-a, with a well-defined Gaussian distribution around 685 nm [6].
For a particular wavelength, λ, the spectral radiance from the water observed vertically, known as the upwelling radiance, Lu, is given by where is the radiance reflected/backscattered by the water column, in-water constituents, and the bottom if the water column is optically shallow; is the skylight radiance; and Ω is the ratio of radiance directly reflected by the water surface to [7].Note that the radiance observed by a satellite is composed of , plus atmospheric interference; therefore, it requires atmospheric correction (discussed below)., , and Ω are influenced by a variety of factors.If the water column is sufficiently deep, bottom reflectance may be ignored, and can be assumed to be a measure of the effects of water-column constituents alone.Atmospheric conditions (e.g., clear, cloudy, overcast) affect both Ω and , whereas Ω can be further affected by wind speed in the form of surface ripples [7].Wind speed has also been found to have some influence on water clarity [8].
Because of their higher capability to penetrate the water column, visible bands have conventionally been used to estimate water quality [5].In addition, infrared bands have also shown significance in determining water-quality parameters in some studies [9,10].However, only near infrared wavelengths were used in these studies.Thermal infrared bands have not extensively been used in water-quality estimation.
Site-specific predictive models can be created to relate a number of band radiance measurements or derived reflectance values [5] to the water-quality parameter of interest by fitting the model to in situ water-quality measurements.Multiple regression analysis and artificial neural networks (ANNs) constitute two methods that are frequently used to generate such predictive models [5,10,11,12].
In academia, satellite remote-sensing images have been increasingly available for water-quality determination.However, the popularity of this approach has not been extended to decision making by management agencies in general [13].According to Schaeffer et al. [13], the reasons for this phenomenon include cost, product accuracy, data continuity, and programmatic support.
Cost is always a major constraint, as many water-management agencies have limited budgets [13].Even though there are many free remote-sensing data sets available, such as the multispectral satellite images available from the Landsat program (e.g., Landsat Thematic Mapper (TM), Enhanced Thematic Mapper Plus (ETM+), and Operational Land Imager (OLI)/ Thermal Infrared Sensor (TIRS)) [14], MODIS [15], SeaWiFS [16], etc., the selection of images is predominately limited to moderate spatial-resolution images from the Landsat program, for example, for terrestrial pond/lake applications due to the finer spatial resolution of those data relative to other free remote-sensor image sources and the relatively small sizes/spatial extents of such features.Another aspect of the cost constraint is the cost to collect field water-sampling data, as the creation of empirical predictive models necessitates in situ water-quality data.Sometimes, due to cost, logistical, and other constraints, that means that a water-management agency can only resort to free water-quality data, such as those made available by the USGS.The downside, as noted above, is that spatio-temporal sampling density/data availability may be low.This drawback seriously limits the ability of a water-management agency to utilize free Landsat program data, for example, as the basis of a water-quality monitoring program since the satellite images and corresponding in situ measurements must be acquired in a temporally proximal manner [17].
As a result, water-management agencies that resort to only using free remote-sensing resources often only have access to a limited number of useable satellite images for water-quality monitoring.
Such a scenario often leads to the use of a single predictive model to determine water-quality information from satellite images.Nevertheless, many studies divide their analyses by season [18,19] due to systemic seasonal differences in factors such as concentrations of color-producing substances (including phytoplankton), atmospheric disturbances [19], and solar zenith angle [20].Some studies have shown that the predictive power of equations created without distinguishing by season is lower than it otherwise would be [21,22].
Since the derived predictive equation is seasonally affected by the environment, a few studies have incorporated the influencing factors into predictive equation generation.One example is with the estimation of chlorophyll-a concentration.It has been known that phytoplankton growth is statistically significantly dependent on water temperature [23,24].Incorporating water temperature (derived from the satellite remote-sensor thermal band) in development of predictive equations has proven to be helpful in determining chlorophyll-a concentration [25].However, this approach has not been investigated extensively.In this study, we consider additional environmental factors based on energy fluxes between a waterbody and the atmosphere.We posit that including these environmental factors in predictive equations not only increases prediction accuracy, but also facilitates the usage of a single set of predictive equations throughout different seasons.The direct benefit is that one can pool all observation data in creating equations, thus resulting in higher predictive power.
Programmatic support is also important to water-management agencies, according to Schaeffer et al. [13].In most cases, local universities should be sufficient in providing support to water-management agencies.However, we posit that the methodology adopted for generating predictive models should entail model construction in a stepwise manner, such that most people with basic training could implement such methods can follow without much difficulty.For this reason, in choosing methodology implemented by water-management agencies, simple and well-understood methods such as multiple regressions should be weighed over more complex methods, such as ANNs.
Product accuracy is another major concern expressed by the water-management agencies [13].
Even though water-management agencies could utilize predictive models from peer-reviewed journals, such models may not yield high-accuracy estimates in a given application.Multiple regression analysis has been employed in many studies for its ease of application.However, for applications using this method, overfitting from multicollinearity can be a serious concern.
Multicollinearity means that some of the explanatory variables in the multiple regression model are dependent on one another.The direct result from multicollinearity is that the standard error of coefficients of explanatory variables is inflated, which means that coefficients of the derived model are not reliable.Unfortunately, many past studies neither discussed the issue of multicollinearity, nor provided results of validation of the derived regression models [4,5,9,17,26,27,28,29].A common way to identify multicollinearity of a model is through the usage of indicators such as Akaike's Information Criteria [30], Mallow's Cp [31], PRESS [32], etc.However, such indicators apply to the whole model so all possible subsets of explanatory variables must be examined, and this approach becomes unattainable when the number of variables increases [33].
Other popular methods to identify multicollinearity include the deployment of a principal component analysis (PCA) or structural equation modeling (SEM) [33].PCA creates orthogonal principal components, which are linear combination of variables, and a regression model can be created based on the orthogonal components in order to eliminate multicollinearity completely.
Some studies show, however, that this methodology can result in a loss in explanatory power.
Additionally, the main limitation of the PCA approach is rooted in the physical interpretation of the principal components.On the other hand, SEM accepts the existence of collinearity among explanatory variables and hypothesizes that a model exists among variables.Then all possible combinations of causal links among variables are tested against the hypothesized model.Since SEM is not an exploratory technique, SEM is prone to inferential errors made during development and selection of the hypothetical models [33].
We propose utilizing the variation inflation factor (VIF) to minimize multicollinearity.Unlike other indicators described above, VIF is calculated for each predictor variable.VIF has been used in the field of remote sensing on a limited basis to check multicollinearity of results [34; 35] Based on the gaps in the research literature illustrated above, the objectives of this study were: 1. Incorporate environmental factors (such as temperature, wind speed, etc.) into a single set of predictive equations for remote-sensing water-quality parameter estimation; and 2. Increase model predictive power for a limnological water-quality parameter-estimation application by considering the effect of multicollinearity in model creation.
The goal of this study is to address all four concerns of utilizing satellite data in decision making by water-management agencies-i.e., cost, product accuracy, data continuity, and programmatic support.This study provides water-management agencies with a simple, easy-to-follow methodology for utilizing free observation data (from Landsat program, USGS, etc.) in order to address cost and programmatic-support issues for water-quality monitoring.The Landsat program guarantees long-term data continuity.The proposed methodology provides a single set of predictive equations; accuracy is maintained because all available data are consolidated for the creation of a single model.Also, consideration of multicollinearity increases the likelihood for acceptable estimation accuracy of the derived model in future water-quality parameter retrieval applications.

Study Area
The population of City of Austin, Texas, USA has increased dramatically in recent decades, from 252,000 in 1970 to 926,000 in 2016 [42].With significant population growth comes an increase in impervious area, higher runoff and lower water quality in local water bodies.Lady Bird Lake (formerly Town Lake), situated near the city center, provides an opportunity to remotely monitor water quality in an urban watershed (Figure 1).The lake, formed by damming the Colorado River, is  The USGS maintains a number of water-quality sampling stations on Lady Bird Lake, but only four of them, EC, DC, CC and AC (Figure 1), monitor the water-quality constituents of interest in this study within the time frame of available satellite images (i.e., 1983-2015) [3].Table 2 provides basic information for these four sampling stations, including summary statistics for these water-quality quantities of interest-total suspended solids (TSS), total nitrogen (TN), and total phosphorus (TP)-derived from water-quality samples collected at a depth of 1 m.Secchi disc transparency, a pseudo-measure of turbidity, was measured in four locations when the samples of Table 2 were taken (Table 3).Secchi disc depths were much shallower than the average bottom depth of the lake (6 m); thus, bottom reflection is not observable from above the air-water interface for these cases.Therefore, contribution of bottom reflectance to the water-leaving radiance (Equation 1) can be ignored.

Selection of Satellite Images
Selection of Landsat TM, ETM+, and OLI + TIRS images [45] was based on several criteria.
Images selected were cloud-free and were acquired within seven days of in situ water-quality measurements in Lady Bird Lake [10,17].In order to minimize the effects of spatio-temporally-close rainfall events, only images that entailed daily precipitation depths less than 1.25 cm (0.5 inch) observed between the dates of the selected images and their associated water-sampling dates ( Surface reflectance values, ρ, corrected for path radiance, were derived using Fast Line-of-sight Atmospheric Analysis of Spectral Hypercube (FLAASH®) radiative transfer model [48,49].Pixel values were converted from a digital number (DN) to spectral radiance (L) following the Landsat Data User Handbook [41], and then corrected to surface reflectance using FLAASH.Note that where L is the spectral radiance observed by the sensor; ρ is the "correct" surface reflectance for the pixel of interest; ρe is the average surface reflectance from the pixel of interest and the surrounding region; S is the spherical albedo of the atmosphere; Lα is the path radiance backscattered by the atmosphere; and A and B are coefficients dependent upon atmospheric and geometric conditions.The distinction between ρ and ρe accounts for adjacency/spatial-mixing effects [48,49].

Image Pre-processing
Because water bodies such as Lady Bird Lake are often generally spectrally dark targets [50], remote-sensing reflectance from such areas is usually lower than the surrounding urban areas.With FLAASH, significant errors can occur when strong albedo contrasts exist among the materials in the scene [49].To minimize this potential problem, a land mask was created and applied in order to exclude all surrounding land regions [51], leaving just the aquatic areas (i.e., Lady Bird Lake) for subsequent atmospheric-correction processing.

Determination of FLAASH Parameter Values
Two of the parameters required by FLAASH are: visibility and choice of atmospheric model.
Visibility obtained from historical airport records [52] caused FLAASH to over-compensate in its correction of atmospheric effects and yield negative reflectance values.Therefore, the 2-band (K-T) aerosol retrieval method [49] with "urban" setting was used to estimate visibility.Ideally, selection of an atmospheric model is based on one of the following options, presented in order from most preferred to least preferred: known standard column water vapor amount, expected surface air temperature, or tabulated seasonal-latitude combinations [49].Although there are atmospheric water-content products available [53]  FLAASH should not be applied to thermal bands [49]; therefore, another atmospheric-correction method was applied to thermal bands.In particular, the single-band atmospheric-correction method described by Barsi et al. [54] was used.The methodology calculates atmospheric transmission and path radiance using MODTRAN [49], based on the atmospheric profiles generated by National Centers for Environmental Prediction (NCEP).Equation 3provides the relationship between top-of-atmosphere radiance (LTOA), the target radiance of kinetic temperature T (LT), the path (upwelling) radiance (Lu), and the sky (downwelling) radiance (Ld): In Equation 3, atmospheric transmission τ, path radiance Lu, and sky radiance Ld were obtained from the on-line calculator based on the atmospheric correction method of Barsi et al. [54].Since water is a near-perfect blackbody, emissivity (ε) was set as 1 in this study.Emissivity and transmission are unitless, whereas radiance values are in units of W/m2•sr•μm.
The atmospheric profiles are only available after January 2000.For satellite images acquired prior to that, atmospheric profiles from "surrogate dates" in 2000 were used in this study.The surrogate date has nearly identical daily precipitation, temperature, and wind speed as the satellite image date.By choosing a surrogate date in such a manner, the atmospheric condition of the actual satellite image date and the surrogate date are expected to be similar.If more than two surrogate dates were found based on the above criteria for one satellite image, the one that is temporally closest to the date in the year in which a given the satellite image was acquired was chosen.Table 6 provides the list of the satellite image dates, the corresponding surrogate dates, and daily meteorological parameters for both of them.Target temperature (i.e., water surface temperature) was then derived after atmospheric correction according to equations provided in the Landsat Data User Manual [41].For Landsat ETM+, the low-gain channel was used because the signal reflected from water-column constituents entail low signal strength.For Landsat TIRS, only band 10 was used because data from band 11 have been contaminated by a stray-light effect, and a remedy has not yet been found [55].Bands 10 and 11 here are band numbering from Landsat TIRS.

Post-processing for Atmospherically-Corrected Surface Reflectance
Surface reflectance values at the water-quality stations were extracted from the FLAASH-corrected satellite images.Pixels located at the exact coordinates of the respective water-quality sampling stations are not necessarily the ideal pixels for which reflectance values should be extracted.Reasons for this include the geometric-offset error between the map coordinates of a pixel and actual corresponding in situ sampling planimetric locations; random surface debris; light unpredictably scattered or reflected into the instantaneous field-of-view (IFOV) of the sensor or onto the aquatic area of interest [56]; and optically shallow water near the sampling stations, yielding potentially confounding issues associated with bottom reflectance.To compensate for this, the search range was expanded to 90 m (i.e., a search neighborhood comprised of 3 x 3 image pixels, centered around the pixel located at the station coordinates).The pixel within this zone with the lowest value in band 5 was considered to contain the most information regarding water-column constituents [57].If two pixels had the same band 5 values, the pixel closest to the coordinates of water-quality sampling location was selected.

Multiple Regression Analysis
Multiple regression equations were derived to predict constituent concentrations (TSS, TN, and TP, i.e. the dependent variables) from the predictor variables, such as band reflectance.The procedure for selection of predictor variables is delineated below.
The spectral bands and associated band ratios were all chosen as candidates for independent variables.Band ratios were included as independent variables in the regression analysis [10] because they are less apt to be influenced by lighting conditions [56].
Radiance data from the thermal bands (band 6 of Landsat TM and ETM+, and band 10 of Landsat TIRS) were converted to water surface temperature.As discussed, water temperature has been found to be related to phytoplankton concentration [23,24], and thus, related to water quality [58].However, in this study, most of the satellite image dates differ by several days compared with the closest corresponding actual water-quality sampling date; thus, the water surface temperature derived from the satellite images does not represent the actual water temperature at the time of water sampling.
Equation 4 considers the net energy fluxes between a waterbody and the atmosphere [59]: where NET is the net energy flux, SWRnet indicates the net short-wave radiation energy flux (Equation 5), LWRnet indicates the net long-wave radiation flux (Equations 6 and 7), LHF is the latent heat flux (Equation 8), and SHF is the sensible heat flux (Equation 9).These terms are calculated by the following equations [59]: and ≈ 1.61(1 − + 0.0019 ) where is the surface albedo (usually very low for water so SWRnet ≈ SWRdown), ε is the surface emissivity (≈ 0.97), σ is the Stefan-Bolzman constant, Ts is the water surface temperature, Ta is the air temperature, ea is the surface vapor pressure, C is the cloud cover index (Equation 7), SWRcs is the clear-sky short wave radiation, n is the noon solar altitude, is the density of air, Le is the latent heat of evaporation, Ce is the turbulent exchange coefficient for latent heat, U is the wind speed, Qs and Qa are saturation specific humidity at the surface and at near-surface atmosphere, respectively, and Ch is the turbulent exchange coefficient for sensible heat.
Some of the variables in Equations 5 to 9 are known or can be reasonably assumed as constants (such as , ε, σ, , Le, Ce, and Ch [60]).The surface vapor pressure, ea, is dependent on water surface temperature [61].Qs and Qa are both dependent on temperature as well [62].The air temperature and noon solar altitude (Ta, and n respectively) can be obtained from the historical observation record.The water surface temperature Ts is obtained from thermal band data.That leaves only one variable unknown, which is the clear-sky short wave radiation SWRcs.Calculating SWRcs involves a complex procedure [63] so it is difficult to associate it with distinct environmental factor(s); thus, we did not consider it in evaluating heat flux in this study.
Assuming that the temperature change between the image date and the water-sampling date directly corresponds with the cumulative heat flux between the dates, the following variables are needed in order to account for the temperature change between the image-acquisition date and the water-sampling date [52]: 1. Time offset (in days) between the image date and the water-quality sampling date (positive offset means that the image date is later than the sampling date); 2. Water surface temperature (in K) derived from the thermal band; Instantaneous temperature and wind speed were interpolated from the hourly historical data [52].And further considering Equations 5 to 9, the full list of variables considered in the multiple regression process is provided in Table 7.A look-up table between variable abbreviations and variable descriptions is provided as Table 8.As described above, in this study, the band number is based on band-numbering scheme for TM and ETM+.Selection of predictor variables is based on a hybrid forward selection that considers the variation inflation factor (VIF).In conventional forward selection, variables are added to the regression one at a time, starting with no predictor variables being selected.The p-value threshold includes a predictor in the regression equation if its p-value is below a "probability to enter," and includes a predictor that will most improve the fit first (i.e., "forward").A default value of 0.25 in JMP [64] was used for "probability to enter." In addition to p-value, the variation inflation factor (VIF) was used to minimize multicollinearity of the model.Multicollinearity occurs when a predictor variable is a linear combination of other predictor variables in the model.The direct consequence of multicollinearity is that the error variance is inflated, which may result in low prediction power if the overfitted model is used with a new set of data.VIF is calculated as: where is the multiple coefficient of determination between the j-th predictor variable of interest and the rest of the predictor variables.The rule of thumb to avoid serious multicollinearity is that all chosen predictor variables should have VIF less than 10 [65].Unlike other criteria such as Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Mallow's Cp, VIF is generated for each predictor variable.Also, VIF has a suggested absolute criterion, whereas other criteria (AIC, BIC, Cp, etc.) provide only relative comparison between models.
We propose a novel approach that considers VIF while adding variables in forward selection.
When a variable is added according to the rules of forward selection, VIFs of all included variables (including the one that is just added) are also checked.If VIFs are all below the threshold of 10, the newly-added variable is allowed, and the next variable is chosen according to the rule of forward selection.However, if any VIF is found to be larger than the threshold for any of the variables, the most recently-added variable is deleted and the selection procedure stops.Coefficients of variables, p-values, and VIF are dynamically recalculated when any variable is deleted from the model.The procedure is illustrated in Figure 2. The package CARET in the software R [67] was used to perform validation by bootstrapping and LOOCV.For bootstrapping, 1000 trials were specified.

Results
The best-fitting regression equations chosen by the hybrid forward selection for each water-quality constituent (TSS, TN and TP) are provided in Table 9.The results in Table 9 include the predictor variables, importance of the predictor variable, associated regression coefficients and standard error, 95% confidence intervals for the regression coefficients, p-values, and VIF values for each of the response variables (TSS, TN, and TP).The importance values are calculated by dividing the change in R 2 when the variable of interest is dropped from the model by the overall R 2 when the variable of interest is included [68].The sum of importance values of all variables does not equal to 1 since the importance is relative only.Note that the response variables are transformed to obtain a better regression fit, and the band-numbering convention is based on TM and ETM+ band numbers, as discussed previously.
Table 9. Best fitting multiple regression models for TSS, TN and TP using the hybrid forward selection considering VIF.The resulting multiple regression-based models are provided in Equations 11 to 13:

Confidence
Plots of the observed versus predicted concentrations of TSS, TN, and TP are plotted in Figures 3, 4, and 5, respectively.The 1:1 line is added to all three figures.The derived multiple linear predictive equations were validated by bootstrapping and LOOCV.
The results are shown in Table 10.Table 10 shows that the equations have satisfactory predictive power for future, unknown data since validation R 2 values are above 0.5, except for weaker results regarding TN [69].

Discussion
The multiple linear equations derived from the regression analysis indicate that weather-related variables play an important role in predicting water-quality parameters.In fact, many weather variables bear more importance than the multispectral variables do.The relative importance of each variable is provided in Table 9.If all the weather variables are stripped from Table 9, the predictive variables related to Landsat bands alone provide only coefficients of determination, R 2 , of 0.53, 0.26, and 0.36 for TSS, TN, and TP, respectively.
Kloiber et al. [10] found that both B1 and the ratio B3/B1 can be used to predict the Secchi disk transparency, which is closely related to TSS.From Kloiber et al. [10], the regression model containing B3/B1 and B1 predicted Secchi disk transparency with R 2 of 0.75.We also found B3/B1 as the dominant important variable in determining TSS concentrations, but did not find B1 as one of the significant prediction variables.Kloiber et al. [10] accrued a higher R 2 than our study since Kloiber et al. limited their in situ data collection to ±1 day from the corresponding satellite image acquisitions.
In the current study, the predictive equation that includes B3/B1 alone has a R 2 of 0.53 for TSS because our available data only allows in situ samples to be ±7 days from satellite image acquisitions.
Considering weather variables successfully boosted R 2 to 0.68, such that it was comparable with that of Kloiber et al. [10] (i.e., 0.75).
For TSS, we found the instantaneous wind speed, W, to be an important prediction variable.
Since the instantaneous wind speed is chosen, instead of the daily mean wind speed between the image date and the water-quality sampling date (Wmean), it indicates that the instantaneous effect of wind (such as the surface ripple effect) is more important to TSS determination than the long-term heat-exchange effect.Even though the difference between the water surface temperature and the daily mean air temperature between the image date and the water-quality sampling date is selected as one of the prediction variables, it is of little importance in the model.It was chosen because the default forward-selection method has a lenient inclusion criterion (p = 0.25).
Dewidar and Khedr [9] determined that the band ratio B2/B1 is important in determining the TN concentration in brackish lagoons.However, the correlation between B2/B1 and TN was low in Dewindar and Khedr [9], with a correlation coefficient of 0.298.B2/B1 was also chosen by this study as one of the predictor variables, but B2/B1 still bears little predictive power as shown in Table 9.In contrast, the daily mean wind speed between the image date and the water-quality sampling date (Wmean) and water surface temperature (Ts) were determined to be the two most important predictor variables for TN prediction.
The high importance of water surface temperature Ts fortified the hypothesis that water temperature is related to the growth of microorganisms.The high importance of the daily mean wind speed between the image date and the water-quality sampling date (Wmean) and date difference (Doff) indicate that temperature change due to accumulated heat flux between the image date and sampling date is important.Referring to Equations 8 and 9, the mechanism involved should be the latent heat flux because latent heat flux (Equation 8) and sensible heat flux (Equation 9) are the only two components in the heat flux budget that involve wind speed.Latent heat flux is a main component of heat exchange between water and the atmosphere, and sensible heat plays a much lesser role [70].
As for TP, similar to TN prediction, the water surface temperature Ts still bears considerable importance.However, wind speed and Doff are not as important for estimating TP as it is for TN prediction, as the relative importance of variables in Table 9 indicates.It is intriguing that the square of noon solar altitude, Alt 2 , has high importance in TP prediction.Referring to Equations 6 and 7, this implies that long wave radiation cooling correlates well with TP prediction.The weak importance of wind speed and a strong importance of solar altitude for TP prediction jointly suggest that long wave radiative cooling constitutes the main process important for predicting TP concentration.
It is worth noting that, for TP prediction, optical multispectral data variables yield insignificant prediction power, and most of the prediction power is contributed by Alt 2 and Ts.
From the analyses above, it seems that determinations of TN and TP are influenced by quite different components of the net heat flux between air and water.For TN, latent heat flux seems to be the dominant factor, but long wave radiative cooling seems to be the dominant factor for TP.In other words, the difference in how the budget of cumulative heat flux between the image and sampling dates is constructed has different effects on water-quality constituents related to TN or TP.This can be reasoned by two effects: 1. TP concentration is highly correlated with chlorophyll-a [71], which is the photosynthetic pigment in algae or phytoplankton.However, optimal algae growth is at a depth within the water column [72], which means the transient change in heat flux due to wind speed at the surface will correlate poorly with algae growth.
2. Solar altitude affects not only long wave radiation, but also the penetration depth of light into the water column [73], which affects growth of phytoplankton, or algae, in water.Since the offset in correlation between chlorophyll-a concentration and TP concentration has seasonal variations [74], including solar altitude Alt (or the closely correlated square of solar altitude Alt 2 ) in the predictive model can account for the seasonal offset.
The inclusion of Alt 2 in the model probably signifies the combined effect of both effects.

Field Application
To demonstrate the utility of water-quality monitoring by satellites via our proposed method, water-quality quantities for Lady Bird Lake on May 14, 2014 were estimated using Equations 11 to 13, respectively.This date was chosen because storms occurred the previous day and also the morning of the satellite flyover day before flyover time with cumulative rainfall depth of 27 mm, likely making it easier to discern the effect of urban stormwater runoff to the lake.Figures 6 to 8 give the respective predicted spatial distribution of TSS, TN, and TP concentrations.
The water quality in the northwestern part of the lake is generally better than that in the southeastern extent, which is expected as a result of urban runoff.Lady Bird Lake has three major

Conclusions
Multiple regression-derived equations using reflectance bands, band ratios, and environmental factors as predictor variables for concentrations of TSS, TN, and TP, respectively, were derived using a hybrid forward-selection method that considers VIF in the forward-selection process.Landsat TM, ETM+, and OLI/TIRS (Landsat 8) images were all used to derive the single set of equations.The coefficients of determination of the best-fitting resultant equations varied from 0.62 to 0.79.The predictive equations were also validated by bootstrapping and LOOCV with coefficients of determination in the range of 0.46 to 0.62.
Among all chosen predictor variables, B3/B1 has the strongest influence on the predictive power for TSS retrieval.The band ratio of B3/B1 was also selected by Kloiber et al. [10] in predicting Secchi disc transparency, indicating a correlation between Secchi disc transparency and TSS.Other reflectance bands and band ratios, such as B1, B2/B1, B4/B1, and B3/B2 are also influential in estimating TN and TP concentrations, but they are not dominant factors.
Environmental factors, such as wind speed and water surface temperature, were crucial in determination of water-quality parameters in this study.Inclusion of environmental factors allows usage of a single set of predictive equations across the seasons, as such predictive equations are innately adapted to the environmental changes for different seasons.The predictive equation will also likely to be more accurate because the pooling of all observation data.
The instantaneous wind speed, W, bears considerable importance in TSS determination, which is explained by wind-generated surface ripple effects.Water surface temperature Ts (derived from satellite remote-sensor thermal band image data) is important in determination of both TN and TP concentrations, as the growth of microorganisms in water is correlated with water nutrient concentrations.
However, the time offset between the satellite image-acquisition date and water-sampling date must be accounted for in water nutrient parameter (i.e., TN and TP) retrieval.The heat flux budget between air and the water surface was considered, and components in the budget equations were included in the forward-selection procedure.In additional to the predictor variables identified above, the daily mean wind speed between the image-acquisition date and water-sampling date (Wmean) and square of noon solar altitude (Alt 2 ) were identified as the most important predictor variables for TN and TP determinations, respectively.The time difference (in days) between the image-acquisition date and water-sampling date (Doff) was also chosen for TN and TP determination.
According to the heat flux budget equations, the inclusion of Wmean, Ts, and Doff indicates the dominance of latent heat flux in the determination of TN.On the other hand, the inclusion of Alt 2 , Doff, and Ts in the TP model is an expression of the higher weight of long wave radiation cooling in TP estimation.Since chlorophyll-a concentration is highly correlated with TP concentration, we hypothesized that latent heat cooling is less important in TP determination because phytoplankton has the highest growth rate at a certain depth in water, which is less correlated with transient heat flux from evaporation at the surface.
The results showed that: In the future, the hybrid forward-selection method can be further refined to entail a stricter criterion for the inclusion of predictor variables.The default p=0.25 incurred inclusion of a few predictor variables that were not significant in the final selection of variables.
In addition, inclusion of ancillary environmental factors involving long-term averaging, such as average wind speed (Wmean), into the regression models demonstrated that it is possible to satisfactorily estimate water-quality parameters, even when a large temporal offset between satellite image-acquisition and in situ water sampling exists.Currently, the recommended longest temporal window between remote-sensor image-acquisition and water-sampling date is approximately seven days [17].Since these environmental factors are part of the heat flux equations, including environmental factors in predictive equations means an active compensation in estimation error due to the temporal offset in collecting image and water-sample data.This hypothesis needs further testing as part of future research efforts.
maintained at an approximately constant level by the pass-through Longhorn Dam [43].The surface area is ~173.6 hectares with a capacity of 905.1 ha-m.The mean depth is 6 meters, with a maximum depth over 11.7 meters [44].

Figure 1 .
Figure 1.Locations of water-quality sampling stations (i.e., Sites AC, CC, DC, and EC) on Lady Bird Lake.

3 . 4 .
Air temperature (in K): both instantaneous temperature at the time of satellite image acquisition, and daily mean air temperature between the image date and the water-quality sampling date are considered; Wind speed (in m/s): both instantaneous wind speed at the time of satellite image acquisition and the daily mean wind speed between the image date and the water-quality sampling date are considered; and 5. Noon solar altitude (in degrees): the mean noon solar altitude between the image date and the water-quality sampling date.

Figure 2 .
Figure 2. Flow chart of the hybrid forward-selection process for selecting predictor variables in multiple regression analysis.The derived multiple linear equations were then validated by bootstrapping and Leave-One-Out Cross Validation (LOOCV).Bootstrapping and LOOCV are both resampling methodologies [66].Bootstrapping assumes that samples (i.e., observations -sets of response and associated prediction variables in this case) represent the whole population, so a random resampling from the samples provides a prediction of what one expects to encounter (statistically) from unknown, future data.LOOCV leaves one sample out at a time and calibrates for the coefficients of predictor variables based on the rest of the observations.The left-out sample is used for validation.
tributaries in the metropolitan Austin area: Barton Creek, Shoal Creek, and Waller Creek.The confluence points of the three streams are indicated in Figures6-8.Barton Creek entails an extensive green belt around its riparian zone, and strict development regulations are in force because it is located within the Edwards Aquifer recharge zone[75].As a result, there is no marked change in TSS, TN, and TP at the confluence point of Barton Creek, relative to proximal areas of the lake.To the contrary, the confluence points of Shoal Creek and Waller Creek show significant increase in TSS, TN, and TP.This illustrates the effects of amount of conservation efforts spent on each watershed.The influence of Shoal Creek is more visible in Figures6-8than that of Waller Creek because Shoal Creek has a larger drainage area[76].Such details in spatial distribution can only be achieved via satellite-derived water-quality predictions and can serve as the precursor examination for more detailed water-quality examinations.

1 . 2 . 4 .
Environmental factors can constitute important ancillary variables in water-quality parameter estimation based on satellite remote-sensor images; By including environmental factors, it is feasible to pool all observation data to create a single set of predictive equations, and use it to estimate water quality for all seasons; 3. A single set of predictive equations can be determined to retrieve year-round water-quality parameters (i.e., TSS, TN, and TP) with satisfactory accuracy from Landsat TM, ETM+, and OLI/TIRS imagery on the same lacustrine water body; The derived predictive equations are robust enough to withstand the drastic change in the environment over 30+ years (1983 to 2015) while population in the metropolitan area almost tripled (from 373,000 in 1983 to 900,000 in 2015) over the same period of time[42]; and 5. Including VIF as part of the forward-selection process comprises a reliable methodology for choosing predictor variables.

Table 2 .
Summary statistics from in situ USGS water-quality stations in Lady Bird Lake, Texas, USA, over the time period 1983-2015.

Table 3 .
Secchi disc transparency measurements for in situ USGS water-quality stations in Lady Bird Lake, Texas, USA, over the time period 1983-2015 Site Code # of measurements Mean (m) Std.Dev.(m)

Table 4 )
were selected.This threshold rainfall depth is chosen based on the initial abstraction rainfall depth for a watershed with a runoff curve number of 80, since most of the urbanized area around Lady Bird Lake is residential[46].Residential districts with small lot sizes (1/4 to 1/8 acre) have a curve number ranging from 61 to 92, depending on the soil hydrologic group[47].Rainfall depth below this threshold is considered to generate insignificant runoff, and thus should have no marked effect on water quality in the lake.

Table 4 .
Dates of Landsat TM and ETM+ satellite images utilized and respective corresponding water-quality samples.
, they do not cover all dates of interest in this research.Surface temperatures have been continuously recorded and archived by Camp Mabry Austin City Airport and Austin Bergstrom International Airport every hour over the past30 years [52].Therefore, atmospheric models were selected based on the surface air temperature at the time when each satellite image was acquired (Table5).The initially-selected December 20, 2001 image was excluded from subsequent processing because it yielded negative reflectance values after FLAASH atmospheric correction.

Table 5 .
Selection of FLAASH atmospheric model based on measured surface air temperature.

Table 6 .
Comparison between image and surrogate dates in atmospheric profile determinations.

Table 8 .
Look-up table for variable abbreviation and description of variables.

Table 10 .
Validation results of the predictive equations.