Predictive performance of regression models to estimate Chlorophyll-a concentration based on Landsat imagery

Chlorophyll-a (Chl-a) concentration is a key parameter to describe water quality in marine and freshwater environments. Nowadays, several products with Chl-a have derived from satellite imagery, but they are not available or reliable sometimes for coastal and/or small water bodies. Thus, in the last decade several methods have been described to estimate Chl-a with high-resolution (30 m) satellite imagery, such as Landsat, but a standardized method to estimate Chl-a from Landsat imagery has not been accepted yet. Therefore, this study evaluated the predictive performance of regression models (Simple Linear Regression [SLR], Multiple Linear Regression [MLR] and Generalized Additive Models [GAMs]) to estimate Chl-a based on Landsat imagery, using in situ Chl-a data collected (synchronized with the overpass of Landsat 8 satellite) and spectral reflectance in the visible light portion (bands 1–4) and near infrared (band 5). These bands were selected because of Chl-a absorbance/reflectance properties in these wavelengths. According to goodness of fit, GAM outperformed SLR and MLR. However, the model validation showed that MLR performed better in predicting log-transformed Chl-a. Thus, MLR, constructed by using four spectral bands (1, 2, 3, and 5), was considered the best method to predict Chl-a. The coefficients of this model suggested that log-transformed Chl-a concentration had a positive linear relationship with bands 1 (coastal/aerosol), 3 (green), and 5 (NIR). On the other hand, band 2 (blue) suggested a negative relationship, which implied high coherence with Chl-a absorbance/reflectance properties measured in the laboratory, indicating that Landsat 8 images could be applied effectively to estimate Chl-a concentrations in coastal environments.


Introduction
Coastal environments are highly productive and complex marine ecosystems because they show the interaction of various natural and anthropogenic phenomena that provide an important source of nutrients for phytoplankton and aquatic organisms, as well as for various human activities. Nevertheless, during the last decades, studies have demonstrated that these a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 La Paz is 12 km in length, 5 km in width in an area of 45 km 2 with respect to sea level average. Morphologically speaking, the water mouth is formed by two parallel channels in their connection with the Bahía de La Paz of approximately 4 km in length and 0.6 km in width in total with an average depth of 7.0 m [36].

Field data collection
The in situ data collection was done synchronously with the overpass of Landsat 8 satellite, which passes by this zone every 16 days at approximately 17:47 UTC. Thus, field trips were made two hours before and after 17:47 UTC to avoid the effect of Chl-a variability related to tides and local currents. The Chl-a concentration was measured near the surface (~50 cm deep), taken with the multi-parameter sensor RBRmaestro model XRX-420 produced by RBR Ltd in Ottawa, Canada. No specific permissions were required for our study locations/activities since Chl-a data was collected in non-protected or private locations of the study area.
Twelve field campaigns were performed for over one year of monitoring due to bad weather (mainly high cloudiness), and field trips did not take place some dates of the period of study. Table 1 shows the dates of the field trips made, as well as some descriptive statistics of Chl-a measured in situ for the six arbitrary areas (polygons) used for time series analysis (see Fig 1 for details).
The study area was in the Landsat ID scene: LC8034043 (Path = 34, Row = 43). The images were acquired from 2016-08-24 to 2017-06-08, and only were those without cloud cover selected and cropped to highlight the study area (Fig 1). Landsat 8 images were imported and processed with the raster library [37] from programming language R [38] version 3.3.2 to obtain water pixel remote sensing reflectance of each one of the selected bands, which was calculated by using the equations in Landsat 8 user manual [13].

Statistical modeling
This study used linear regression (LR) and generalized additive models (GAM) to develop a model to estimate Chl-a concentrations from in situ data and spectral reflectance of Landsat 8 bands 1-5 images. LR explored the linear relationship between response and predictor variables; GAM explored linear or non-linear relationships between response and predictor variables throughout smooth functions (e.g. thin plate regression spline). Assuming that error terms (residuals) were independent of the predictor variables, normally distributed with mean 0 and homoscedastic [39,40].  and near infrared (NIR, B5) as predictor variables. Values of Chl-a concentration were logarithmically transformed and used as dependent variable because some authors have widely described that chlorophyll showed a non-linear relationship with Landsat bands [41,42].
LR models were constructed using a single predictor variable or more than one to highlight the number of predictor variables; we named Simple Linear Regression (SLR) when one predictor variable was used in the model and Multiple Linear Regression (MLR) when two or more predictor variables were used. All data processing and statistical models were conducted in R [38]; mgcv library was used for GAM [43].
A different band combination was tested for SLR, MLR, and GAM since previous studies on Chl-a estimation from Landsat imagery suggested that addition, multiplication, proportion, or quadratic transformation of bands 1-5 gave good results in SLR [5,23,42,44,45]. Up to two bands were combined through permutation to be used as predictor variables in SLR. Thus, for SLR we tested a total of 250 different models (S1 Table). For MLR we used band combinations with two, three, four and five predictor variables using spectral bands 1-5 without transformation. For MLR we tested a total of 26 different models (S2 Table). For GAM we used each single band and band combinations using two, three, four and five predictor variables and tested a total of 31 different models (S3 Table). To date, GAM had not been used to estimate Chl-a from Landsat imagery. In general, the models can be represented as follows: where y i was the expected value of log-transformed Chl-a concentration (μg � l -1 ); α = intercept; ß p were the coefficients of predictor variables (X p ), which were bands 1-5; and f p were smooth functions (thin plate regression spline) of the covariates; ε i the error terms (residuals) were independent of X and they were assumed to be normally distributed with mean 0 and homoscedasticity. GAMs were used with Gaussian error distribution and identity link function.

Goodness of fit and predictive performance
The quality of model fit was assessed using the R squared (R 2 ) and adjusted R squared (adj. R 2 ). These statistics were used to describe the proportion of variance explained and could take values between 0 and 1. The difference between R 2 and adj. R 2 relied in that the latter used a penalization based on the number of parameters (predictor variables); this variation of R 2 was used because it had been demonstrated that R 2 always increased when a new predictor variable was added to the model, causing an overparameterization of models. Therefore, adj. R 2 was preferred for models with two or more predictor variables. Data splitting is an effective method for evaluating predictive performance of a given model in which a portion of the data is used to estimate model coefficients, and the remainder of the data is used to measure prediction accuracy of the model. In this study data were arbitrarily separated into two subsets, training data Thus, after fitting models on the training data, their performance was measured against test data.
Predictive performance of the models was evaluated using root-mean-square error (RMSE) and correlation coefficient (R) (S4-S6 Tables); these values were calculated using observed and predicted data from the test dataset. Predicted data was computed using coefficients or smooth functions of all models; thus RMSE of all models was on the same scale (log-transformed Chla). RMSE was defined as:

RMSE ¼
ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 1 n Finally, to evaluate model assumptions, we analyzed residuals of the best fitted SLR, MLR, and GAM, respectively to identify if they were normally distributed, homoscedastic and had the presence of outliers.

Chlorophyll-a estimation from Landsat 8
Using the best fitted model and all Landsat 8 scenes available and cloud free for the study area, we obtained the estimated log-transformed Chl-a for the period May 2013-October 2017; log-transformed Chl-a was returned to its original scale by applying the exponential function.
To describe spatial and temporal variability of estimated Chl-a, we analyzed time-series data on six arbitrary areas (polygons) in the study area (Fig 1). These polygons were defined considering geographical features: (1) a semi-closed coastal lagoon (Ensenada de La Paz); (2) the channel that communicates Ensenada de La Paz with Bahía de La Paz; (3) the northernmost part of the city of La Paz; (4) the port city; (5) the second largest presence of mangroves of the study area; and (6) the southernmost portion of Bahía de La Paz locally known as El Mogote. Table 1 shows the descriptive statistics of Chl-a concentration measured during field trips, which ranged from 0.136 to 2.714 μg � l -1 ; the highest average value (1.652 μg � l -1 ) was recorded in 2017-06-08 and the lowest one (0.252 μg � l -1 ) in 2017-05-23; the highest values were recorded during a red tide event. As shown in Table 1, average Chl-a values were usually lower than 0.7.

Selection of the best fitted model
A total of 307 models were constructed and evaluated to identify which of them could explain the highest variance proportion of log-transformed Chl-a, inferred from R 2 and adjusted R 2 . Table 2 shows the three best fitted SLR, MLR, and GAM, respectively. SLR resulted in the lowest proportion of variance explained (R 2 = 0.000-0.542; adj. R 2 = -0.009-0.538); MLR in a higher proportion of variance explained (R 2 = 0.061-0.764; adj. R 2 = -0.044-0.753) while GAM resulted in the highest proportion of variance explained (adj. R 2 = 0.216-0.854). The SLR with the highest R 2 and adjusted R 2 was the ratio between B4 (red) and B1 2 (coastal/aersol), explaining 54.2% of total variance. The MLR with the highest R 2 and adjusted R 2 was that which included the five spectral bands, explaining 76.4% of total variance. The GAM with the highest adjusted R 2 was that which included four spectral bands (B1 [coastal/aerosol], B2 [blue], B3 [green], and B4 [red]), explaining 88.7% of total deviance.

Predictive performance
As mentioned in Methods, predictive performance was evaluated using Pearson coefficient of correlation (R) and root-mean-square error (RMSE) obtained from predictions of the training model on an independent data set. As shown in Table 3, the SLR with the highest R and lowest RMSE was the model with the ratio between B4 and B1 2 ; the MLR with the highest R and lowest RMSE was the one that included four bands (B1 [coastal/aerosol], B2 [blue], B3 [green], and B5 [NIR]); and the GAM with the highest R and lowest RMSE was the one that included B1, B2, and B3. These results indicated that three modeling approaches could predict logtransformed Chl-a with high accuracy. Fig 3 shows residual patterns of best fitted SLR, MLR, and GAM, respectively. As it can be observed, residuals of the three models seemed to have normal distribution with mean 0, no marked trend (homoscedasticity) in residuals versus fitted, and absence of outliers. Therefore, model assumptions seemed to be fine.

Optimal model for estimating Chl-a
Given the results of Tables 2 and 3 ) was considered as the best model to predict log-transformed Chl-a in the study area. Table 4 shows coefficients of this model, from which, we could infer that logtransformed Chl-a had a positive linear relationship with bands B1, B3, and B5 and negative linear relationship with band B2. Coefficients of this model suggested that B2, B3, and B1 had the strongest linear relationship with log-transformed Chl-a; on the contrary B5 had the weakest linear relationship with log-transformed Chl-a.       Chlorophyll-a based on Landsat imagery

Discussion
This study evaluated the use of Landsat 8 for estimating Chl-a concentration in the coastal water body located in northwestern Mexico by field data collection, simple linear regression, multiple linear regression and generalized additive models, using as response variable logtransformed Chl-a and reflectance values as spectral predictive variables of the visible part of light and NIR. The results obtained suggested that the use of spectral bands 1 (coastal/aerosol), 2 (blue), 3 (green), and 5 (NIR), from the MLR model, allowed us to reliably estimate the concentrations of Chl-a in a coastal environment.
To date, a large number of available publications have demonstrated that Chl-a can be estimated using Landsat satellite images by in situ data collection and using simple or multiple linear regression models, but something that attracted our attention was the great diversity of approaches that have been used for this purpose. For example, some authors have used simple models where the predictive variable was one of the spectral bands [31,32,46]; other authors suggested that the ratio of two spectral bands could be used as a good predictor of Chl-a [18,28,29,45,47]; finally, other authors suggested that various combinations of spectral bands in multiple linear regression models allowed a better estimation of Chl-a in aquatic environments [23,32,[45][46][47].
These approaches generated uncertainty as to which was the best one to estimate Chl-a in aquatic environments. This study used a statistical approach and the Chl-a absorption/reflection theory to create the best possible model, in such a way, that MLR was constructed using spectral bands where Chl-a had its greater absorption/reflection. According to what several researchers have demonstrated, Chl-a had its highest light absorption at wavelengths from 400-500 nm (blue) and 680 nm (red) and its maximum reflection up to 550 nm (green) and 700 nm (NIR). Thus, a negative correlation was expected between Chl-a and reflectance in the blue band; that is, the higher the concentration of Chl-a, the lower reflectance in this wavelength. On the other hand, a positive correlation between Chl-a and reflectance in the green and NIR bands was expected; in other words, the higher the concentration of Chl-a, the higher reflectance in these wavelengths [45,[48][49][50][51].
Initially, this study evaluated the linear correlation between log-transformed Chl-a and spectral reflectance in five wavelengths (coastal/aerosol, blue, green, red, and NIR) using Pearson correlation coefficients; however, two things that attracted our attention in the results obtained were (1) low correlation (r < 0.2) between Chl-a and the selected spectral bands and (2) correlation between Chl-a and the red and NIR bands, which had an opposite sign than that expected. In this regard, several authors have found higher or lower values of Pearson correlation coefficient and Chl-a; for example, Lim & Choi [44] found correlation values greater than 0.6 among the blue, green, red, and NIR bands and Chl-a; however, their results suggested inverse relationships because all correlation values were negative. On the other hand, Patra et al. [45] found correlations smaller than 0.5 and positive among blue, green, red and NIR bands and Chl-a. In both cases, the estimation of Chl-a was performed in freshwater bodies (rivers and lakes), which could have generated these differences with what was found in our study. Usually in freshwater bodies, such as rivers and lakes, turbidity (caused by particulate organic matter) is several times greater than in marine bodies [15,52,53].
Other authors have suggested that the combination of spectral bands by way of ratio (e.g. NIR/red) had a higher correlation with Chl-a [18,[44][45][46]54]. In this regard, our study found higher correlation values between Chl-a and the red and squared transformed coastal/aerosol (B4/B12) band ratio. Another interesting point in this study was the use of the coastal aerosol band (B1) because when it was included, the models increased the correlation value (R) and decreased the value of RMSE. This band was constituted by wavelengths that detect deep blue and violet very similar to the blue band characterized by low reflectance in environments with high Chl-a concentration. According to Slonecker et al. [26] and Loyd [55], this feature makes the band potentially important for investigating coastal phenomena.
As mentioned above, the selected MLR was used to perform statistical inference, in this particular case, to evaluate the linear relationship between spectral bands and Chl-a by using the coefficients of multiple linear regression models. Our results showed a high concordance between the observed (model) and expected (absorption/reflection properties of Chl-a) results, specifically the negative coefficient of the blue (B2) band and positive coefficients of the green (B3) and NIR (B5) bands. In this regard, Brivio et al. [48] and Lim & Choi [47] used multiple linear regression models to estimate Chl-a (among others) through the use of different Landsat spectral bands; however, the coefficients of the best model used to estimate Chl-a showed the opposite expected signs. For example, positive values with the blue (B2) and negative with green (B3) bands.
To date, many studies have addressed estimating Chl-a from images acquired by satellites, both in freshwater bodies and marine environments, obtaining promising results. Nonetheless, when comparing methods and results, they have shown a great discrepancy in the way Chl-a has been estimated from Landsat images; it may be due to the wavelength used (or proportions among them), the type of statistical method, or the type of environment where the study was performed. All indicates that the methods applied in a specific place or environment cannot be replicated similarly in another place and/or different environment, which suggests the need for greater field validation and spatial or temporal coverage, or plainly and simply a comparison of Chl-a estimation with a standardized method in different types of the aquatic environments required.

Conclusions
This study has evaluated the performance of simple and multiple linear regression and generalized additive models to estimate Chl-a concentration, using the first five bands of Landsat 8 images in the Bahía de La Paz, Baja California Sur, Mexico. The obtained results indicated that this method provided a reliable estimation of Chl-a in small coastal water bodies because of the high coherence found in model coefficients with the absorption/reflection properties of Chl-a evaluated in the laboratory under controlled conditions. Therefore, remote sensing has shown to represent an ideal opportunity to develop regional scale research on various parameters in environments estimated in small coastal water bodies to allow a constant monitoring at low cost and high-quality spatial scale.
Supporting information S1