Shortwave solar radiation is an important component of the surface energy balance and provides the principal source of energy for terrestrial ecosystems. This paper presents a machine learning approach in the form of a random forest (RF) model for estimating daily downward solar radiation flux at the land surface over complex terrain using MODIS (MODerate Resolution Imaging Spectroradiometer) remote sensing data. The model-building technique makes use of a unique network of 16 solar flux measurement sites in the semi-arid Reynolds Creek Experimental Watershed and Critical Zone Observatory, in southwest Idaho, USA. Based on a composite RF model built on daily observations from all 16 sites in the watershed, the model simulation of downward solar radiation matches well with the observation data (r² = 0.96). To evaluate model performance, RF models were built from 12 of 16 sites selected at random and validated against the observations at the remaining four sites. Overall root mean square error (RMSE), bias, and mean absolute error (MAE) are small (ranges: 37.17 to 81.27 W/m², -48.31 to 15.67 W/m², and 26.56 to 63.77 W/m², respectively). When extrapolated to the entire watershed, spatiotemporal patterns of solar flux are largely consistent with expected trends in this watershed. We also explored significant predictors of downward solar flux in order to reveal important properties and processes controlling downward solar radiation. Based on the composite RF model built on all 16 sites, the three most important predictors of downward solar radiation are the black sky albedo (BSA) near infrared band (0.858 μm), the BSA visible band (0.3–0.7 μm), and clear day coverage.
This study has important implications for improving the ability to derive downward solar radiation through a fusion of multiple remote sensing datasets and can potentially capture spatiotemporally varying trends in solar radiation that are useful for land surface hydrologic and terrestrial ecosystem modeling.
Citation: Zhou Q, Flores A, Glenn NF, Walters R, Han B (2017) A machine learning approach to estimation of downward solar radiation from satellite-derived data products: An application over a semi-arid ecosystem in the U.S. PLoS ONE 12(8): e0180239. https://doi.org/10.1371/journal.pone.0180239
Editor: Dafeng Hui, Tennessee State University, UNITED STATES
Received: March 20, 2017; Accepted: June 12, 2017; Published: August 4, 2017
Copyright: © 2017 Zhou et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data can be publicly accessed through the following website: ftp://ftp.nwrc.ars.usda.gov/reynolds-creek-datasets/climate/hourly-provisional/.
Funding: This work was supported by the NASA EPSCoR program, Award #NNX14AN39A. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Shortwave (0.3–5.0 μm) solar radiation is the principal source of energy to drive photosynthesis in Earth’s terrestrial ecosystems. As such, characterizing the measurement and spatiotemporal variation in solar fluxes is important in physics, biology, chemistry, hydrology, and other natural sciences. Additionally, solar radiation is the largest component of the available energy to drive evaporation from the surface, underscoring its importance as a variable that connects land-atmosphere fluxes. Because of its role in controlling surface energy balance, moreover, solar radiation indirectly contributes to soil microbial processes through its impact on ground heat flux and the subsurface distribution and dynamics of soil temperature. Changes in solar radiation are associated with global biogeochemical cycling through impacts on photosynthesis. Analyses of tropical Net Primary Production (NPP), for instance, suggest that increasing solar radiation has led to increases in NPP. Evapotranspiration (ET) is dependent on downward solar radiation, which provides the energy to evaporate water. Based on previous studies [2, 3], both pan evaporation and downward solar radiation have decreased over the last 50 years.
Downward solar radiation flux is also an important land surface parameter for ecological, land surface hydrology, and weather forecast models such as the Community Land Model, Biome-biogeochemical (Biome-BGC), Photosynthesis evapotranspiration-biogeochemical model (PnET-BGC) [6–8], general circulation models (GCMs), and the Weather Research and Forecasting Model (WRF). Within these models the downward solar radiation flux is either required as an input parameter or, in the case of WRF and other coupled land-atmosphere models, produced as an output parameter. The accuracy of the input downward solar radiation directly affects the accuracy of model outputs related to surface energy budgets, such as primary production and evapotranspiration, and indirectly affects other parameters such as infiltration, runoff, and chemical solutions in the stream water [9, 11–12]. Models of coupled land-atmosphere dynamics, such as WRF, produce solar radiation fluxes as an output, capturing the impact of clouds on surface solar radiation either through parameterizations or by explicitly resolving clouds. The ability to verify model-predicted solar radiation at the surface against observational information, therefore, would enhance the ability to assess and characterize errors in both land surface hydrological states and fluxes and in the effects of simulated atmospheric conditions on the attenuation of solar radiation from the top of the atmosphere. Observational information used for verification of input or output surface downwelling solar flux would ideally capture spatiotemporal patterns in solar radiation at spatial resolutions approaching those of the model being used. However, most observational solar flux information is available only at the point scale.
The ability to deduce spatial correlates of solar radiation from networks of point-based surface observations and use that information to generate spatiotemporal predictions of downward solar flux would, therefore, substantially improve land modeling efforts.
Traditionally, three different methods have been used for obtaining downward solar radiation information, all of which have strengths and limitations. Ground-based pyranometers, which use a voltage-generating thermopile that is excited by exposure to solar radiation, are a relatively inexpensive way to obtain estimates of hemispherical solar radiation flux. While they provide accurate estimates of solar radiation with high temporal resolution, networks of pyranometers are typically not available in sufficiently high spatial coverage to resolve spatial patterns. Sparseness in spatial coverage is particularly prevalent in complex and mountainous terrain where placing monitoring stations is logistically challenging. An alternative method for calculating solar radiation is to use mathematical or empirical models. The method of Hargreaves and Samani uses maximum and minimum daily temperature to estimate the downward solar radiation. Although this empirical method is relatively simple and can be applied with commonly available meteorological observations, it is based on the assumption that solar radiation is related to the difference between maximum and minimum temperature and the fraction of extraterrestrial radiation received at the ground level, which results in model uncertainties. Other models such as the Angstrom-Prescott model [15–16] use site-specific model parameters to obtain downward solar radiation. However, these parameters are derived from ground-based measurements and are therefore limited by those measurements. Finally, a number of studies have used remote sensing data to estimate downward solar radiation using the split window technique [18–19] or the lookup table method. The major advantage of using remote sensing information is that it provides spatiotemporal coverage of the land surface, which potentially supports the development of long-term databases of downward solar radiation.
However, the split window technique requires parameterizations of surface variables such as air temperature and vapor pressure, and many parameters are assumed constant in space and time. Additionally, validation of estimates of solar radiation derived from remote sensing data is difficult because there are few observational constraints other than the remote sensing data used as input to the method itself.
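The two empirical models named above can be illustrated with a minimal Python sketch. It assumes the commonly cited textbook forms, Rs = k_Rs · sqrt(Tmax − Tmin) · Ra for Hargreaves-Samani and Rs = (a + b · n/N) · Ra for Angstrom-Prescott, where Ra is extraterrestrial radiation and n/N is the fraction of possible sunshine hours. The function names and the default coefficients (k_Rs = 0.16, a = 0.25, b = 0.50) are illustrative assumptions, not values taken from this study; in practice these coefficients are site-specific, which is the limitation noted in the text.

```python
import math


def hargreaves_samani(t_max_c, t_min_c, ra, k_rs=0.16):
    """Estimate daily downward solar radiation from the diurnal temperature range.

    ra is extraterrestrial radiation; the result has the same units as ra.
    k_rs is an empirical coefficient (assumed default, not from this paper).
    """
    if t_max_c < t_min_c:
        raise ValueError("t_max_c must be greater than or equal to t_min_c")
    return k_rs * math.sqrt(t_max_c - t_min_c) * ra


def angstrom_prescott(sunshine_hours, day_length_hours, ra, a=0.25, b=0.50):
    """Estimate daily downward solar radiation from relative sunshine duration.

    a and b are site-specific regression coefficients (assumed defaults here).
    """
    if day_length_hours <= 0:
        raise ValueError("day_length_hours must be positive")
    # Clamp the sunshine fraction to the physically meaningful range [0, 1].
    fraction = min(max(sunshine_hours / day_length_hours, 0.0), 1.0)
    return (a + b * fraction) * ra


if __name__ == "__main__":
    ra = 400.0  # hypothetical extraterrestrial radiation for one day
    print(hargreaves_samani(t_max_c=25.0, t_min_c=9.0, ra=ra))
    print(angstrom_prescott(sunshine_hours=6.0, day_length_hours=12.0, ra=ra))
```

Both functions need only quantities available at a single station, which is why they are simple to apply but cannot by themselves resolve spatial patterns between stations.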
We propose here a complementary technique that integrates both ground-based and remote sensing observations to predict spatiotemporal patterns in downward solar radiation. The resulting method leverages the accuracy of ground-based pyranometers together with the spatiotemporal coverage afforded by remote sensing data. The method is specifically based on machine learning algorithms widely used in climatology and remote sensing [22–23]. Compared with traditional methods for estimating downward solar radiation, a machine learning approach holds several key advantages. A machine learning approach can (1) be used to identify those variables that are most powerful in describing spatiotemporal variation in downward solar radiation, (2) provide explicit mechanisms for quantifying uncertainties in predicted values of solar radiation, (3) leverage diverse kinds of remote sensing data including multispectral imagery and lidar-derived vegetation and elevation characteristics, (4) capture potentially non-linear relationships between independent and dependent variables, and (5) provide an assessment of model robustness.
The overarching goals for this study are to: (1) test the degree to which a machine learning approach using random forests can accurately develop predictive models of surface downwelling solar radiation using a combination of variables from remote sensing datasets, (2) understand and provide justification for the presence and prevalence of predictor variables used in the random forest model, (3) analyze the uncertainties in predictions of surface downward solar radiation, and (4) use the random forest model to extrapolate predictions to the scale of an entire watershed and assess the derived spatiotemporal patterns.
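The site-level hold-out evaluation summarized in the abstract (models built from 12 of 16 sites chosen at random, then scored against the remaining four using RMSE, bias, and MAE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `split_sites` and `error_metrics` are hypothetical, bias is assumed to be the mean of predicted minus observed, and the random forest itself is omitted, so the predictions in the usage example are made-up numbers.

```python
import math
import random


def split_sites(site_ids, n_train=12, seed=0):
    """Randomly assign whole sites to model building or validation."""
    sites = sorted(site_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train = set(rng.sample(sites, n_train))
    test = [s for s in sites if s not in train]
    return sorted(train), test


def error_metrics(observed, predicted):
    """Return RMSE, bias, and MAE for paired observations and predictions.

    Bias is computed as mean(predicted - observed); this sign convention
    is an assumption, since the source text does not state one.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [p - o for o, p in zip(observed, predicted)]
    n = len(diffs)
    return {
        "rmse": math.sqrt(sum(d * d for d in diffs) / n),
        "bias": sum(diffs) / n,
        "mae": sum(abs(d) for d in diffs) / n,
    }


if __name__ == "__main__":
    train_sites, test_sites = split_sites(range(1, 17))
    print("build on:", train_sites, "validate on:", test_sites)
    # Made-up daily fluxes (W/m²) standing in for held-out observations
    # and the corresponding model predictions.
    print(error_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 330.0]))
```

Splitting by site rather than by individual daily record keeps every observation from a validation site out of model building, so the scores reflect how well the model transfers to locations it has not seen.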
2.1 Research area
The Reynolds Creek Experimental Watershed (RCEW) covers 239 km² in the rangelands of the Owyhee Mountains in southwestern Idaho, USA (Fig 1). The US Department of Agriculture’s Agricultural Research Service (ARS) established RCEW in 1960 as an experimental platform to understand and characterize impacts of rangeland management activities on hydrology, ecology, and geomorphology. Since its establishment RCEW has been the focal point of many studies focusing on terrestrial vegetation, soil science, hydrology, and hydroclimatology, and most recently it has served as a Critical Zone Observatory (CZO). The primary drainage of the watershed, Reynolds Creek, flows generally from south to north. Elevation in RCEW ranges from 1099 m at the outlet weir to approximately 2093 m at the southern end of the watershed. The RCEW is a semi-arid ecosystem dominated by sagebrush-steppe at lower elevations and large stands of coniferous and deciduous trees at higher elevations in the watershed. Mean annual precipitation varies greatly in both amount and phase in RCEW, with about 240 mm falling (primarily as rain) at lower elevations in the watershed and greater than 1100 mm falling (primarily as snow) at higher elevations [25;