Mapping land cover change over continental Africa using Landsat and Google Earth Engine cloud computing

Quantifying and monitoring the spatial and temporal dynamics of the global land cover is critical for better understanding many of the Earth’s land surface processes. However, the lack of regularly updated, continental-scale, and high spatial resolution (30 m) land cover data limit our ability to better understand the spatial extent and the temporal dynamics of land surface changes. Despite the free availability of high spatial resolution Landsat satellite data, continental-scale land cover mapping using high resolution Landsat satellite data was not feasible until now due to the need for high-performance computing to store, process, and analyze this large volume of high resolution satellite data. In this study, we present an approach to quantify continental land cover and impervious surface changes over a long period of time (15 years) using high resolution Landsat satellite observations and Google Earth Engine cloud computing platform. The approach applied here to overcome the computational challenges of handling big earth observation data by using cloud computing can help scientists and practitioners who lack high-performance computational resources.


Introduction
Quantifying and monitoring the spatial and temporal dynamics of the global land use land cover (LULC) is essential for better understanding many of the Earth's land surface processes. Human-induced land cover changes may affect land surface processes including urbanization, drought, and flood which impact the world's population [1,2]. Understanding these changes can allow quantifying and monitoring trends in agriculture [3], fresh water resources [4], forest cover [5][6][7], and disease transmission [8,9]. Lack of high spatial resolution (30 m) and regularly updated LULC data limit our ability to quantify the spatial extent and monitor the temporal dynamics of these changes. Earlier attempts to generate continental-scale LULC products were limited to coarse spatial scale (250m-1km) which lacked sufficient spatial details [10][11][12][13].
There is a need for a high spatial resolution (30 m) and regularly updated LULC data to improve our understanding of changes in the land surface processes. However, there are challenges in generating global to continental scale LULC maps from high spatial resolution PLOS  Landsat observations that span over a long period of time. These challenges have included the lack of freely available earth observation data and high-performance computing to process and analyze these data. Starting from 2008, the United States Geological Survey (USGS) has been distributing Landsat data at no cost for public use. This created an opportunity for scientists and practitioners to use Landsat satellite data to map and monitor LULC at continental to global scale and over a long time series [5,7,14]. Since the USGS released the Landsat data archive for public use, there has been an increasing trend in efforts to map land use land cover using these data [15][16][17][18]. However, most of these efforts have been limited to local scale analyses due to the computational requirements of analyzing large volumes of data. For example, nearly~10,000 Landsat scenes or~3 terabytes are required to produce a global land cover map for any given time point [19]. In recent years, there has been an increase in the availability of high performance cloud computing. For example, the NASA Earth Exchange (NEX) platform allows the processing and analysis of NASA earth observation data [20]. Amazon Web Service (AWS) also now provides access to the Landsat data archive, enabling analysis of this dataset on the cloud. Google Earth Engine (GEE) is a new high-performance computing platform which gives access to a vast and growing amount of earth observation data as well as processing power to analyze these data at planetary scale.
The launch of these high-performance cloud computing platforms has opened the door to global-scale geospatial data storage, processing and analyzing at a low cost and efficient manner in the cloud [7,19,21]. One of the first global scale applications of the Landsat data archive was a study by Hansen et al, which used Landsat satellite data and machine learning to map global forest cover over the 2000-2012 period [7]. Although the focus of their study was to quantify global scale changes in forest cover, theirs is the only recent effort to apply high spatial resolution (30 m) Landsat satellite observation data for mapping global scale land cover over a long period of time (i.e. from 2000-2012). In our study, we present the use of Landsat satellite observation data and GEE cloud platform to map land cover and impervious surface changes over continental Africa for 2000-2015.

Earth observation data
In this study, Landsat 7 Enhanced Thematic Mapper Plus (ETM +) surface reflectance data which was computed using the Landsat Ecosystem Disturbance Adaptive Processing System (LEDAPS) were used [22]. A cloud screening algorithm was applied using quality assessment (QA) bands in order to remove snow and cloud contaminated pixels for each Landsat image between 1999 and 2016. Annual composites were then produced by taking the median value from images from the target year, plus or minus one year. For example, for the year 2000, pixel values were obtained by calculating the median of all cloud-free pixels from images between January 1 st , 1999 and December 31 st , 2001. A three-year window was used to ensure that at least one cloud-free pixel was available for each annual composite. Similarly, in addition to producing annual raw image composites, annual normalized difference vegetation index (NDVI) and normalized difference water index (NDWI) were computed by taking the median value from 3-year windows. Additionally, annual inter-calibrated night-time light images were used from the National Oceanic and Atmospheric Administration (NOAA) [23][24][25].

Training data
As used by a number of other studies [8,[26][27][28], training data were derived from visual inspection of freely available high spatial resolution imagery. Recently captured (2015-2016) high-resolution satellite imagery were visually inspected and used to identify Landsat pixels that were entirely made up of one of 7 classes (water, impervious surface, high biomass, low biomass, rock, sand and bare soil) to act as training data. Impervious surface class included asphalt roads, concrete, metal roofs and other built infrastructure while low biomass class included crop field, grass, and shrubland. High biomass class consisted of dense forest. In an effort to ensure that these training data were representative of the classes across the continent, we ensured that training data were captured from all 49 countries in continental Africa, with the aim of capturing 1000 training points per class. For model validation purposes, 80% of sample points (5,664) from each class were randomly selected to act as training data, with 20% of sample points (1,420) withheld as a validation dataset (Fig 1).

Modeling
For each of the training and validation data points, Landsat spectral bands, NDVI, NDWI and night-time light from the 2015 layers were extracted to be used as covariates in the model ( Fig  1). To model and predict the 7 LULC classes, a random forest decision tree classification algorithm was deployed on GEE. Decision tree classification algorithms have been used widely to classify satellite data for forest cover mapping [28,29], wetland mapping [8,26,30,31], cropland mapping [32], and land cover mapping [33][34][35][36]. A random forest is a nonparametric machine learning method comprised of ensembles of decision trees [37]. The random forest algorithm creates multiple decision trees which classify a random subset of the training data according to the covariate predictors. In our study, we used 500 trees in the random forest classification. The final classification was based on the majority vote from all the trees. To generate final LULC layers across Africa 2000-2015, training and validation data were combined into a single dataset before the model was run, with annual predictions made from this model using the annual covariate layers. Having made annual predictions, the changes in the area represented by each class from 2000 to 2015 across continental Africa were calculated. This also allowed for an exploration of the direction of change for each LULC class from 2000 to 2015.
As well as modeling 7 classes, estimating the likelihood of a given pixel being impervious was a focus of this study. To do this, the 6 non-impervious surface classes in the training and validation data were collapsed into a single class, resulting in a binary outcome of impervious and non-impervious. A random forest model was then applied to the binary outcome. This model was used to predict the probability (i.e. the proportion of times the model voted) that a pixel was impervious. As for predictions of the 7 LULC classes, final predictions across Africa 2000-2015 were made by combining training and validation data before running the model. All analyses were performed on GEE cloud platform.

Model validation
Using the validation data, a confusion matrix was generated to evaluate predictive accuracy across classes as well as overall accuracy. This allowed an assessment of user's accuracy (the number of correctly classified pixels divided by the total number of pixels predicted within that class) and producer's accuracy (the number of correctly classified pixels divided by the total number of pixels truly in that class) for each class. For the model focused on impervious surface, an assessment of predictive accuracy was made by calculating both a confusion matrix as well as area under the curve (AUC) statistic. The AUC statistic represents the probability that a randomly selected truly impervious pixel will be ranked higher by the model than a truly non-impervious pixel. As such, it provides a measurement of the discriminatory power of the model. In addition to validating the model using out of sample predictions, we compared our predictions to maps of percent tree cover generated by Hansen [7]. This dataset was chosen as it was conducted at a comparable spatial (30 m) and temporal (annual) resolution. As the Hansen study was focused on percent tree cover, rather than LULC, this comparison focused on forest cover only. To compare results, we first classified the Hansen product for 2000 in to high biomass (greater than 30% tree cover) and non-high biomass (less than 30% tree cover) using a threshold of 30% tree cover. This threshold was chosen because it is the closest to our high biomass class. We then compared our high biomass class predictions for 2000 to the high biomass class derived from the Hansen data. Furthermore, we compared our 2010 prediction maps with the 2010 Global Land Cover product (GlobeLand30) at 30 meter spatial resolution which was generated using Landsat data [38]. The GlobeLand30 product consists of 10 classes including water bodies, wetland, built-up, cropland, snow, forest, shrubland, grassland, bareland, and tundra. We chose to compare our 2010 land cover product with the 2010 Globe-Land30 product since theirs is available only for years 2000 and 2010.

Land use land cover classification
A total of 7,084 data points were captured across the 7 classes, resulting in 5,664 training data points and 1,420 validation data points. Table 1 shows the confusion matrix of observed versus predicted values using the withheld validation data. The model achieved an overall accuracy of 88 percent (Table 1). Class specific user's and producer's accuracies ranged from 84-94% and 79-96% respectively. Water appeared to be most accurately predicted, with user's and producer's accuracy of 94 and 96% respectively, while low biomass was least accurately predicted with user's and producer's accuracies of 84 and 79% respectively.
Annual LULC predicted maps that consist of seven classes (impervious surface, low biomass, high biomass, bare soil, sand, rock, and water) for the 2000-2015 period were generated based on the random forest classification model. Comparison of high biomass class with the Hansen forest map for the 2000 period showed good agreement. Almost 80% of the total high biomass predicted by our model (4,174,958 Km 2 out of 5,156,559 Km 2 ) matched with the Hansen high biomass class (percent of tree cover greater than 30%) as shown in Fig 2. Generally, there was a good agreement between products in the Congo Basin and western Africa whereas there were mismatches in the East African highlands and Nile Delta regions, with our model predicting more high biomass in these regions. For the 3 sites selected for comparison purposes (Fig 3), a total of 1,465 Km 2 were classified by the GlobeLand30 as built-up while our model predicted a total 926 Km 2 as impervious surface for the 2010 period. Additionally, a Table 1  Change analysis showed that impervious surface class had the highest relative increase from 2000 to 2015 (38.54%) while high biomass and rock had the highest decreases from 2000 to 2015 with a decrease of around 17% (Fig 7 and Table 2).There was relatively little change in the areas covered by water or sand. An exploration of annual change in the probability of being impervious suggests that changes over the study period were in both directions as shown in four selected regions between 2000 and 2015 (Fig 6). High biomass and rock showed a declining trend whereas impervious surfaces, low biomass, and bare soil showed increasing trend.

Impervious Surface Low Biomass High Biomass Bare Soil Sand Rock Water Total User's Accuracy (%)
Growth in the impervious surface was mainly due to conversion from low biomass class to impervious surface. As shown in Figs 8 and 9, sand and water classes were the most stable and did not show substantial change. This study also showed that the decrease in high biomass class was predominantly due to high biomass becoming low biomass (Fig 9).

Discussion
Regularly updated, high resolution, and continental-scale land cover data are essential for quantifying and monitoring the Earth's dynamic land surface processes. Despite the free availability of high spatial resolution Landsat satellite data, generation of continental to global scale LULC maps have been challenging due to the need for high-performance computing to store, process and analyze such a large volume of satellite data. As such, earlier efforts to produce land cover products at this scale have been limited to coarse spatial resolution. Platforms such as GEE have opened the door to planetary scale analyses and offer the opportunity to provide a mechanism to continually monitor the Earth's surface at high spatial and temporal resolution. The utility of GEE to quantify various land surface changes has been demonstrated for forest mapping [7], population mapping [39], cropland mapping [40], and surface water mapping [21]. Here we presented an approach to quantify continental land cover change over a long period of time (15 years) using GEE and Landsat satellite observations.
In the present study, we produced annual LULC maps for continental Africa between 2000 and 2015 showing that the continent has dramatically changed during the study period. Our results indicated area covered by high biomass class showed a declining trend during the study period. These findings were in concordance with Hansen et al who showed a dramatic forest loss between 2000 and 2012 period in subtropical Africa [7]. In an effort to compare our product with existing data, we made comparisons with the Hansen data. While not the same products, a comparative assessment of our high biomass class and the high biomass class generated from the Hansen product (greater than 30% tree cover) for the 2000 period showed 80% agreement. Overall, our product showed very good agreement with the Hansen data. Additionally, comparison of our product with the GlobeLand30 showed good agreement. However, some of our land cover classes and classes from the GlobeLand30 product were different. For example, our low biomass class was represented in three separate classes including wetland, cropland, and grassland in the Globeland30 data. As a result, we focused our comparison with our high biomass class and their forest class as well as our impervious surface and their built-up class. Furthermore, there also appeared to be a large relative increase in man-made impervious surfaces during the 2000-2015 periods. These changes in impervious surface were mostly the conversion of low biomass areas to impervious surfaces. This increase has occurred steadily throughout the period of study, which appears in line with increasing urban growth in Africa [41].
Multi-temporal satellite observations and cloud-based analytic platforms such as GEE could enable cost-effective production of LULC products at low cost and efficient manner. The main advantages of using GEE include reduced processing times and enhanced capacity to perform global scale analysis on high resolution data. There are, however, challenges to using GEE. Some of these challenges include mastery of JavaScript and Python programming languages, limited number of GEE built-in functions, and lack of integration for the current GEE platform with other open source geospatial analytic tools such as R, and QGIS. Although GEE archives a large library of earth observation and geospatial data, data is not available in real-time which may limit its utility to some operational applications that need real-time data access.
While this study suggests that the approach taken here can be used to produce continental LULC products, it should be pointed out that this study has a number of limitations. Firstly, training and validation data were only available for 2015. Having more data points throughout the time period would likely improve the accuracy of the annual maps. This would also allow a better assessment of predictive performance through time. That said, as the spectral signature   https://doi.org/10.1371/journal.pone.0184926.g008  of these 7 classes is unlikely to have changed through the time period, predictions are likely to be robust. Secondly, the study relied on visual inspection of high-resolution imagery to produce training data; ground truth data from the field were not available. This may have resulted in some misclassification. Additionally, relying on a single image to represent a 1 year period ignores the fact that some training data points may move between classes seasonally (i.e. bare soil in the dry season and low biomass in the wet season). Thirdly, we restricted our analysis to 7 LULC classes which may limit the utility of the current LULC product to some applications. With accurate training data on a wider variety of LULC classes, it may be possible to use the approach described here to produce maps of more than 7 classes used here. Although this study encountered the aforementioned limitations, the standardized approaches demonstrated here and model validation results indicated that the LULC maps presented in this research had high prediction accuracy. The approach used here to overcome the computational challenges of handling big earth observation data can help scientists and practitioners who lack high-performance computational resources. Future studies can expand on our study and apply our approach to generate global-scale land cover products. The LULC product presented in this study will be freely available (https://geodata.globalhealthapp.net/lulc/) for public use and can be used in other applications to monitor changes in disease transmission, natural resources, biodiversity, and urbanization.