Crop loss identification at field parcel scale using satellite remote sensing and machine learning

Identifying crop loss at the field parcel scale using satellite images is challenging: first, crop loss is caused by many factors during the growing season; second, reliable reference data about crop loss are lacking; third, there are many ways to define crop loss. This study investigates the feasibility of using satellite images to train machine learning (ML) models to classify agricultural field parcels into those with and without crop loss. The reference data for this study were provided by the Finnish Food Authority (FFA) and contain crop loss information for approximately 1.4 million field parcels in Finland, covering about 3.5 million ha, from 2000 to 2015. These reference data were combined with the Normalised Difference Vegetation Index (NDVI) derived from Landsat 7 images, in which more than 80% of the possible data are missing. Despite the difficulty of the problem and the extremely noisy data, among the four ML models we tested, random forest (with mean imputation and missing value indicators) achieved an average AUC (area under the ROC curve) of 0.688±0.059 over all 16 years, with the range [0.602, 0.795], in identifying new crop-loss fields based on reference fields of the same year. To our knowledge, this is one of the first large-scale benchmark studies of using machine learning for crop loss classification at the field parcel scale. The classification setting and trained models have numerous potential applications, for example, allowing government agencies or insurance companies to verify crop-loss claims by farmers and to realise efficient agricultural monitoring.


Introduction
Future food production is challenged by increasing demand for more sustainable agricultural systems that consider the environmental, economic and social dimensions of sustainability. Remote sensing data from spaceborne platforms offer a possibility to address these challenges [1]. Multispectral satellite remote sensing applications in agriculture date back to the early 1970s with the launch of Landsat 1 by the National Aeronautics and Space Administration (NASA). Applications include agricultural land use mapping [2], agricultural monitoring [3], leaf area index (LAI) and biomass estimation [4,5], precision agriculture [6], agricultural water management [7], estimation of crop yield [8][9][10][11][12][13], and crop damage assessment caused by floods [14,15] and lodging [16]. These studies have focused mostly on the regional scale due to spatially limited reference data at the field parcel scale. Only a few studies have considered the use of remote sensing data from spaceborne platforms to assess crop loss at the field parcel scale. For example, multispectral spaceborne remote sensing data from Landsat ETM+ were used by [17] to study the effect of sowing date and weed control during the fallow period on spring wheat yield in Mexico. Based on experiments on 100 fields across three seasons, they concluded that the effect of sowing date and weed control on yield can be estimated using multispectral spaceborne remote sensing data. Tapia-Silva et al. [14] studied crop losses due to floods using Landsat TM/ETM on 132 field parcels across 15 seasons and concluded that modelling crop loss was challenging. Data from the Sentinel-1 and 2 satellites were used by [18] for cyclone damage assessment on 200 coconut and 200 rice fields in India, with promising results.
Crop damages on 600 wheat fields were also studied by [19] in Greece using spaceborne multispectral imagery and ancillary geospatial data, but they faced the challenge of defining field parcels based on the low resolution imagery available from spaceborne platforms. Two recent studies [20,21] used Synthetic Aperture Radar (SAR) images to assess crop damage due to the 2020 derecho wind storm. They applied their method across the state of Iowa, United States, to estimate the area of crop damage and verified their methods with ground truth data from at most 14 fields. A common theme across many of the above studies is the limited amount of ground truth data from field parcels used to evaluate the proposed methods. This is understandable because collecting large amounts of ground truth data from many field parcels is challenging. In summary, to the best of our knowledge, remote sensing applications have been limited to specific crop-loss types based on their cause, such as flooding or drought, and are also limited to coarse spatial scales due to the lack of ground truth data. A large study on the use of remote sensing data for monitoring general crop loss (not specific to any causal factor) at the field parcel scale is desirable, because it can aid both public and private organisations, such as governments and insurance companies, in agricultural monitoring [22,23]. Such a study can also aid the effective implementation of the Common Agricultural Policy (CAP) reforms enacted by the European Commission (EC) [24]. In fact, the EC has started projects such as Sentinels for CAP (Sen4Cap), aimed at developing new remote sensing methods for agricultural monitoring at scale [25].
There are many drivers of crop loss. Yield potential in a given field and region depends on the crop and cultivar. In addition to soil type and weather conditions, the farmer's decisions have an impact on yield potential and the risk of crop loss. The risk of crop loss can be reduced by using quality seeds [26], planting well-adapted cultivars [27], matching crops to the most appropriate field parcels [28], as well as by adopting timely and accurate management practices such as sowing, crop protection and harvesting. For high-latitude agricultural systems the risks caused by variable weather are substantial, and total large-scale crop failures may occur once or twice a decade [29]. In this study, we use the crop loss definition of the Finnish crop damage compensation programme, which gives crop loss as a percentage of the area of the field parcel. This definition decouples crop loss from its driver, allowing the study of general crop loss irrespective of the driver.
This study aims to test the feasibility of combining machine learning (ML) models with optical satellite data to classify field parcels with and without crop loss. The reference data used in this study are described in the next section.

Study area and crop loss data
The study area includes the southern and western regions of Finland. The reference data on crop loss were provided by the Finnish Food Authority (FFA). The data consisted of field parcel ID, field boundary (see example area in Fig 3), area, crop type and variety, crop loss (as an area of the field parcel) and farm ID for the years 2000 to 2015. The data originate from the crop damage compensation system in Finland, which started in 1976 and lasted until 2015 [30]. The analysis was made with barley, as it has the most reference data among all the crops.
The crop loss data were collected through a self-reporting survey where the farmers reported crop loss as a percentage of the area of the field. This was processed into a binary variable where anything greater than zero percent indicated crop loss (1) and everything else indicated no loss (0). The number of field parcels with and without crop loss for each year from 2000 to 2015 is shown in Table 1. Over all years and the whole area of investigation, there were 33,840 field parcels (2.38%) with crop loss and 1,418,872 (97.62%) with no loss. We observed that larger fields reported loss more often than smaller fields, so the effect of field size on crop loss classification performance was examined (see the last part of Section 3.2).
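The labelling step above amounts to simple thresholding of the reported loss percentage; a minimal sketch with illustrative values (not actual FFA records):

```python
import numpy as np

# Hypothetical crop-loss percentages reported by farmers for five parcels
# (illustrative values only, not actual FFA data).
loss_percent = np.array([0.0, 12.5, 0.0, 100.0, 3.0])

# Anything greater than zero percent is labelled as crop loss (1),
# everything else as no loss (0).
label = (loss_percent > 0).astype(int)
```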
The reference data also includes reasons for crop loss that were sporadically provided by the farmer.

Satellite data
Landsat 7 ETM+ (Enhanced Thematic Mapper Plus [31]) satellite data were chosen for the study to cover the area and time frame of the reference data. Landsat 7 was launched in April 1999 and is still operating as of May 2020. It carries a multispectral sensor, which provides 8 bands covering the visible, near-infrared and mid-infrared ranges as well as one thermal infrared and one panchromatic band. All bands are provided at a spatial resolution of 30 m except the panchromatic and thermal infrared bands, which are provided at 15 m and 60 m resolution, respectively. The revisit time of the satellite to a specific point on earth is 16 days.
All available surface reflectance products [32] from January 2000 to December 2015 were requested from the United States Geological Survey (USGS) and downloaded using the EROS (Earth Resources Observation and Science) Processing Architecture software. No filters were applied to the query other than the path and row indicators for the area of interest with the least spatial overlap. The query resulted in 597 scenes. An overview of the number of scenes acquired per year can be found in Table 2. The surface reflectance product also includes a Quality Assessment (QA) band indicating cloud cover based on the CFMask algorithm [33]. The QA band was used to generate a binary cloud mask. Note that the surface reflectance product is not processed when the solar zenith angle is larger than 76 degrees; data availability is thus limited, since the study area lies above 60° North.
In 2003, Landsat 7 experienced a scan line corrector malfunction, which influenced later acquisitions by introducing gaps with missing data in the scenes. However, field parcels in Finland are much smaller than this gap, and, therefore, no correction or filling of the gaps was performed. The gaps were interpreted as missing data for each field located within the gap.

Data preparation
All Landsat 7 scenes were processed to create a data set in the format required to train and test the classification models, in four steps: extracting image sequences, computing NDVI time series, aggregating NDVI across time and imputing missing data. Fig 5 schematically shows these steps, which are described in detail below.

Extracting image sequences.
For each field parcel, the boundary information from the reference data was used to extract the corresponding image segments from the raster files. An image segment consists of all pixels within the field-parcel boundary. If a field parcel was in two Landsat 7 tiles, only one was kept to avoid overlap. This yielded a sequence of images (of seven bands: the six spectral bands and the pixel QA band) that were further processed to discard invalid pixels using the QA mask. These were mainly cloud pixels.
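The QA-based masking can be sketched as follows; the bit positions follow the Landsat Collection 1 pixel_qa convention and are assumptions here, so they should be checked against the QA documentation of the actual product:

```python
import numpy as np

def mask_invalid_pixels(band, qa, flag_bits=(0, 3, 5)):
    """Set pixels flagged in the QA band to NaN so they are treated as missing.

    flag_bits are fill (0), cloud shadow (3) and cloud (5) in the Landsat
    Collection 1 pixel_qa convention -- an assumption, not taken from the paper.
    """
    invalid = np.zeros(qa.shape, dtype=bool)
    for bit in flag_bits:
        invalid |= ((qa >> bit) & 1).astype(bool)
    masked = band.astype(float)
    masked[invalid] = np.nan  # flagged pixels become missing values
    return masked

# Toy 1x3 segment: second pixel flagged as cloud (bit 5), third as shadow (bit 3).
band = np.array([[100, 200, 300]])
qa = np.array([[0, 32, 8]])
out = mask_invalid_pixels(band, qa)
```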

Computing NDVI time series.
The multispectral images were used to compute the Normalised Difference Vegetation Index (NDVI) band according to the formula NDVI = (ρnir - ρred) / (ρnir + ρred), where ρnir and ρred are the pixel values of the near infrared (central wavelength 0.77-0.90 μm) and red (central wavelength 0.63-0.69 μm) bands, respectively. This process resulted in a sequence of NDVI images for each field parcel. From these NDVI image sequences, we obtain the NDVI time series x″ by taking the median NDVI pixel value of each image.
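A minimal sketch of this step, assuming the band arrays have already been extracted for one parcel and one acquisition, with invalid pixels set to NaN:

```python
import numpy as np

def ndvi(nir, red):
    # NDVI = (rho_nir - rho_red) / (rho_nir + rho_red), computed per pixel.
    nir = nir.astype(float)
    red = red.astype(float)
    return (nir - red) / (nir + red)

def field_ndvi(nir, red):
    # One value per acquisition: the median NDVI over the parcel's valid pixels.
    return np.nanmedian(ndvi(nir, red))

# Toy two-pixel parcel segment (illustrative reflectance values).
nir = np.array([0.8, 0.6])
red = np.array([0.2, 0.2])
value = field_ndvi(nir, red)
```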

Aggregating NDVI across time.
The temporal resolution of x″ refers to the frequency at which Landsat 7 scenes were captured. This is mainly a function of the revisit frequency (16 days) of the satellite and cloud cover. As a result, the time series x″ of different fields have different lengths and their time indices are not aligned. To address these two problems, we perform temporal averaging as follows. First, we form a new time scale from 1 to 365 (corresponding to each day of a year) within which each time series x″ is located based on the time of capture. Then, the new time scale is divided into d bins and the NDVI values within each bin are averaged, yielding the binned time series x′. Example time series are shown in Fig 6, where red and blue lines represent fields with and without crop loss, respectively. It can be seen that, unlike a typical time series, there are many holes in x′. That is, even after temporal averaging, the time series of each field parcel has many missing values. If we take the average of all the red and all the blue curves, we see the general pattern of the aggregated NDVI time series for the two classes, as shown in Fig 7. In each year, the top-right value shows the Pearson correlation (referred to as NDVI-corr) between the red and blue curves, indicating a high correlation between them in each year. These high correlations indicate the difficulty of classifying the parcels into those with and without loss. NDVI-corr is used later in Section 3.2 for exploring the factors that explain classification performance.
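The temporal averaging can be sketched as below; the number of bins d is illustrative, and empty bins remain NaN, producing the holes in x′:

```python
import numpy as np

def bin_time_series(days, values, d=30):
    """Average NDVI observations into d equal bins over days 1..365.

    `days` are days of year of the acquisitions, `values` the NDVI series x''.
    Bins with no observation stay NaN (the holes in x'). d is illustrative.
    """
    edges = np.linspace(1, 366, d + 1)
    binned = np.full(d, np.nan)
    idx = np.clip(np.digitize(days, edges) - 1, 0, d - 1)
    for b in range(d):
        selected = values[idx == b]
        if selected.size:
            binned[b] = selected.mean()
    return binned

# Toy series: two early-season and one mid-season observation.
days = np.array([10, 15, 200])
values = np.array([0.2, 0.4, 0.6])
binned = bin_time_series(days, values, d=5)
```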

Imputing missing data. 2.3.4.1 Missing data problem.
Missing data are a common problem when dealing with satellite images due to cloud cover and other data acquisition problems. This is especially problematic for northern countries like Finland due to the low sun angle. In ideal circumstances, such as no cloud cover and no acquisition problems, the time series length is approximately 22 time steps (assuming an average revisit period of 16 days for Landsat 7). However, in our case, the average length of the time series is 4 due to missing data, i.e., around 82% of the data are missing. Fig 8 shows the missing data profile for different years. In some years, e.g., 2003 and 2004, the problem is more severe, with more than 90% of the data missing. We observe that for all years, the data at the beginning and end of the year are likely to be missing. This phenomenon can be explained by the low illumination angle during winter, for which no surface reflectance product is processed. For barley, these missing data points during winter should have little effect, since its heading time is around the beginning of July, while maturity is reached around the middle of August in this part of Finland [34]. We also see that the pattern of missing data is different for the two classes in all years, suggesting that information on the location of missing values can help improve the classification.

Imputation methods for missing data.
We address the missing data problem in x′ through mean imputation (Mean). The procedure is described using the following matrix notation for clarity. The time series x′ of all the fields form a matrix X′ whose columns are the new time indices described in Section 2.3.3 and whose missing entries correspond to the holes in the time series. These missing values are filled by the corresponding column mean of X′, yielding the matrix X. Each row x_i of the matrix X is the imputed time series of field i.
Apart from mean imputation we also experimented with two other imputation methods: 1) missing data indicator (MI) and 2) multiple imputation by chained equations (MICE). MI generates a binary matrix M of the same size as the data matrix X, indicating the positions of missing values. MICE is an iterative method which regresses each variable (column of X′) over the others in a round-robin fashion to compute the missing values [35]. These methods are compared in Section 3.1 to identify the best imputation strategy for classification. The imputation methods of the Scikit-learn library [36] were used for the experiments. Note that, due to the severity of the missing data, time series interpolation methods were not considered to fill the missing values: the time series are short (average length 4), with many instances consisting of only one or two observations, and in these cases interpolation is not meaningful. Instead, the different imputation methods above were used to fill the missing values.
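The Mean+MI strategy maps directly onto Scikit-learn's `SimpleImputer` with `add_indicator=True`; a small sketch with toy values standing in for X′:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy stand-in for X': binned NDVI time series of three fields with NaN holes.
X_prime = np.array([[0.2, np.nan, 0.6],
                    [np.nan, 0.5, 0.4],
                    [0.4, 0.3, np.nan]])

# Mean+MI: fill each hole with its column mean and append binary
# missing-value indicator columns (the MI matrix).
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X = imputer.fit_transform(X_prime)
# X has 3 imputed NDVI columns followed by 3 indicator columns.
```

MICE corresponds to `sklearn.impute.IterativeImputer` (enabled via `sklearn.experimental.enable_iterative_imputer`), which fills each column by round-robin regression on the others.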

Classification models
Given a data set D = {(x_i, y_i)}, i = 1, …, n, with n observations, the task is to learn a model f: x ↦ y such that p(y_i) = f(x_i; θ), where θ is a set of hyperparameters. We compared several classification models, namely Logistic Regression (LR), Decision Trees (DT), Random Forest (RF) and Multilayer Perceptrons (MLP) [36]. Due to the large amount of missing data, which makes the time series very short, we model the time points as independent features and did not consider time series models (such as autoregressive models or recurrent neural networks) in this work. All models were implemented using the Scikit-learn (version 0.22.1) Python library [36], and the details of their optimisation and model comparison are given in Section 3.1.
To compare the models, we focused on the data from year 2015 because it has the least amount of missing data (see Fig 8). Also, the class imbalance in 2015 is less severe than in other years (see Table 1). We then created a balanced data set (with 13,152 fields) containing all 6,576 crop-loss fields and an equal number of no-crop-loss fields (sampled from all no-crop-loss parcels by undersampling). This balanced data set was used for cross-validation only in the model comparison.
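The undersampling used to build the balanced set can be sketched as follows (the random seed and array layout are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is illustrative

def balance_by_undersampling(X, y):
    """Keep every crop-loss row (y == 1) and an equally sized random
    sample of no-loss rows (y == 0), mirroring the balanced 2015 set."""
    loss = np.flatnonzero(y == 1)
    no_loss = rng.choice(np.flatnonzero(y == 0), size=loss.size, replace=False)
    keep = np.concatenate([loss, no_loss])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy data: 2 crop-loss and 8 no-loss parcels with two features each.
X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
Xb, yb = balance_by_undersampling(X, y)
```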
Before comparing the models, we first optimised the hyperparameters of each model. Table 3 shows the hyperparameters that were optimised, along with the respective ranges considered and the optimal values. The results are detailed in Section 3.1.

Performance metrics
The performance measure used to evaluate the classification models was the area under the receiver operating characteristic (ROC) curve (AUC) [38]. AUC takes a value in the interval [0, 1], where a random classifier scores around 0.5 and a perfect classifier scores 1. AUC is insensitive to class imbalance, which is important for this study as the class imbalance is high. Further, it has a standard scale independent of the number of data points and the distribution of classes, so models trained on data from different years are directly comparable even though the numbers of loss and no-loss fields differ in each year. We used 10 × 10-fold cross-validation (CV) to compute the AUC of each model in all experiments (except in Section 3.3). In K-fold cross-validation, the data are first split into K non-overlapping folds. The model is then trained on K - 1 folds and tested on the remaining fold. This is repeated K times so that the model is tested on each of the K folds. The model performance is given by the average of the K AUC values. We used K = 10 and repeated 10-fold cross-validation 10 times with different random permutations of the data.
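The 10 × 10-fold CV procedure can be sketched with Scikit-learn as follows; the stratified splitting and the synthetic data are assumptions for illustration, as the paper only specifies repeated 10-fold CV:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for one year's imputed NDVI matrix and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 10 repeats of 10-fold CV -> 100 AUC values, one per train/test split.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=20, random_state=0),
                         X, y, scoring="roc_auc", cv=cv)
mean_auc = scores.mean()  # the reported per-year AUC is this average
```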

Within-year classification
In within-year classification, the task is to determine whether a field (for which crop loss information is unavailable) experienced crop loss, using fields for which crop loss data are available in the same year. We used all data of each year for cross-validation, so training and test parcels are from the same year. The AUC value for each year was computed using the 10 × 10-fold cross-validation procedure described in Section 2.4.

Between-year classification
Collecting reference data is expensive, so it would be useful to identify crop-loss fields in one year based on reference data from a different year or years. This between-year classification setting is closer to future prediction of crop loss in a field than within-year classification. We considered two cases: single-year and multiple-year training. In the single-year case, all data from one year were used for training while all data from another year were used for testing. In the multiple-year case, test data are taken from one year and training data from the remaining 15 years; for example, if the test data are from 2015, the training data are from 2000 to 2014. Note that cross-validation was not used for between-year classification.
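A sketch of the multiple-year training case; the `data` layout (a mapping from year to feature matrix and labels) and the synthetic stand-in data are hypothetical, not the paper's actual data structure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def between_year_auc(data, test_year):
    """Multiple-year training: fit on all years except `test_year`,
    then evaluate AUC on `test_year` (no cross-validation)."""
    X_test, y_test = data[test_year]
    X_train = np.vstack([X for year, (X, _) in data.items() if year != test_year])
    y_train = np.concatenate([y for year, (_, y) in data.items() if year != test_year])
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Synthetic stand-ins for three years of imputed NDVI features.
data = {year: make_classification(n_samples=80, n_features=5, random_state=year)
        for year in (2013, 2014, 2015)}
auc = between_year_auc(data, 2015)
```

The single-year case is the same idea with `X_train, y_train` taken from one training year only.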

Model comparison
We first compared the machine learning models and imputation methods to find the most appropriate model and imputation strategy to be used throughout this work. As part of identifying the optimal imputation strategy, we also decided whether or not to include indicators specifying the locations of missing values as part of the input to the model. Table 4 (mean AUC of 10 × 10-fold CV for different models and imputation methods) shows AUCs for the different combinations of model and imputation strategy. The method with the highest AUC was the combination of RF and Mean+MI (mean imputation and missing value indicators), followed by the pair of MLP and Mean+MI. Table 5 shows the computation time of the different methods, indicating the significant computational advantage of Mean and Mean+MI over MICE. Focusing on Mean+MI, we further compared the models over all 15 remaining years (see Table 10 in the Appendix for details). Although RF and MLP were overall the two most accurate methods, taking computation time into account we concluded that the combination of RF and Mean+MI is the recommended model for crop-loss classification, and we used this combination for the remainder of the experiments in this study.

Missing data profile correlations:
We further checked the similarity of the missing data profiles of the two classes using MD-corr, the Pearson correlation between the missing data profiles of the two classes. Fig 9 plots AUC against MD-corr, showing that AUC decreases as MD-corr increases, meaning that not only the amount of missing data but also the pattern of missing data affects classification performance.
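One way to compute such a profile correlation, assuming MD-corr is the Pearson correlation between the per-time-bin missing fractions of the two classes (our reading, not a definition taken verbatim from the paper):

```python
import numpy as np

def md_corr(X_prime, y):
    """Pearson correlation between the per-bin fractions of missing values
    in the crop-loss (y == 1) and no-loss (y == 0) classes."""
    missing = np.isnan(X_prime)
    p_loss = missing[y == 1].mean(axis=0)   # missing fraction per bin, loss class
    p_no_loss = missing[y == 0].mean(axis=0)
    return np.corrcoef(p_loss, p_no_loss)[0, 1]

# Toy example: both classes share the same missingness pattern -> MD-corr = 1.
X_prime = np.array([[np.nan, 0.2, 0.3, np.nan],
                    [np.nan, 0.2, 0.3, np.nan],
                    [np.nan, 0.5, 0.6, np.nan],
                    [np.nan, 0.5, 0.6, np.nan]])
y = np.array([1, 1, 0, 0])
corr = md_corr(X_prime, y)
```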
We can thus see that NDVI correlation, missing data ratio and missing data profile correlation are important factors affecting classification performance. We also hypothesise that similar missing data patterns may indirectly indicate similar weather conditions or geographical closeness of different fields, which might be useful for classification. We note that using missing data indicators as input to the classification model is feasible in practice, as they will be available at the same time as the satellite images themselves. However, missing data can arise for many reasons, such as cloud cover, data not processed to surface reflectance and scan line errors, so we cannot draw a causal relationship between the missingness pattern and crop loss even when the classification accuracy is high.
3.2.1 Impact of field parcel area. We further examined the potential impact of field area on within-year classification, since larger fields are more likely to be crop-loss fields and this bias may affect classification performance. For this experiment, we focused on data from 2004, 2008, 2012 and 2015 (each had >3% crop-loss fields; see Table 1) to ensure that the class imbalance problem is not exacerbated when the data are divided by area. The field parcels were divided into three groups depending on their area: small (<1 ha), medium (≥1 ha and <3 ha), and large (≥3 ha). Table 7 shows the number of fields in the three groups for these four years. Table 8 shows the performance results, indicating that in each year the AUCs were approximately consistent with those obtained by using all data in Table 6 (2004: 0.602, …).

Table 9 shows the sixteen AUC values (one for each test year) obtained by the multiple-year training procedure, along with the corresponding average and best AUCs of single-year training. We can see several years in which the multiple-year AUC is better than the average single-year AUC, but for all years the multiple-year AUC is worse than the best single-year AUC. This result implies that combining data from multiple years does not improve between-year classification.

Discussion
We have trained machine learning models to classify field parcels with and without crop loss, using NDVI values derived from Landsat 7 data. Several models were compared and tested in two different scenarios, namely within-year and between-year classification. The results showed that within-year classification is feasible, while between-year classification remains hard. The resulting models have many applications; for example, they can be used by insurance companies or government agencies for verifying crop-loss claims. However, their performance will be contingent upon the availability of good reference data that capture all possible variations in environmental and biophysical conditions. Our study showed that a major challenge in improving the classification performance is the amount of missing data. Fig 9 shows the possibility of achieving a high AUC if the missing data ratio were low. In our case more than 80% of the data are missing, and despite that, the average within-year AUC was 0.688, with the possibility of increasing to about 0.9 when the missing data problem is less severe. The issue might be alleviated by using newer spaceborne multispectral data, such as those from Sentinel 2, whose higher temporal and spatial resolution could mitigate the effect of missing data. Furthermore, Sentinel 2 data can be combined with radar data from Sentinel 1 to obtain a denser time series, avoid occlusion from clouds and provide more detailed information about the growing pattern of agricultural fields. Obviously, such data are available only for the most recent years, and therefore cannot be combined with our crop loss data, which cover many years but only until 2015.
Between-year classification allows identification of crop-loss fields without any reference data from the same year, so improving its performance would be a good future target. Our result in Section 3.3 implies that NDVI data alone might not be sufficient for improving the performance of between-year classification. Incorporating temperature and precipitation data along with the crop-loss reasons can give a more complete picture of the factors affecting crop loss. Based on the promising results of within-year classification, augmenting high resolution satellite data with weather variables would be useful in achieving high performance for between-year classification. Another direction in which our work could be extended is the use of more flexible machine learning models. For example, convolutional neural networks could use the full image as input, and recurrent neural networks could explicitly model the time dependency. These models have the potential to increase the classification performance. However, they require a large amount of data to train properly and cannot be directly applied to our data, which have a huge amount of missing values and consist of relatively short time series. The machine learning models in our study are simpler, but they are more robust against the limitations of our data.

Conclusion
We have demonstrated the feasibility of training a machine learning model to identify crop loss at the field parcel scale using NDVI data derived from Landsat 7 satellite images. Experiments across sixteen years showed that field parcels from a given year can be classified into those with and without crop loss when the model is trained on data from the same year. However, the ability to classify parcels from other years is limited. Missing data, which made up more than 80% of our satellite observations, deteriorated the classification performance. Preliminary analysis indicated that within-year classification performance could be improved if the missing data ratio were reduced. As long as we can assume that the NDVI time series of the barley crop is similar, with minor variation due to environmental and biophysical conditions, we believe that the findings of this study can be generalised to barley fields in other countries.

Model comparison for all years
In Section 3.1, we performed model comparison on 2015 data to determine the best model. Table 10 extends these results to the other 15 years. Here all models were implemented with the Mean+MI imputation strategy. We see that for all years RF and MLP have similar AUC values and outperform DT and LR. However, MLP was less robust than RF, i.e., it did not converge even after 500 iterations for several years (2005, 2007, 2009, 2011, 2013, 2014) and took longer to train.

Fig 11 shows the observed and predicted crop loss of 1000 field parcels for 2008. We chose 2008 because it has 1) a 'sufficient' number of crop-loss parcels to divide into train and test sets, 2) relatively little missing data and 3) the highest AUC. The following procedure was used to generate the plot. First, the data were randomly split into two sets: train and test. The test set consisted of 500 random crop-loss and 500 random no-crop-loss field parcels. All the remaining data formed the train set, which was used to train the model (combination of RF and Mean+MI). Then the ROC curve of the trained model was analysed and the 'optimal' classification threshold determined as the one that maximises the sensitivity and specificity of the model (also called Youden's J-index [39]). Finally, the trained model and the classification threshold were used to classify the field parcels in the test set. It should be noted that Youden's J-index is just one way to determine the optimal operating point on the ROC curve. The optimal point of a classifier is a trade-off between the cost of detecting false positives and the cost of failing to detect positives, which is always application dependent.
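The Youden's J-index threshold selection described above can be sketched as follows (toy labels and scores, for illustration only):

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, scores):
    """Operating point maximising Youden's J = sensitivity + specificity - 1,
    which equals tpr - fpr along the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return thresholds[np.argmax(tpr - fpr)]

# Toy labels and predicted crop-loss probabilities (illustrative only).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
threshold = youden_threshold(y_true, scores)
predicted = (scores >= threshold).astype(int)
```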