Assessing alternative methods for unsupervised segmentation of urban vegetation in very high-resolution multispectral aerial imagery

To analyze types and patterns of greening trends across a city, this study seeks to identify a method of creating very high-resolution urban vegetation maps that scales over space and time. Vegetation poses unique challenges for image segmentation because it is patchy, has ragged boundaries, and high in-class heterogeneity. Existing and emerging public datasets with the spatial resolution necessary to identify granular urban vegetation lack a depth of affordable and accessible labeled training data, making unsupervised segmentation desirable. This study evaluates three unsupervised methods of segmenting urban vegetation: clustering with k-means using k-means++ seeding; clustering with a Gaussian Mixture Model (GMM); and an unsupervised, backpropagating convolutional neural network (CNN) with simple iterative linear clustering superpixels. When benchmarked against internal validity metrics and hand-coded data, k-means is more accurate than GMM and CNN in segmenting urban vegetation. K-means is not able to differentiate between water and shadows, however, and when this segment is important GMM is best for probabilistically identifying secondary land cover class membership. Though we find the unsupervised CNN shows high degrees of accuracy on built urban landscape features, its accuracy when segmenting vegetation does not justify its complexity. Despite limitations, for segmenting urban vegetation, k-means has the highest performance, is the simplest, and is more efficient than alternatives.


Introduction
In many cities, urban vegetation is changing with the adoption of green infrastructure programs. In some cases, these are multibillion-dollar programs [1] that subsidize or create incentives for installing green infrastructure (e.g., parks, bioswales, street trees, and rain gardens) or replacing impervious surfaces (e.g., asphalt) with pervious ones (e.g., grasses). Despite scale and cost, however, there is often little empirical evaluation of programs due to challenges in data quality and attribution. With very high-resolution maps of vegetation structure through a city it would be possible to ask questions like: how are greening programs changing the quantity and type of green within the city? how does an individual parcel-owner's participation in a greening program relate to neighborhood landscape change? how are greening programs driving ecological co-benefits, like habitat creation? To track patterns of urban greening across a city, including differential dynamics on public and private lands, this study examines if it is possible to leverage very high-resolution remotely sensed aerial imagery. Most existing studies evaluate urban greenness at spatial resolutions coarser than the coverage of urban green infrastructure. Many use Landsat imagery at a spatial resolution of 30 meters or MODIS imagery at resolutions of 250-1000 meters [2][3][4][5][6]. Zhou et al. [7] find that Landsat underestimates vegetative cover relative to 2.5-meter SPOT/ALOS by approximately 10-25%, however, which they attribute to small changes in land cover. These small changes are typical of urban greening programs. Myint et al. [8] describe, to classify objects like landscape features, the spatial resolution of the imagery "needs to be at least one-half the diameter of the smallest object." Identifying small changes in vegetation calls for spatially high-resolution imagery, but using commercial imagery can be expensive to scale across a broad spatial extent or over time. Commercial imagery providers also often lack a historical archive, limiting possibilities for timeseries analysis. Among publicly available, time-series data products in the United States, the broadband multispectral aerial imagery from the National Agricultural Imagery Program (NAIP) has the highest spatial resolution (0.6 or 1.0 meters). NAIP is the only publicly available data product with a spatial resolution sufficiently granular to delineate many urban vegetation features.
Thus far, there have been few studies that exclusively use the NAIP data product to identify urban vegetation without supplementary data. One notable exception is Basu et al. [9], who use NAIP spectral signatures with Conditional Random Fields, an unsupervised, machine learning technique, to identify tree canopy. They achieve a 74% accuracy across the state of California, spanning both urban and exurban spaces.
Within the urban environment, object-based landscape classification has been shown to resolve vegetation more accurately than pixel-based classifiers. After implementing an objectbased classifier on 2.4-meter Quickbird imagery, Myint et al. [8] see classification accuracy improve over 20% (to 90.4%) compared to a pixel-based classifier. Mathieu, Freeman, and Aryal [10] find that using an object-based classifier and 2.5-meter IKONOS imagery results in a classification accuracy of 90.7%. Combining multiple imagery sources and fuzzy objectbased classifiers, Li et al. [11] classify urban vegetation with 97.37% accuracy. Using NAIP imagery, Li et al. [12] create a 1-meter resolution land cover map with an object-based classifier supplemented with GIS parcel data (including land use attributes) for an overall accuracy of 91.86%. O'Neil-Dunne et al. [13] combine NAIP with LiDAR and thematic GIS data to identify urban tree canopy with up to 98% accuracy. Maxwell et al. [14] combine NAIP, derived spectral data, and thematic GIS data to map land cover types at 96.7% accuracy.
Using a supervised classifier and NAIP imagery, Robinson et al. [15] made the first highresolution, nationwide land cover map. They combine 1-meter labeled NAIP data from the Chesapeake Bay, Maryland, 30-meter Landsat imagery and 30-meter National Landcover Dataset labels. Across all land covers in two test locations in Maryland and Iowa, they achieve a mean accuracy upwards of 90%.
While object-based and supervised learning methods are the gold standard for image classification, the challenge is that they require abundant training data. Sometimes these are handengineered features, but most useful is labeled imagery that has the same spectral and spatial resolution as the studied imagery. Acquiring good quality, labeled training data can be prohibitively expensive and time consuming. Robinson et al [15] note that the labeled NAIP data used as the basis of their study took 10 months to generate and cost $1.3 million. The burden of requiring training data means that supervised learning methods are difficult to implement on less commonly used or emerging datasets. With the ongoing proliferation of new cameras, sensors, and datasets, there will always be a shortcoming of labeled data. As a result, unsupervised learning methods will continue to be a critical complement to supervised methods, even as libraries of training data grow.
We evaluate unsupervised methods for segmenting sub-parcel urban vegetation. We start with spectral clustering methods: k-means with k-means++ seeding and a Gaussian Mixture Model (GMM). Then, we use a backpropagating convolutional neural network (CNN) with simple iterative linear clustering (SLIC) superpixels. While CNNs are frequently used in supervised learning problems [16], they are less common in unsupervised learning despite great potential [17,18]. We compare k-means, GMM, and CNN segmentation by benchmarking classified data against internal validity metrics and hand-coded data.

Study area and data
The NAIP imagery evaluated in this study represents a region of approximately 40 square kilometers in west Philadelphia, Pennsylvania, USA (Fig 1). The area includes urban land uses (commercial, industrial, and residential) and undeveloped regions (parks, grassland, unmanaged soil, and open water), providing a diversity of land uses and land cover classes.
The image was acquired on August 18, 2015. It has a spatial resolution of 1 meter and four channels: blue (435-495 nanometers), green (525-585 nanometers), red (435-495 nanometers), and near infrared (808-882 nanometers). The radiometric resolution of the datasets is eight bits and the data are radiometrically calibrated, but not atmospherically corrected. Maxwell et al [19] review NAIP imagery in the context of land cover classification, noting that one of the most significant challenges when using NAIP is inconsistent digital numbers across images. We focus our study on one image, which we subdivide into 800x800 pixel tiles to improve processing times.
We examine the data through its four native channels and six indices shown in the literature to effectively identify vegetation ( Table 1). The Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI) enhance the vegetation signal. The Soil Adjusted Vegetation Index (SAVI) and Crust Index (CI) minimize background soil effects. The Atmospherically Resistant Vegetation Index (ARVI) and Visual Atmospheric Resistance Index (VARI) reduce atmospheric and topographic effects. With the indices, the final dimensions of the image are 7200 rows x 5600 columns x 10 channels.

Methods
The analysis proceeds in three steps. First, we mask out impervious surfaces (e.g., roads and buildings). Then, focusing on the pervious regions, we further segment the image. We apply kmeans, GMM, and a CNN to the NAIP images, creating five major segments. We assign a land cover class to each segment: (1) water/shadow, (2) soil/ground, (3) shrub/tree, (4) grass, and (5) other. Finally, we evaluate the segmentation patterns for accuracy.

Masking
To reduce the risk of spurious associations in the vegetation analysis, we mask out impervious surfaces. We use k-means clustering [26] with k-means++ seeding [27] to create two clusters based on NDVI values. One of the drawbacks of k-means is it requires an initial random seeding to place the first iteration of centroids from which to calculate the within-group and between-group variances. It is possible that the centroids may settle at local minima instead of the global minimum, resulting in poorly defined clusters. For more robust centroid seeding, rather than placing all centroids simultaneously, the k-means++ algorithm selects one centroid at random and then allocates subsequent centroids based on the prior centroids, given the first one. It is still subject to error, but less so than traditional k-means seeding. To model differences between traditional seeding and k-means++, we use an inertia score, which is the sum of squared distances to the closest cluster. We find that k-means++ leads to a lower inertia score, indicating a better fit (results available upon request).

Segmentation
Clustering with k-means. We use k-means with k-means++ seeding to define five hard clusters based on the original four bands of the image and six vegetation indices. K-means scales efficiently to higher dimensionality and additional channels present little computational burden.
We implement k-means in python using the PySpark library, a big data processing framework. We define three parameters: five clusters, a Euclidean distance metric, and a maximum of twenty iterations.
Clustering with GMM. We use GMM to define five soft clusters based on the ten channels. Soft clustering calculates the probability that each pixel falls within all clusters [28]. GMM conveys some advantages over k-means. Firstly, its cluster definitions are less vulnerable to outliers. Like k-means, centroid placement with GMM is iterative. Unlike k-means, GMM uses all pixel information to update centroid locations, not just the information within a given cluster. Secondly, because GMM assigns each pixel to all classes with a probability of membership, the secondary class membership may be useful in some image interpretation problems. In particular, examining the second-class membership after water/shadow may be a useful method for "seeing under" or removing shadows in some environments [29]. The drawback, however, is that GMM is more computationally intensive and does not scale well to high dimensionality.
We implement GMM in python using the PySpark library. We define five clusters, use posterior probability as the convergence metric, and set the maximum number of iterations to twenty.
Using a CNN to delineate vegetation types. Instead of defining class membership through clustering, the CNN assigns spatially continuous pixels and pixels associated with similar features to the same class, while minimizing the number of unique classes. We adapt the algorithm proposed by Kanezaki [18] for a backpropagating CNN, which is specifically designed for unsupervised segmentation. Fig 2 describes the segmentation of the image as it moves through the CNN. In Stage 1, the CNN applies rectified linear unit and batch normalization operations to each pixel in the input image (Stage 1, top), transforming it to a feature map (Stage 1, bottom). The CNN does this by first assigning a multi-dimensional vector to each pixel, then using an argmax classifier to assign a cluster label to each pixel based on the largest feature index. In Stage 2, the CNN applies SLIC superpixels to the feature map developed in the first stage, assigning each pixel the same label as the majority within its superpixel. This produces a new segmentation map (Stage 2, bottom). In Stage 3, the CNN calculates errors, which are propagated back through the network. The CNN repeats Stages 1-3 iteratively, reducing the number of cluster groups (initially, approximately 50) to five (Stage 3, bottom).
We run the CNN in python using the PyTorch library and the skimage library's implementation of SLIC. We parameterize the network architecture for the CNN to comprise three convolutional layers with each layer consisting of 100 convolutional filters. We set the learning rate to 0.1, Stochastic Gradient Descent as the optimizer, and the momentum parameter to 0.9. We use cross-entropy loss as the optimization function for the CNN. For the SLIC algorithm, we set the number of superpixels to 10,000 and the compactness of each superpixel to 10.

Evaluation
Segment uniformity and disparity. Unsupervised learning is most useful when there is inadequate training data, but this lack of labeled data also makes it challenging to evaluate algorithmic performance. To overcome the problem of labeled data, Zhang et al. [30] use internal image-segment validity metrics to evaluate segmentation quality. They suggest four criteria: 1. Regions should be uniform and homogeneous with respect to some characteristic(s); 2. Adjacent regions should have significant differences with respect to the characteristic on which they are uniform; 3. Region interiors should be simple and without holes; 4. Boundaries should be simple, not ragged, and be spatially accurate.
The first two criteria, "characteristic" criteria, relate to the structure of segmented objects, while the last two, "semantic" criteria, relate to object legibility. Unlike characteristic criteria, semantic criteria are application-or object-dependent. Neither of the semantic criteria is applicable to vegetated land cover, so we only consider characteristic criteria.
The Davies-Bouldin Index [31] evaluates both characteristic criteria: intra-cluster uniformity and inter-cluster disparity. It is an internal evaluation scheme, where the validation of how well the clustering has been done is based on quantities and features inherent to the dataset. We apply the DBI to the k-means, GMM, and CNN image segmentations.
Benchmarking against classified data. To compare the output from the unsupervised image segmentation methods against labeled data, we hand-code a set of pixels. We label three tiles, or 1,920,000 pixels, as belonging to one of the five land cover classes (water/shadow, soil/ ground, shrub/tree, grass, and other). Then, we generate evaluation metrics and confusion matrices to gauge the performance of each segmentation method for each landscape type.

Masking
The initial masking step removes all impervious surfaces and highly reflective pure water (not mixed with soil or vegetation) from the sample (Fig 3). All vegetation, ground/soil, and most water/shadow remain. Fig 4 presents segmentation results. K-means and GMM exhibit different segmentation patterns, with GMM emphasizing grass and identifying some river water as ground/soil. CNN produces a segmentation pattern very different from the other two with far fewer small segments. It does not successfully represent objects like trees in the middle of grass or slivers of ground/soil.

PLOS ONE
Benchmarking against classified data. Compared to the hand-coded data, k-means image segmentation outperforms both GMM and CNN, though there is variability from tile to tile (Table 3). K-means has the strongest performance on all the tiles, with an average accuracy of 80%. It performs best on Tile 3 (81%) and worst on Tile 1 (79%). Conversely, GMM performs best on Tile 1 (55%) and worst on Tile 3 (45%). Like k-means, CNN performs its best (69%) on Tile 3 but is only 28% accurate on Tile 2.
Not only is there variation in accuracy by tile, there is also variation by land cover class. Evaluation metrics by land cover class (accuracy, recall, precision and F1 scores) are available in Table 4. K-means classifies vegetation the best among the three methods, with grass at 80% accuracy and trees/shrubs at 78%. GMM is the weakest performing method, with grass at 26% accuracy and trees/shrubs at 55%. CNN segments grass at 74% accuracy and trees/shrubs with 49%. Confusion matrices, available in Table 5, reveal which classes the segmentation algorithms regularly mistake. For example, k-means regularly identifies water/shadow as grass, but GMM never does.

Discussion
The unsupervised segmentation methods evaluated in this study do not provide the same degree of accuracy for segmenting urban vegetation as has been achieved with some objectbased and supervised methods [8,10]. Even so, unsupervised k-means, GMM, and CNN methods are reasonable approaches for classifying some landscape features to create low-cost, scalable vegetation maps without training data. In particular, in this study area, k-means delineates urban vegetation with surprising accuracy compared to more complex alternatives ( Table 4).
One of the challenges of the segmentation problem is that many pixels in the study area likely represent multiple land cover types. This is exemplified by the pixels in the ground/soil and water/shadow class, which could have been dropped in the masking step if the spectral signature of the land cover included in the pixels aligned more closely with impervious surfaces and pure water. Instead, k-means finds the spectral signatures of these pixels align better with vegetation. It is possible this is partly due to errors in evaluating accuracy; some of the handcoded pixels may be misclassified as ground/soil since pixels representing dormant (nonphotosynthetically active) vegetation, dead vegetation, and soil are often spectrally similar [12,  32]. However, the pixels in the middle of the river that are not dropped in the initial masking (Fig 3, Tile 3) illustrate the mixed pixel problem most clearly. We believe these pixels are not masked from the analysis because there are degrees of both sedimentation and eutrophication present in the river. Immediately downstream of the study area is an old, low-head dam (Fig 3, visible in the lower left-hand corner of Tile 3) and the right bank of the river is heavily Table 4. Evaluation metrics. Accuracy is the true positives (correct predictions) over the total number of predictions. Recall (user's accuracy) is the fraction of true positives over the sum of true positives and false negatives. Precision (producer's accuracy) is the fraction of true positives over the sum of true positives and false positives. The F1-score ranges from 0-1, where 1 is the optimal balance of recall and precision. populated with boaters. The water in this area likely contains solutes and material that make it more spectrally similar to ground/soil than reflective, pure water. In another study area, some of these mixed water pixels would be grouped into the initial impervious/pure water cluster that was dropped in the masking step; the initial masking is sensitive to the boundary conditions of the study area and mixed pixels align with a class based on the full distribution of spectral values in the study area. Water and shadows are grouped together as a class because they are difficult for k-means and CNN to distinguish. Spectrally, the dark color of the water with its underlying vegetation signals is very similar to shadowed vegetation. Though classifying water is not a primary focus of this study, separating the two would help identify the types of vegetation under shadow. To further evaluate the water/shadow segment, we visualize the second most probable class with The first column shows the original imagery, the second column identifies water/ shadows (blue), and the third column identifies the second most likely land cover class after water/shadows for each pixel: ground/soil (pink), trees/shrubs (orange), or other (white). Grass is never initially identified as water/shadow. NAIP Digital Ortho Photo Image courtesy of USDA-FSA-APFO Aerial Photography Field Office.
https://doi.org/10.1371/journal.pone.0230856.g005 GMM (Fig 5). Because GMM is a soft clustering method, even though a given pixel may most likely be within the water/shadows class, it will also be within all other land cover classes with some probability. The secondary land cover class gives insight into the land cover under shadow or mixed in the water. In Fig 5, GMM shows the secondary classes to be a combination of ground/soil and trees/shrubs. In the river in Tile 3, the secondary class is exclusively ground/soil.
To better understand the spectral channels driving the segmentation pattern in the clustering methods, we evaluate the relative contribution of each of the ten channels through a Gini index ( Table 6). The Gini index gives a coefficient to each channel from zero to one. If all channels are equally represented in the final segments, the Gini coefficients would be zero. If a channel is more likely to be contained in a final segment, the coefficient approaches one. Among the channels evaluated in this study, NIR is the most important driver of the segmentation pattern, followed by green, red, and blue. All vegetation indices are much less important.
The Gini Index ranks the contribution of each channel to the final classification, given all channels present. Changing the set of available channels will change the Gini Index values. With the available channels, the Gini index indicates that segmenting through multi-dimensional clustering is an improvement over classifying through a single vegetation index, like NDVI. There remains opportunity to incorporate additional spectral channels and derived channels that are uncorrelated with RGB and NIR, like measures of texture, depth, or pixel neighborliness, to continue to improve classification accuracy of the clustering methods.
Using the same ten channels, GMM performs worse than k-means. While k-means has a mean classification accuracy of 80%, GMM has a mean accuracy of 48% (Table 3). The GMM results have lower DBI scores than k-means, too. Further, GMM is much more computationally intensive than k-means and does not scale efficiently. Its only advantage is its ability to detect secondary class membership.
The CNN, with a mean classification accuracy of 49%, slightly outperforms GMM, but performs poorly compared to k-means. This is unexpected, given the accuracy of many reported CNN implementations. To further evaluate CNN performance, we evaluate k-means, GMM, and the CNN against an existing labeled, very high-resolution urban land cover dataset [34] with three spectral bands (R, G, B) and find that the CNN (82.42% accuracy) performs significantly better than GMM (49.39% accuracy) and k-means (43.73% accuracy) when examining built urban features ( Table 7).
The much higher accuracy on built urban features is consistent with the more common use of CNNs for delineating constructed urban features, like roads and buildings [35,36], which have higher internal homogeneity and stronger edges. One of the features of many CNNs, including the one in this study, is the use of SLIC superpixels. The superpixels force internal homogeneity onto the segments. Internal homogeneity is usually a desirable attribute of a segment because it reduces salt-and-pepper effects, but this is less applicable to patchy vegetation.
In contrast, k-means or GMM does not account for neighborliness when assigning clusters, allowing for salt-and-pepper effects. While the CNN shows far better accuracy on built urban features with three-band imagery, for urban vegetation on multispectral imagery, the backpropagating CNN with SLIC superpixels does not offer sufficiently better performance to justify its much more complex architecture. One of the most significant challenges of this study is defining the best evaluation criteria with which to compare algorithmic performance. There is still further work to be done on developing evaluation criteria for unsupervised segmentation and classification. DBI and benchmarking against hand-coded data sometimes give contradictory results. For k-means, DBI indicates that its best performance is on Tile 3, while benchmarking indicates its worst performance is on Tile 3. For GMM, DBI results indicate that it performs very poorly on Tile 2, but benchmarking indicates its performance is on par with the other tiles. Because DBI measures segment uniformity and disparity, it may not be an adequate metric in the context of complex, heterogeneous urban vegetation, where segments may be neither uniform nor distinct.
Yet, despite shortcomings of DBI, an equally significant weakness is generating accurate labeled data for benchmarking. When hand-coding data, we can only visualize three bands simultaneously. A false-color composite can be used to visualize NIR, which the Gini Index indicates is very important in determining class membership (Table 6), but it is still difficult to visually separate all land cover types. While the advantage of hand-coding is that we implicitly incorporate shape into object labeling, perceiving objects like trees, some variation (e.g., between grass and soil) is more difficult to visually distinguish. Further, NAIP is still sufficiently coarse that it is difficult to distinguish objects less than 2 meters in diameter (following Myint et al [8]'s discussion of image resolution and object recognition). As a result, the labels we create for the data may not be perfectly accurate. This issue is echoed in a study from the California Data Collaborative [37], which compared commercially classified residential landscape data to hand-coded NAIP imagery, finding low rates of agreement among algorithms and hand-coding. But the California Data Collaborative also found low rates of agreement among different hand-coders reviewing the same parcels. The discrepancy between handcoded data reveals that even though NAIP imagery offers the highest spatial resolution for publicly available data, it is coarse enough that it can be difficult for hand-coders to reliably classify parcel-level land cover. Physically ground-truthed data would provide a better benchmark but is difficult to scale. To further evaluate algorithmic performance, segmentation and classification should be tested across a greater variety of biophysical and urban conditions. It is likely that each method will be useful in certain contexts, when segmenting particular urban landscape conditions. Additional channels may improve k-means and GMM performance. Alternative CNN architectures and parameters may improve CNN performance.
It is possible that the most efficacious unsupervised segmentation and classification method for identifying small and patchy urban vegetation will combine aspects of all three algorithms evaluated in this study. There is opportunity to consider how principals of GMM (e.g., "seeing" under shadows) and k-means (e.g., ragged boundary definition through many channels) could be brought into a CNN architecture. Evaluating algorithmic performance will remain an ongoing challenge, however, especially in shadowed regions. It is likely that data fusion will be necessary, up-sampling or down-sampling existing labeled data [15,38]. If high-quality labeled data is available, however, it would be better to use supervised methods. It will always be most useful to apply unsupervised methods when there is an insufficient amount of labeled data.

Conclusion
Though supervised learning methods are currently popular and can be very accurate for delineating some urban features, the burden of training data constrains the use of less common or emerging data sets with no available training data. We examine unsupervised segmentation of urban vegetation features from NAIP multispectral aerial imagery. We use two clustering methods, k-means and GMM, and a backpropagating CNN with SLIC superpixels. Each method has advantages: k-means with k-means++ seeding has the highest levels of segmentation accuracy for most landscape types and is both easy and efficient to implement; GMM provides the best information through shadowing; and the CNN, while underperforming on urban vegetation, has high accuracy on built features with hard edges and internal homogeneity. Yet, for segmenting urban vegetation from multispectral imagery without training data, kmeans is the most accurate, simplest, and most efficient.