Comprehensive marine substrate classification applied to Canada’s Pacific shelf

Maps of bottom type are essential to the management of marine resources and biodiversity because of their foundational role in characterizing species' habitats. They are also urgently needed as countries work to define marine protected areas. Current approaches are time consuming, focus largely on grain size, and tend to overlook shallow waters. Our random forest classification of almost 200,000 observations of bottom type is a timely alternative, providing maps of coastal substrate at a combination of resolution and extent not previously achieved. We correlated the observations with depth, depth derivatives, and estimates of energy to predict marine substrate at 100 m resolution for Canada's Pacific shelf, a study area of over 135,000 km². We built five regional models with the same data at 20 m resolution. In addition to standard tests of model fit, we used three independent data sets to test model predictions. We also tested for regional, depth, and resolution effects. We guided our analysis by asking: 1) does weighting for prevalence improve model predictions? 2) does model resolution influence model performance? and 3) is model performance influenced by depth? All our models fit the build data well, with true skill statistic (TSS) scores ranging from 0.78 to 0.82. Weighting models by class prevalence improved fit and the correspondence with known spatial features. Class-based metrics showed differences across both resolutions and spatial regions, indicating non-stationarity across these spatial categories. Predictive power was lower (TSS from 0.53 to 0.68) when models were evaluated against independent data. Model performance was also a function of depth and resolution, illustrating the challenge of accurately representing heterogeneity. Our work shows the value of regional analyses for assessing model stationarity, and how independent data evaluation and the use of error metrics can improve understanding of model performance and sampling bias.


Introduction

Understanding how marine species are distributed is central to ecosystem-based fisheries management, vulnerability assessments, protected area design, and other marine spatial planning activities such as aquaculture siting and oil spill response. This understanding relies on models of habitat suitability [e.g., 1, 2], the credibility of which depends in large part on the accuracy of the underlying environmental predictors. While often described as habitat in the marine classification literature [e.g., 3], substrate is more accurately described as a key determinant of habitat for benthic species [4,5]. However, unlike predictors of ocean dynamics or chemistry (now regularly obtained via remote sensing or ocean circulation models), reliable descriptions of bottom type continue to elude researchers due to both sampling and analytic challenges. As countries continue to increase their protection of marine spaces [6,7], effective management will require this knowledge gap to be filled efficiently and quantitatively [8].

In this contribution, we used a random forest classification to develop a substrate model for Pacific Canadian shelf waters at two spatial resolutions (20 m and 100 m). We extended the …

Methods

… study area is about 1/5th the size but with 25x better resolution, giving a data footprint (75 million pixels) 5 times larger. Our 20 m regional models range from 91 to 470 million pixels.

We built a 100 m (100 × 100 m) coastwide model for the Canadian Pacific continental shelf.
To improve the quality of habitat models for species close to shore and to extend the predictions across the white strip, we built five nested regional, nearshore models at 20 m resolution (20 × 20 m) (Fig. 1). We also built paired models to test for the effect of class prevalence. This collection of models allowed us to examine the relative performance of models across resolutions, regions, and depths.

We tested model fit by partitioning the data used to build the models (the build data, Table 1) into training and testing partitions. We tested the predictive power of our models using three independent data sets, collected separately.
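For orientation, the full model set implied by this design can be summarized in a few lines of R. This is our illustrative sketch, not the authors' code, and the labels are ours:

```r
## The model set: one 100 m coastwide model plus five 20 m regional models
## (HG, NCC, QCS, SOG, WCVI), each fit with and without class weighting.
models <- expand.grid(
  region   = c("Coast", "HG", "NCC", "QCS", "SOG", "WCVI"),
  weighted = c(TRUE, FALSE)
)
models$resolution_m <- ifelse(models$region == "Coast", 100, 20)
```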

Our predictor variables (Table 2) are described below (see Predictor data). All observational and predictor data were prepared using ArcGIS [49] before being imported into R, where we joined all observational data with predictors scaled to coastwide (100 × 100 m) and regional (20 × 20 m) grids (i.e., rasters) before analysis. Predictors were not transformed or assessed for correlation because the random forest approach is largely robust to non-normal, correlated predictors [44,45]. Where multiple observations occurred in the same raster cell, the predictor values were duplicated to preserve the observational sample size; this was more common for the coastwide model.

We used the same number of trees (1,000) and test fraction (0.6) for all ranger models. We set variable importance to use the Gini index for classification, and the internal cross-validation to sample with replacement. We applied the standard data-partitioning approach of dividing the build data into training (67%) and testing (33%) partitions. We used the same training and testing partitions to build all the models, assess model fit, and explore the effects of depth and model resolution. Models were further assessed using independent data evaluation (IDE) to test predictive power, examine sample bias, and test for stationarity.

Addressing our objectives

The effect of class prevalence on habitat models has been known for some time [e.g., 52, 53]. It is also a significant area of research in the machine learning community [51], where well-balanced classes are encouraged. Finding little on this topic in the marine substrate classification literature, we tested for this effect by constructing a parallel set of models with and without class-size weighting. We weighted classes according to their prevalence in the training data, and compared their performance using both the build testing sample and the independent data sets. To assess whether different classes were more reliably predicted with class weighting, we compared class-based statistics across all models. We also considered whether any observed differences were influenced by model resolution, as well as by region and depth.
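A minimal sketch of this setup in R follows. It is not the authors' code: the object names are hypothetical, and we interpret the reported "test fraction" of 0.6 as ranger's sample.fraction argument.

```r
library(ranger)

## Assume `build` is a data frame with a factor column `substrate`
## (Rock, Mixed, Sand, Mud) plus the predictor columns of Table 2.
set.seed(1)
idx   <- sample(nrow(build), size = round(0.67 * nrow(build)))  # 67/33 split
train <- build[idx, ]
test  <- build[-idx, ]

## Class weights from training prevalence: rarer classes get larger weights.
prev <- prop.table(table(train$substrate))

fit <- ranger(
  substrate ~ ., data = train,
  num.trees       = 1000,
  importance      = "impurity",           # Gini index for classification
  replace         = TRUE,                 # sample with replacement
  sample.fraction = 0.6,                  # our reading of the "test fraction"
  class.weights   = as.numeric(1 / prev)  # omit for the non-weighted variant
)
pred <- predict(fit, data = test)$predictions
```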
It is reasonable to expect that physical and ecological processes will differ across regions with different spatial contexts. However, the detection and interpretation of this non-stationarity is rarely attempted. We tested whether our models performed better in some regions than others, and whether this was affected by model resolution. In particular, we were interested in whether a single coastwide model would perform similarly across large, physiographically distinct regions, and whether model resolution influenced performance across depths. This type of spatial analysis lends itself to tests for ecologically relevant spatial structure [40], and we use it here to assess how model performance might differ across space.

We compared model performance across regions and depth classes. To test for stationarity at the regional scale, we examined how the coastwide model performed in each of five regions (Fig. 1). These ranged from the sheltered, largely muddy Strait of Georgia (SOG), dominated by siltation from the Fraser River, to the exposed West Coast of Vancouver Island (WCVI) and the oceanographically distinct islands of Haida Gwaii (HG). The North Central Coast (NCC) includes deep fjords and inlets as well as a large exposed coastline, while the transitional Queen Charlotte Strait (QCS) region includes sheltered and exposed areas, fjords, and inlets. The many fjords are effective sinks for fine sediment and limit its distribution across the shelf [54].

We tested our hypothesis that model performance is correlated with depth by comparing model fit and predictive performance across a range of depth classes. We classified depths following …

Model evaluation

We first assessed how well the models explain the testing partition of the build data (i.e., the model fit), and compared the weighted and non-weighted models to illustrate the effects of class prevalence. We used the same training and testing data partitions to test the fit of all models to standardize model comparison: because the testing partition has the same context as the training data, it can be considered an unbiased and fully stationary test sample. These results thus provided a baseline for assessing the effects of sampling bias during the subsequent IDE used to assess model predictive power. IDE is recognized as the gold standard for evaluating model predictions.

We use Normalized Accuracy and True Negative Rate (TNR) to provide information on false positives and false negatives. These metrics of better-than-random performance are expressed relative to a no-information baseline (see Supplementary Methods S1). We report these metrics both in aggregate (combined across classes) and at the class level. This combination of accuracy and error assessment provides a more complete picture of model performance than is commonly provided in such analyses. Finally, we offer Imbalance as an integrated measure of prevalence in a multi-class data set. See Supplementary Methods S1 for additional details.

To complement the quantitative assessment, we examined the spatial agreement of our predictions with two well-known areas of the Pacific coast: Pacific Rim National Park Reserve in the WCVI region, and English Bay, part of the urban coast of Greater Vancouver, in the SOG region. This qualitative comparison adds valuable spatial context to the assessment of the substrate predictions by allowing the patterns produced by different models to be compared.
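For concreteness, the class-based metrics can be computed from a confusion matrix along the following lines. This is our sketch; the paper's exact formulations, including the normalization against the no-information baseline, are given in its Supplementary Methods S1.

```r
## Per-class true positive rate, true negative rate (TNR), and
## TSS (= TPR + TNR - 1) from observed and predicted labels.
class_metrics <- function(obs, pred) {
  pred <- factor(pred, levels = levels(obs))  # keep the matrix square
  cm   <- table(obs, pred)
  n    <- sum(cm)
  sapply(levels(obs), function(k) {
    tp <- cm[k, k]
    fn <- sum(cm[k, ]) - tp
    fp <- sum(cm[, k]) - tp
    tn <- n - tp - fn - fp
    c(TPR = tp / (tp + fn),
      TNR = tn / (tn + fp),
      TSS = tp / (tp + fn) + tn / (tn + fp) - 1)
  })
}
```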

Model build data

We assembled a coast-wide data set of 197,587 observations (Table 1; see also Table S1).

Predictor data

Environmental predictors were selected to include a combination of benthic terrain features derived from bathymetry [e.g., 15, 58] and measures of energy [e.g., 2, 59]. To support our analysis across two spatial resolutions, we derived the same terrain features from a 100 m coastwide bathymetry and our 20 m regional bathymetries. These included slope, curvature, rugosity, the standard deviation of slope, and three bathymetric positioning indices (BPIs) with increasing neighborhood sizes (Table S2).
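As an illustration only (the authors derived these layers in ArcGIS), comparable terrain derivatives can be produced from a bathymetry raster with the terra package; the BPI below is a simple focal-mean difference, and the neighborhood radii are placeholders rather than the values in Table S2.

```r
library(terra)

## Assume `bathy` is a SpatRaster of depth at 20 m or 100 m resolution.
slope    <- terrain(bathy, v = "slope")      # degrees by default
rugosity <- terrain(bathy, v = "roughness")  # one simple rugosity proxy
sd_slope <- focal(slope, w = 3, fun = sd)    # standard deviation of slope

## Bathymetric position index (BPI): cell value minus the neighborhood mean.
## focalMat weights sum to 1, so the focal sum is a weighted neighborhood mean.
bpi <- function(x, radius_m) {
  x - focal(x, w = focalMat(x, radius_m, "circle"), fun = sum)
}
bpi_fine   <- bpi(bathy, 250)   # the three radii are placeholders
bpi_medium <- bpi(bathy, 1000)
bpi_broad  <- bpi(bathy, 5000)
```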
Independent evaluation data

We sourced independent data from three sets of surveys with different passive observation techniques: dive surveys, drop cameras, and remotely operated video (ROV). All three data sets had consistently coded substrate classes, making it easy to reclassify them to our four substrate classes (Table S3; see also Table S1).

Results

Our results show how class weighting to address sample imbalance can lead to both numerical and spatial improvements in model performance. We also confirm the existence of regional non-stationarity, and show that model performance is correlated with depth. The results are presented in detail in the following sections.

Model development and class weighting

All models, weighted or not, had comparably high fit to the data (TSS values ranging from 0.78 to 0.82), with no notable differences between the coastwide and the regional models in the aggregated metrics (Table 3). Accuracy was consistently lower than TNR; however, this is expected given their different no-information baselines. The benefits of weighting for prevalence begin to emerge when the class-based metrics are examined (Fig. 2). Comparing the effect of weighting using class-based metrics (Fig. 2) showed Accuracy shifting away from Rock to the other classes, particularly Mixed and Mud, for all models. There was a corresponding increase in the Reliability score of the Rock class at the expense of the other classes, though this was more variable across models. The TNR changed less than Accuracy in response to weighting, contradicting the observation by Allouche et al. [52] that prevalence has a greater influence on TNR than Accuracy. Across classes, Rock and Mud had the highest Accuracy and Reliability scores, and the Mixed class had the lowest across all models, regardless of weighting (Fig. 2), confirming that this class gives the models the most difficulty. However, its higher TNR shows it was rarely predicted in error, perhaps in part due to its relatively small sample size (Fig. 3, Table S3).

Fig. 2 (caption, partial). … Reliability for weighted (left column) and non-weighted (right column) models. The color shading within each row reflects the underlying numbers from high (red) to low (blue) and is included to enhance the differences between values.

The effect of weighting was notable in the error assessment, where the Quantity errors of all non-weighted models (0.07 to 0.10) were about twice those of the corresponding weighted models (0.03 to 0.06) (Table 3). The majority of model error came from Exchange between classes. The reduction in Quantity error achieved by weighting tended to be offset by a corresponding increase in Exchange error. This explains why TSS and Accuracy were largely unchanged by weighting (Table 3).
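Quantity and Exchange errors of this kind are components of a confusion-matrix difference decomposition. The sketch below is our interpretation of how such components can be computed; the paper's exact definitions are in its Supplementary Methods S1.

```r
## Difference components from observed vs predicted labels: Quantity is the
## mismatch in class proportions, Exchange is paired swapping between classes,
## and Shift is the remainder of the total disagreement.
diff_components <- function(obs, pred) {
  pred <- factor(pred, levels = levels(obs))
  p    <- as.matrix(prop.table(table(obs, pred)))
  total    <- 1 - sum(diag(p))
  quantity <- sum(abs(rowSums(p) - colSums(p))) / 2
  swap     <- pmin(p, t(p)); diag(swap) <- 0
  exchange <- sum(swap)
  c(Total = total, Quantity = quantity,
    Exchange = exchange, Shift = total - quantity - exchange)
}
```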
The reduced Quantity error due to class weighting was also evident when the prevalence of the build testing partition was compared to the model predictions (Fig. 3). The over-prediction of the Rock class by the non-weighted models, as well as the consistent under-prediction of the Mixed class, were clearly mitigated by class weighting. Changes in the prevalence of the Sand and Mud classes were more equivocal. More importantly, these changes in predicted prevalence were evident in small but significant shifts in the spatial distribution of the classes, with the weighted model producing a pattern that more accurately reflected known nearshore substrate in our two test locations (Fig. 4). The predictions from the weighted model produced less Rock and more Sand on known beaches in Pacific Rim. Changes in English Bay were less apparent, but a shift from Rock to Mixed is evident. In deeper waters of Pacific Rim, the weighted model predicted less Sand and more Mixed substrates, although bathymetric artefacts were enhanced.

Variable importance (Fig. 5) differed notably across models despite similarities in the aggregate metrics (Table 4), suggesting different processes were at play. Our indicators of ocean dynamics (circulation and tidal) were the top two predictors for the coastwide model, followed by bathymetry. Slope, broad- and medium-BPI, and the standard deviation of slope provided almost equal contributions. In contrast, bathymetry was the dominant predictor for all regional models except HG, where bathymetry was second to fetch. Fetch was also important in the WCVI region, the other region we presumed would be influenced by exposure. Fetch, broad-BPI, and tidal flow rounded out the top four variables for the regions. Despite such rankings, the contributions of predictors can be very similar, particularly among those contributing least. For example, in the NCC region, four of the least influential predictors have virtually equal model contribution scores (Fig. 5).

Caption fragment: … were separated using a regional shape file, whereas in Table 1 the points were selected using regional 20 m rasters.

Fig. 5 (caption, partial). … The proportion of each predictor's contribution to the model is shown relative to the predictor with the highest contribution (p/max(p)). The color shading within each row reflects the underlying numbers from high (red) to low (blue) and is included to make the differences in the values more apparent. NA indicates that Fetch was not used as a predictor for the Coast model.

Model resolution

The most significant predictors for the 100 m coastwide model were related to ocean energy, at a native scale of 3 km, while the regional 20 m models were closely tied to bathymetry and potential wave energy. Yet comparing the 100 m and 20 m models shows little difference in either the aggregated or class-based metrics between resolutions (Table 4, Fig. 2), and there is no suggestion that the 100 m model is skewed towards any particular region (Table 4). However, the differences in mapped predictions are dramatic, with the 100 m model showing an over-prediction of nearshore Rock in both focal areas, despite class weighting (Fig. 4). The mapped predictions from the 100 m model are also more homogeneous than those of the 20 m models, which reduces artefacts but also highlights the inability of the coarser model to represent local heterogeneity.
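The relative importance scores reported in Fig. 5 (p/max(p)) can be recovered from a fitted ranger model in one step; a sketch, assuming the hypothetical `fit` object from the earlier example:

```r
## Gini-impurity importance, rescaled so the top predictor scores 1.
imp <- ranger::importance(fit)
sort(round(imp / max(imp), 2), decreasing = TRUE)
```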

Performance across depths
The 100 m model shows a clear trend of increasing TSS with depth compared to the 20 m models based on the testing partition, although results were variable across regions (Fig. 6a). Specifically, the trend is clear for HG and NCC, absent for SOG, and uneven for WCVI and QCS. This pattern is driven by the higher Accuracy of the 20 m models in the intertidal and 0-5 m depth zones (Fig. 7a) across all regions, except the SOG intertidal. The 100 m model has a consistently higher TNR, particularly in the 0-5 m depth zone (Fig. 6b). This can be attributed to the large, homogeneous Rock predictions of the 100 m model in shallower waters (Fig. 4), which allow the model to benefit from a higher no-information baseline (0.75) than the 20 m model.

The corresponding error assessment for the two resolutions also shows Accuracy increasing with depth for the 100 m model across most regions (Fig. 7a) but generally decreasing with depth for the regional models (Fig. 7b). Across all models, there is a tendency towards increased Quantity error with depth, but most of the error comes from Exchange between classes.

The role of resolution in the correlation between model performance and depth is further supported by the IDE (see below), and is also apparent in the mapped predictions (Fig. 4, Fig. 8). The tendency of the 100 m model to predict more contiguous classes and to over-predict Rock near shore is evident in all three of our test regions (Fig. 4e, f, Fig. 8b). However, the 100 m model also captures known features in deeper waters, in particular the canyons in Queen Charlotte Sound and the shelf edge, which were not identified by the 20 m models (Fig. 8).

Fig. 8 (caption, partial). … identifies known features at depth (e.g., Moresby Canyon) not captured by the 20 m models. The regional models shown here were produced using unclipped predictor variables to allow model performance in deeper waters to be examined. See text for details.

Independent data evaluation

The independent data were not consistently distributed across regions: the Dive data were distributed most broadly, while the ROV data were limited to HG and NCC (Table 5, Fig. S1). Not unexpectedly, model performance scores were generally lower and more variable than for the build testing data. The TSS for the 100 m coastwide model was similar (0.53 to 0.56) against all three independent data sets (Table 5). All regional models scored higher TSS scores (> 0.60) against the Dive and ROV data, while for the Cam data scores varied by region. For each independent data set, accuracies were as much as 50% higher for the regional models than for the coastwide model (Table 5). The TNR scores were highest for the ROV data, followed by the Cam data, and notably lower for the Dive data (Table 5).

Overall, the Dive and Cam data were predicted with the highest and lowest Accuracy, respectively. Accuracy was inversely correlated with Imbalance, which (while variable across regions) was lowest for the Cam data and highest for the Dive data (Table 5). The lower TNR when predicting the Dive data is reflected in the Exchange error (Table 5), indicating that much of the misclassification is the swapping of two categories.

Examining the aggregated scores adjusted to their random baselines (Fig. 9) provided a more accurate representation of predictive power.
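A simple version of such an adjustment rescales a raw score against its no-information baseline, so that 0 means no better than random and 1 means perfect; this is our sketch, with the paper's exact normalization in Supplementary Methods S1.

```r
## Baseline-adjusted (normalized) score.
adjust <- function(score, baseline) (score - baseline) / (1 - baseline)
adjust(0.75, baseline = 0.5)  # raw accuracy 0.75 vs a 0.5 baseline -> 0.5
```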
The regional models generally outperformed the coastwide model (with adjusted TSS scores several times higher, and Accuracy almost double) for the regions with large sample sizes (HG, NCC). For the other regions, the regional models predicted the Dive data better than the coastwide model, but accurate prediction of the Cam data was variable. The ROV data were predicted with the most consistent Accuracy and TNR scores, while the Dive data were consistently predicted with both the highest Accuracy and the lowest TNR. These differences are at least in part due to the lower sample size, but may also reflect some spatial bias in the sampling.

The influence of resolution on the correlation between depth and model performance is also apparent when the predictive power of the coastwide 100 m model is compared to that of the 20 m models for the regions with sufficient independent data (Fig. 10). The IDE of the coastwide model (Fig. 10a) shows a clear increase in predictive accuracy with depth for both the Dive and Cam data, while the opposite pattern is evident in both the HG and NCC regions (Figs. 10a, b). The IDE of the ROV data is more equivocal, illustrating differences between independent data sets and emphasizing the need to understand the differences between the various data contexts.

Such differences can be attributed to both data preparation and collection. In the first case, the occurrence of Dive data deeper than 50 m (as implied by the 100 m model, Fig. 10a) is an artefact of large pixels in the coastal zone over-generalizing depths. This effect is also evident in the ROV data, which are typically not collected in waters shallower than 20 m. This misallocation of the data is not due to positional inaccuracy of the observations, because we screened the IDE data for coherence with the 20 m bathymetry. Rather, it is due to the resolution of the modelled bathymetry. Specifically, since a 100 m raster cell often covers a wide range of depths, particularly in steep topography, any associated observations are assigned a single depth value, even if collected at different depths. Thus, what can appear to be positional inaccuracy or bad data can actually be a function of analytic resolution.
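A toy example (ours, with synthetic values) makes the mechanism explicit: aggregating a 20 m bathymetry to 100 m forces every observation falling in a coarse cell to share one generalized depth, however steep the terrain.

```r
library(terra)

## Synthetic 20 m bathymetry over a 1 km x 1 km tile (depths in m).
set.seed(1)
bathy20 <- rast(xmin = 0, xmax = 1000, ymin = 0, ymax = 1000, resolution = 20)
values(bathy20) <- runif(ncell(bathy20), min = -200, max = 0)

## Each 100 m cell averages 25 fine cells, so co-located observations
## collected at very different true depths are assigned the same value.
bathy100 <- aggregate(bathy20, fact = 5, fun = mean)
```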

Discussion
Our results show that model reliability depends on depth, resolution, and substrate class, and potentially on the uniqueness (in predictor space) of the location in question. This means reliability will vary across the coast, with different resolutions and substrate classes being relatively more or less reliable in different locations. Understanding these differences will improve the confidence that can be placed in these models and their application to emergency response and predictions of habitat suitability. It will also help guide future data collection and support model refinement.

Depth and resolution

All our models fit the build data well; however, the predictive power of the regional 20 m models was universally better than that of the coastwide model.

Our results also support the view that substrate heterogeneity generally decreases with depth, and suggest that resolution-based differences in performance across depths (Figs. 6, 7) are influenced by the actual heterogeneity in different depth classes. For example, the higher accuracy of the 20 m models in shallower water arises because they can better capture nearshore heterogeneity. The more consistent fit of the 100 m model across depth classes in the SOG region can then be explained because the region is a marginal, relatively shallow sea, dominated by mud in deeper waters [63], with a more homogeneous coastline than the other, more exposed regions. This highlights the importance of considering process stationarity (see following section).

Nevertheless, when true heterogeneity does decrease with depth, it will be better represented by higher resolution near shore and coarser resolutions in deeper waters. The 20 m models can therefore be considered more appropriate for shallower areas, and the 100 m model better at resolving key features in deeper waters. More specifically, the correlation of predictive power with depth (Fig. 7) suggests that the 20 m models will be more reliable for nearshore studies to about 50 m depth, while the 100 m model will be more reliable in deeper areas.

To assess whether sampling density contributed to model performance, we examined model fit to the data spatially, looking for patterns in accuracy (not shown). We found sample density to be highly correlated with depth (not surprising given the sampling context of much of the build data), making it impossible to disentangle the effect of density from depth with our available observations. While understanding the role of sample density would provide important information for the design of sampling programs, accuracy is likely to be maximized when sample density is the same as or better than the analytic resolution. Ideally, the choice of resolution would include an explicit rationale to ensure the resulting product is not misinterpreted by drawing inferences at inappropriate resolutions.

Analytic resolution also influences process stationarity because processes are scale-dependent (Wiens 1989). This means coarser models will better represent more averaged conditions, which by definition have less variability and correspondingly higher stationarity. Thus, our observed differences between the 20 m and 100 m models are also due, in part, to the different processes captured by the different resolutions.

Process stationarity
Correctly representing driving processes is central to predictive power (Austin 2002). However, the reliability of such representations across a seascape depends in part on the assumption of stationarity, a typically tenuous assumption, particularly across larger extents [64]. In this analysis, both the class-based results (Fig. 2) and variable importance (Fig. 5) showed strong evidence for non-stationary processes across regions, while non-stationarity across both regions and depths is suggested by differences in the aggregated metrics of model fit (Figs. 6, 7) and the error assessment (Fig. 9).

There are obvious reasons for non-stationarity across regions. For example, the SOG is dominated by sedimentation processes (e.g., runoff) that in more exposed regions would interact more strongly with water energy. Elsewhere, the unique characterization of variable importance in the NCC likely reflects its geomorphic context, where the competing dominance of tidal energy in channels and wind-wave energy (fetch) in inlets creates within-region differences. Finally, predictions of Sand in exposed coastal areas of the WCVI may be made more difficult because very local processes differ from an otherwise strong association between Rock and high fetch. Other (unrepresented) factors, such as freshwater input and upland geology, will also differ both within and across regions. Given that these processes play out on a crenulated coastline carved up by deep, narrow fjords, it is not surprising that they depend on local context.

Such local processes cannot be teased out by classification models. Instead, they are generalized across the model domain according to the prevalence and spatial distribution of the observations. We argue that model reliability is higher in places where predictor values are closer to the center of their ranges, thereby avoiding boundary conditions. This means oceanographic uniqueness will likely be correlated with poor model quality.

Bias and accuracy
Models of substrate, habitat, and climate are regularly built on other models. Thus, any layers used in this way (e.g., bathymetries, current models, remotely sensed primary production) will have their own artefacts, uncertainties, and limitations. Careful consideration of data sampling and processing steps, such as averaging, aggregation, or interpolation methods, is therefore warranted. This is most evident in this study in the assignment of over-generalized depths to the …