Semi-Automated Image Analysis for the Assessment of Megafaunal Densities at the Arctic Deep-Sea Observatory HAUSGARTEN

Megafauna play an important role in benthic ecosystem function and are sensitive indicators of environmental change. Non-invasive monitoring of benthic communities can be accomplished by seafloor imaging. However, manual quantification of megafauna in images is labor-intensive, and this organism size class is therefore often neglected in ecosystem studies. Automated image analysis has been proposed as a possible approach to such analysis, but the heterogeneity of megafaunal communities poses a non-trivial challenge for such automated techniques. Here, the potential of a generalized object detection architecture, referred to as iSIS (intelligent Screening of underwater Image Sequences), for the quantification of a heterogeneous group of megafauna taxa is investigated. The iSIS system is tuned for a particular image sequence (i.e. a transect) using a small subset of the images, in which megafauna taxa positions were previously marked by an expert. To investigate the potential of iSIS and compare its results with those obtained from human experts, a group of eight different taxa from one camera transect of seafloor images taken at the Arctic deep-sea observatory HAUSGARTEN is used. The results show that inter- and intra-observer agreements of human experts exhibit considerable variation between the species, with a similar degree of variation apparent in the automatically derived results obtained by iSIS. Whilst some taxa (e.g. Bathycrinus stalks, Kolga hyalina, small white sea anemone) were well detected by iSIS (i.e. overall Sensitivity: 87%, overall Positive Predictive Value: 67%), some taxa such as the small sea cucumber Elpidia heckeri remain challenging for both human observers and iSIS.


Introduction
Despite recent advances in technology and increased efforts to ''Census the Marine life'', the deep ocean floor remains the largest and yet least explored ecosystem on Earth [1]. Deep benthic communities are characterized by a high species diversity, which reflects a much larger regional pool of species than in shallow waters [2], constituting a pool of transient potential immigrants to other areas [3]. Megafauna play an important role in benthic ecosystems and contribute significantly to benthic biomass [4][5][6], particularly in the Arctic [7]. Benthic megafauna are defined as the group of organisms inhabiting the sediment-water interface, exceeding 1 cm diameter [8,9]. Megafaunal organisms increase habitat heterogeneity as they create pits, mounds, tracks and traces in the sediment. Erect biota, such as sponges, bryozoans and coral, increase three dimensional habitat complexity and provide shelter from predation [10,11]. Megafauna can therefore increase the diversity of smaller sediment-dwelling biota in otherwise largely homogenous soft-bottom environments of the deep-sea [12][13][14]. In addition, megafaunal predators control the population dynamics of their prey and are thus important in determining benthic food webs and community structure [15][16][17][18][19]. They also contribute considerably to benthic respiration and affect the physical and biogeochemical micro-scale environment [20][21][22][23][24][25][26]. It is also important to note that deep-sea benthic megafauna sequester carbon through the continuous redistribution of organic matter, oxygen and other nutrients within surficial sediments [23,27].
While time series data on megafaunal dynamics over longer scales are still scarce [12,28-31], multi-year time-series studies from the Porcupine Abyssal Plain and the northeast Pacific have attributed megafaunal changes to environmental and climate variation [32,33]. To date, most studies on megafaunal assemblages in the Arctic represent single snapshots in time, scattered over different basins [7,34-43]. Although such studies provide important biogeographic information, there is currently a serious gap in the knowledge of the temporal dynamics of megafaunal assemblages from these northern latitudes over longer time spans. The HAUSGARTEN observatory [44], established in 1999, represents an important step forward in temporal investigation of the polar region, with large volumes of data collected from the observatory on a regular basis, consisting of both oceanographic data and repeated video and still image collection from a number of fixed survey transects.
Conventionally, megafaunal assemblages are investigated by bottom trawls [45,46]. However, such gears have low and/or variable catch efficiencies for different organisms [47,48] and are invasive. In recent years, towed camera systems have become a key method to determine the density and distribution of deep-sea megafauna [29,40,42,49-52]. Although visual surveys are limited to species that are large, epibenthic and non-evasive, they enable the study of the seafloor on a range of scales from centimeters to kilometers without disturbing habitats [53,54]. Large scale analysis is important, as deep-sea megafauna species are often characterized by rare or aggregated occurrence [43,55]. Furthermore, this method allows repeated observations of defined tracks, both minimizing the noise produced by spatial variation and allowing time series analysis. Inevitably, the application of imaging techniques generates large quantities of digital image material. Particularly large volumes of footage accumulate in the archives of institutions that run modern remotely operated and autonomous underwater vehicles. The analysis of these images constitutes a bottleneck, since the evaluation of one image with a footprint of 3-4 m² can take 30-60 min or longer, requires training, is subjective and potentially error-prone [56]. Indeed, similar taxonomic classification tasks yielded human consistencies as low as 67-83% (intra-observer) and ≤43% (inter-observer) [56,57]. To solve this bottleneck problem, computational approaches for taxon detection and classification have been proposed in different contexts. Until now, a number of these are restricted to controlled environments [58,59] or the detection of manufactured objects [60][61][62], or are designed to work specifically in the water column [58,63], where no sediment (i.e. background) has to be distinguished from the taxa investigated.
In the majority of published cases, a single taxon or a group of similar taxa [56,63-68] is studied and taxon-customized features are utilized. In other studies, whole images are classified [69] or seafloor images are segmented and each segment is classified automatically afterwards [70][71][72][73].
To quantify a heterogeneous group of megafauna successfully with one system, a flexible software approach is needed that can be applied to taxa exhibiting a variety of features, such as differing morphologies or colors. The iSIS (intelligent Screening of underwater Image Sequences) system was developed with such an approach in mind, utilizing a generalized pattern recognition approach for the semi-automated quantification of megafauna in transect data collected at HAUSGARTEN. The approach is referred to as general, since no explicit heuristics were used to design and optimize the algorithmic detection of individual taxa. The taxonomic scope of the system is set to a user-defined group of taxa. These groups are defined in the system by a hand-labelled training set of images with marked positions for the taxa. In this way, the user (e.g. a marine biologist) can use her/his primary visual expertise to tune and extend the system without requiring deeper knowledge of the image-processing algorithms. Although the pre-processing and the taxa detection in iSIS run fully automatically, the system is characterized as semi-automatic because it is trained using these manually identified taxa from within a small image subset of the full transect. In this article we describe the iSIS architecture and present its application to transect data collected at a HAUSGARTEN station. The accuracy of the taxa detection is assessed using a gold standard of taxa positions in 70 images, generated from position labels placed by five experts who evaluated the 70 images manually.
The paper is organized as follows: In the next section, we first introduce the image data used in this study. Afterwards, the position labeling study carried out by the five independent experts is described. The remainder of the section deals with the algorithmic details of the iSIS system. In the results section, the findings of the human position labeling experiment, the pre-processing step and the learning and detection performance of iSIS are presented and discussed.

Materials and Methods
The deep-sea observatory HAUSGARTEN [44] is located in the eastern Fram Strait west of Svalbard, the only deep-water connection between the Atlantic and Arctic Ocean proper (Figure 1). No specific permits were required for the described field studies as the data were obtained outside national waters. The location of HAUSGARTEN is not privately owned or protected in any way as it is outside the exclusive economic zone of any nation. To our knowledge our study did not involve any endangered species, and given the remote photographic nature of the data collected, no negative impact on biota was made.
HAUSGARTEN comprises nine sampling stations along a bathymetric gradient (1200-5500 m). A latitudinal transect crosses at the central HAUSGARTEN station IV, which serves as an experimental area for long-term experiments and measurements [74][75][76][77][78][79][80][81]. In 2002, the AWI started regular towed camera observations of the HAUSGARTEN stations during expeditions of the research icebreaker RV Polarstern. To capture images from the seafloor, an ''Ocean Floor Observation System'' (OFOS) was deployed at different stations with water depths between 1200 and 5500 m (for details see [43]). The OFOS is a towed camera system and its altitude is affected by waves, currents, bottom topography and skill of the winch operator.

Figure 3. The combination of human labels to gold standard labels. The left image shows a small white sea anemone with two human labels (as circles), which is not enough to create a gold standard label as a supporter count of $k \geq 3$ was required (see text for details). The image in the middle shows a Kolga hyalina labeled by $k = 5$ experts and its resulting gold standard label in between (as a cross). The right image shows a Bathycrinus carpenterii with human labels for the crown (blue) as well as the stalk (yellow). Both human label cliques have $k \geq 3$ supporters and thus two gold standard labels are created. doi:10.1371/journal.pone.0038179.g003

Table 1. The taxa with their human and gold standard label amounts. Gold standard labels are computed as the centroid of a group of closely neighbouring human labels of the same taxon. Only groups with $\geq 3$ human labels were taken into account. The background labels were randomly distributed and were all used as gold labels. Additionally, the inter- and intra-observer agreements are given by average and standard deviation (std-dev.) for the five experts. doi:10.1371/journal.pone.0038179.t001
From 2002-2008, more than 45,000 images were taken by an analogue camera, with these images then digitized at a resolution of 3504×2336 pixels. The images were then made accessible in the BIIGLE online platform for browsing and taxa annotation [82]. A number of benthic megafauna experts have acquired BIIGLE accounts in the last two years and to date have labelled >350,000 objects in >12,000 images. For this study, one transect of intermediate water depth was chosen (HAUSGARTEN IV, 2500 m [31]), which has been successfully visited four times by Polarstern to date (2002, 2004, 2007, 2011). During each campaign, some 700 images were taken. In all images, a field of view of 1500×1800 pixels at position x = 1800, y = 300 was selected, to exclude the image region covered by the OFOS forerunner weight and the camera time stamp.
The OFOS operator tried to maintain the OFOS at a uniform 1.5 m height above the seafloor, resulting in a real-world footprint of 1.2-8.5 m² per image with an average of 3.77 m² across the entire transect. The OFOS altitude varied throughout the entire transect as the winch operator adapted to bottom topography and sea state, resulting in variable lighting conditions, with overexposed images produced when the OFOS was too close to the seafloor, and almost black, poorly illuminated images produced when the OFOS was too distant from the seafloor. Some 10% of the images of a transect showed no signal contrast at all and were excluded from this study. The remaining images showed a decrease in lighting and contrast towards the image corners, i.e. a vignette effect.

Human Expert Labeling
The basic idea behind the iSIS architecture is that a general machine learning based object detection system acquires the knowledge of the structural features of objects of interest (here taxa) as well as the non-interesting patterns from a set of image patches showing representative examples of all taxa. The performance of the system can be assessed using a so-called gold standard, created from taxa positions provided by human experts for comparison with the machine-produced results. Since we were aware of the inter- and intra-observer agreement problem in human expert labeling tasks, we carried out a position labeling study with five human experts (the authors M. B., J. T., J. G., A. P. and J. D.). This study had two aims: first, to assess the taxon-specific inter- and intra-observer agreements of the human experts across a range of images; second, to collect human expert position labels for use in generating a gold standard for the taxa detection. To carry out the study, a subset of 10% of the 2004 transect (i.e. N = 70 images) was shown to five experts. These 70 images were randomly chosen from those with a footprint of 3.5-4.5 m² (i.e. 226 images). The experts were given the task of labelling the positions of all individuals in these images belonging to a set of 14 taxa/seabed features (the sponges Cladorhiza gelida, Caulophacus arcticus, Caulophacus debris, a small white sponge, the soft coral Gersemia fruticosa, a small white sea anemone, a purple anemone, the whelk Mohnia spp., the isopod Saduria megalura, the sea cucumbers Kolga hyalina and Elpidia heckeri, the sea lily Bathycrinus carpenterii, Bathycrinus stalks and ''burrow hole''). Taxa that gathered <150 labels across the 70 images were excluded from further analysis. Samples of the eight remaining taxa ($T_m$, $m \in \{1,\dots,8\}$, including the category ''burrow hole'') are given in Figure 2.
We chose the 2004 transect as it had already been extensively labelled by two of the experts and it was evident that different species, characterized by a variety of structure and color features, occurred in this image series. This species heterogeneity was important to investigate the general applicability of the iSIS system.

Table 2. [...] In the validation step, iSIS was applied to the entire images for taxa detection and the detection results were compared to our gold standard $g_m^{(k)}$ by computing SE and PPV. The performance decreases significantly from the test data to the validation due to an increase in FP. The last row shows SE and PPV results after a careful re-evaluation of the FP (see text for details) yielding our final estimates for iSIS' SE and PPV. The last column shows the correlation between object counts of the gold standard items and the machine detection result for the full transect. doi:10.1371/journal.pone.0038179.t002
The position labeling results of the five experts were compared to determine inter-observer agreements [83]. Observer agreements (OA) were computed for all pairwise combinations of two experts $U$ and $V$ and their corresponding sets of hand labels $L_U$ and $L_V$ by: [...] where # means ''items in'', $OA^+$ is given as the set of labels contained in both $L_U$ and $L_V$:

$$OA^+ = L_U \cap L_V$$

and $OA^-_U$ as the set of labels contained only in $L_U$:

$$OA^-_U = L_U \setminus L_V$$

and analogously for $OA^-_V$. To measure intra-observer agreements, each expert re-examined 35 images after 14 days. The intra-observer agreements were computed for each expert $U$ and her/his hand labels created before ($L_U$) and after the 14 day break (i.e. $L_V = L_U^{+14}$) with eq. (1).
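The pairwise agreement computation can be illustrated with a short sketch. The exact form of eq. (1) is not reproduced in the text above, so the ratio used here (shared labels divided by all distinct labels) is an assumption, as are the greedy nearest-neighbour pairing, the `max_dist` tolerance and all function names.

```python
import math

def observer_agreement(labels_u, labels_v, max_dist):
    """Return (#OA+, #OA-_U, #OA-_V, agreement) for two lists of (x, y) labels.

    Assumption: two labels refer to the same object if they lie within
    max_dist pixels; agreement = #OA+ / (#OA+ + #OA-_U + #OA-_V).
    """
    unmatched_v = list(labels_v)
    both = 0
    for ux, uy in labels_u:
        # Greedily pair each label of U with the nearest unused label of V.
        best = None
        for i, (vx, vy) in enumerate(unmatched_v):
            d = math.hypot(ux - vx, uy - vy)
            if d <= max_dist and (best is None or d < best[0]):
                best = (d, i)
        if best is not None:
            both += 1
            unmatched_v.pop(best[1])
    only_u = len(labels_u) - both
    only_v = len(unmatched_v)
    total = both + only_u + only_v
    agreement = both / total if total else 1.0
    return both, only_u, only_v, agreement

# Toy label sets for two hypothetical experts.
expert_u = [(10, 10), (50, 52), (90, 15)]
expert_v = [(11, 9), (49, 50), (200, 200), (300, 40)]
print(observer_agreement(expert_u, expert_v, max_dist=5.0))  # -> (2, 1, 2, 0.4)
```

The same function could be used for intra-observer agreement by passing the labels of one expert before and after the 14 day break.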
To collect a gold standard for the taxon detection, the position labels $L_m$ ($m \in \{1,\dots,8\}$) for each taxon $T_m$, obtained by all five experts within an image, were fused to taxon cliques, each clique summarizing the marked positions for one object of a taxon class $T_m$. A set of position labels of one taxon $T_m$ with a pairwise Euclidean distance smaller than a taxon-specific maximum distance $d_m$ is regarded as a clique. The number of labels in a clique is denoted by $k$, which ranges from $k = 1$, where only one expert (i.e. supporter) found the item, to $k = 5$, where all experts agree on the occurrence of this item. For each clique, a gold label position $g_m^{(k)} = (x, y)$ of x,y-coordinates was computed as the centroid of its supporting clique's position labels (see Figure 3). The taxon label numbers and observer agreements are given in Table 1.
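The fusion of human labels into gold standard labels can be sketched as follows. This is a simplified illustration and not the authors' implementation: the greedy clique assignment, the function name `fuse_labels` and the example distance of 10 pixels are assumptions, and the sketch does not check that each expert contributes at most one label per clique.

```python
import math

def fuse_labels(labels, d_max, min_support=3):
    """Group (x, y) labels of one taxon into cliques; return gold labels (x, y, k)."""
    cliques = []
    for point in labels:
        for clique in cliques:
            # Join a clique only if the point is close to every member,
            # i.e. all pairwise distances stay below d_max.
            if all(math.dist(point, other) < d_max for other in clique):
                clique.append(point)
                break
        else:
            cliques.append([point])
    gold = []
    for clique in cliques:
        k = len(clique)  # supporter count
        if k >= min_support:
            cx = sum(x for x, _ in clique) / k
            cy = sum(y for _, y in clique) / k
            gold.append((cx, cy, k))
    return gold

labels = [(100, 100), (102, 101), (98, 99), (101, 103), (99, 97),  # five supporters
          (400, 250), (403, 252)]                                  # two supporters
print(fuse_labels(labels, d_max=10.0))  # -> [(100.0, 100.0, 5)]
```

Lowering `min_support` yields a less conservative gold standard, which corresponds to the variation of k examined in the results.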

The iSIS System
The approach for object detection encompasses three major steps: Pre-processing and feature extraction (Step 1) is necessary to reduce illumination effects and to map image patches to high-dimensional representations in a vector space model, so-called feature vectors. In Step 2, these feature vectors are used to train a machine learning algorithm (Support Vector Machines (SVMs)), utilizing the human expert position labels. To detect the taxa in one image, the trained SVM classifiers are applied to the feature vectors derived at every pixel within the field of view of each image. The pixel-wise classifications are written to so-called confidence maps. In the final Step 3, these confidence maps are post-processed to derive positions of possible taxa and a numerical value for the number of taxa in every field of view. An overview of the whole approach is given in Figure 4.
Step 1: Feature extraction and pre-processing. To keep the taxon detection as generic as possible, a set of feature descriptors capable of describing arbitrary objects, based primarily on the MPEG7 standard was computed [84,85]. The MPEG7 standard defines 18 descriptors for different characteristics of digital images, each descriptor comprising an individual number of features. There are five descriptors for color features, three for texture, and ten others, which focus on structure, motion and face detection. Depending on the image domain, some of these are more useful than others (e. g. face descriptors were not used here). The descriptor set consisted of four color descriptors (i. e. Color Structure, Color Layout, Scalable Color, Dominant Color), one texture descriptor (Edge Histogram) as well as an adapted structure descriptor [86].
In principle, the two other texture descriptors specified in the MPEG7 standard would have been useful too, but require a minimum region for extraction (≥128×128 pixels), which would have added too much background signal in this setup. Those MPEG7 texture descriptors are based on a multi-scale, multi-orientation Gabor Wavelet filtering and describe spatial relationships between Gabor responses as well as dominant responses in the extraction region. To include the principal ability of Gabor Wavelets to describe textural features, the outputs of a modified version of a 3-scale, 5-orientation Gabor bank [87,88] were added as additional features without regard to their spatial occurrence or dominance.
Features were extracted within a frame of 32×32 pixels to create a rich feature representation of 424 dimensions (Figure 4, middle) for a neighborhood around an image pixel.
To correct the lighting conditions of an image $I_n$ ($n = 1,\dots,N$), it was filtered with a Gaussian kernel of size $M$, yielding a smoothed image $G_n$. $I_n$ and $G_n$ are composed of three color channels $c$ ($c \in \{$red (R), green (G), blue (B)$\}$), here denoted by a superscript (e.g. $I_n^G$ for the green channel of image $I_n$). By subtracting $G_n$ channel-wise from $I_n$, the lightness falloff towards the corners was removed: [...] Afterwards, the histograms of each of the $\hat{I}_n$ were transformed to obtain similar color distributions across the whole transect, yielding the image $F_n$ that was then used for feature extraction. $F_n$ is also a 3-channel image and each of the channels is computed by: [...] with: [...] and:

$$c = \frac{\log(128)}{\log\left(C \times (g_{peak} - g_{min})\right)} \qquad (7)$$

The values $g_{min}$, $g_{max}$ and $g_{peak}$ were computed using the grayscale image $I_n$: [...] by searching the peak in the histogram of $I_n$ (i.e. the gray value $g_{peak}$ with the highest pixel number in $I_n$). Starting from the peak value, $g_{min}$/$g_{max}$ were chosen as the nearest gray values below/above the peak with 1/1000th of the peak's pixel number.
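The search for the histogram landmarks can be sketched in a few lines. This is a minimal illustration under stated assumptions: gray values are integers in 0..255, and ''with 1/1000th of the peak's pixel number'' is read here as ''the first gray value whose count is at or below one thousandth of the peak count''. The function name `histogram_landmarks` is hypothetical.

```python
def histogram_landmarks(gray_values, fraction=1 / 1000):
    """Return (g_min, g_peak, g_max) for an iterable of integer gray values 0..255."""
    hist = [0] * 256
    for g in gray_values:
        hist[g] += 1
    # g_peak is the gray value with the highest pixel count.
    g_peak = max(range(256), key=lambda g: hist[g])
    limit = hist[g_peak] * fraction
    # Walk downwards/upwards from the peak until the count drops to the limit.
    g_min = g_peak
    while g_min > 0 and hist[g_min] > limit:
        g_min -= 1
    g_max = g_peak
    while g_max < 255 and hist[g_max] > limit:
        g_max += 1
    return g_min, g_peak, g_max

# Toy "image": a narrow histogram centred on gray value 120.
pixels = [120] * 2000 + [119] * 500 + [121] * 500 + [118] + [122]
print(histogram_landmarks(pixels))  # -> (118, 120, 122)
```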
To avoid manual tuning of one important parameter in this pre-processing, the Gaussian kernel size $M$, a data-driven tuning approach was developed. Feature vectors were computed for each human label position from images pre-processed with different $M$ values, ranging from 1 to 1501. Following the standard pattern recognition paradigm, feature vector clusters should identify the taxa. This motivates the selection of the particular $M$ value that leads to separated taxa feature clusters, i.e. a crisp cluster structure. To measure the clustering quality for different values of the kernel size $M$, the cluster indices (i.e. Calinski-Harabasz [89], Index-I [90] and Davies-Bouldin [91]) were computed as well as the intra- and inter-cluster variance. The kernel size $M$ leading to the best clustering result was chosen for pre-processing the entire transect.
Step 2: Training data and machine learning. For the machine learning step, i.e. teaching the classifiers to distinguish one taxon from other objects and from the background, training sets of feature vectors for each taxon are required. To collect a training set for a taxon $T_m$, feature vectors were computed for positions $g_m^{(k)}$ subject to a support count of $k \geq 3$. Because of the low numbers of remaining taxon labels, caused by some taxa's sparse population of the seafloor, the number of feature vectors was increased five-fold by computing them at the human label positions as well as at their 4-connected neighbours [92] at a distance of two pixels. This also adds some variation to the taxa representations. In addition, feature vectors were extracted from randomly distributed positions within each image and with a minimum distance to all human labels within each image. These feature vectors served as representatives of the background class ($T_0$).
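The five-fold increase of training positions can be pictured with a small sketch. It assumes that ''4-connected neighbours in two pixel distance'' means the four axis-aligned positions two pixels away from the label; the function names and the (x, y, k) tuple layout are hypothetical.

```python
def boosted_positions(x, y, step=2):
    """Return the label position plus its four axis-aligned neighbours."""
    return [(x, y), (x - step, y), (x + step, y), (x, y - step), (x, y + step)]

def training_positions(gold_labels, min_support=3):
    """Collect feature extraction positions from gold labels given as (x, y, k)."""
    positions = []
    for x, y, k in gold_labels:
        if k >= min_support:  # keep only labels with enough supporters
            positions.extend(boosted_positions(x, y))
    return positions

gold = [(100, 100, 5), (400, 250, 2)]
print(training_positions(gold))
# -> [(100, 100), (98, 100), (102, 100), (100, 98), (100, 102)]
```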
In pattern recognition, normalization of features is of crucial importance. In the iSIS system, features are grouped according to domains, e.g. all of the 15 Gabor features or the single ''number_of_dominant_colors'' feature (which belongs to the descriptor Dominant Color [84,85]) form two domain groups. Feature domain groups are treated individually in the normalization, and feature values within a group were normalized together to have a mean of 0 and a standard deviation of 1.
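Group-wise normalization can be sketched as below. This is an illustrative reading of the paragraph above, not the iSIS code: the group layout, the use of the population standard deviation and the function name `normalize_groups` are assumptions.

```python
import statistics

def normalize_groups(vectors, groups):
    """Z-score each feature domain group jointly across all feature vectors.

    vectors: list of equal-length lists of floats
    groups:  dict mapping a group name to the column indices it covers
    """
    out = [list(v) for v in vectors]
    for cols in groups.values():
        # Pool all values of the group so its features share one mean and std.
        values = [v[c] for v in vectors for c in cols]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        for row in out:
            for c in cols:
                row[c] = (row[c] - mean) / std if std > 0 else 0.0
    return out

# Two toy feature vectors: columns 0-1 form one group, column 2 another.
vectors = [[1.0, 3.0, 10.0], [5.0, 7.0, 30.0]]
groups = {"gabor": [0, 1], "number_of_dominant_colors": [2]}
normalized = normalize_groups(vectors, groups)
print(normalized)
```

Normalizing a group jointly preserves the relative scale of the features inside that group, whereas per-feature normalization would discard it.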
After normalization of all domain groups, an individual set of feature vectors $C_m = \{x\}$ ($x \in [0,1]^D$) was composed for every taxon $T_m$, consisting of 50% positive and 50% negative samples. The positive samples were all feature vectors computed for training set positions of one taxon $T_m$. One half of the negative samples consisted of background ($T_0$) feature vectors. The other half comprised equal amounts of all other taxa $T_{p \neq m, 0}$. The background feature set $C_0$ consisted of 50% background samples (positives) and 50% of equal amounts of all other taxa (negatives). Since the abundance of species varied, the size of the feature vector sets $C_0, \dots, C_8$ also varied. Using the nine feature vector sets $C_m$, nine classifiers were trained, each one to classify a feature vector as either taxon-positive or negative. iSIS uses SVM classifiers [93] for feature vector training and classification. SVMs are widely used because of their generalization performance in non-trivial, high-dimensional feature spaces, i.e. their ability to correctly classify previously unseen data. Further advantages are the absence of local minima in their training errors during optimization [94] and the low number of parameters (i.e. two in this case) that have to be tuned.
To train the nine SVMs (one for each taxon and one for the background), an implementation of SVMlight was used [95], wrapped by our own C/C++ machine learning library. A Gaussian kernel was used, and, in a first training step, optimal parameters for the kernel size $s$ and the SVM penalization parameter $C$ were estimated by logarithmic sampling of the parameter space ($10^a$, $a \in \{-1,\dots,2\}$, for $C$ and $s$, respectively). Small values of $C$ indicate low penalization of errors, leading to a better generalization. Small values of $s$ can create SVMs that tend to over-fit the training data, which results in a poor generalization. A 4-fold cross validation [96] was applied to tune the SVM parameters, i.e. three quarters of the feature vector set $C_m$ were used as the SVM training set for the SVM of taxon $T_m$ and the fourth quarter as the test set.
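The parameter search can be outlined without reference to a particular SVM library. The sketch below only builds the logarithmic grid and the 4-fold splits; it assumes integer exponents -1, 0, 1, 2 and an interleaved fold assignment, and both function names are hypothetical. Training and scoring an SVM for each grid point and fold is left out.

```python
from itertools import product

def log_grid(exponents=(-1, 0, 1, 2)):
    """Return all (C, s) pairs sampled as powers of ten."""
    values = [10.0 ** a for a in exponents]
    return list(product(values, values))

def four_fold_splits(n_samples, folds=4):
    """Yield (train_indices, test_indices): three quarters train, one quarter test."""
    indices = list(range(n_samples))
    for f in range(folds):
        test = indices[f::folds]  # every fourth sample, offset by the fold number
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

grid = log_grid()
splits = list(four_fold_splits(8))
print(len(grid))   # number of (C, s) combinations to evaluate
print(splits[0])   # first fold: training indices, test indices
```

In practice the samples would be shuffled before splitting, so that each fold contains a similar mix of positive and negative feature vectors.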
A trained SVM classifies a feature vector either as class-positive or -negative. The classification result for one feature vector can thus be assigned to one out of three groups: 1) True positives (TP) were correctly identified positive samples. 2) False positives (FP) were negative samples classified as positive. 3) False negatives (FN) were positive samples classified as negative. From these groups, the sensitivity (SE) and the positive predictive value (PPV) were computed as

$$SE = \frac{\#TP}{\#TP + \#FN}, \qquad PPV = \frac{\#TP}{\#TP + \#FP}$$

where # means ''amount of''. Both measures range between 0 and 1. The PPV of the test set is of special interest in detection tasks, since the highest priority is to minimize the number of false positives in unseen data. After determination of the optimal $s$ and $C$, the SVM training was repeated for each taxon $T_m$ with the full feature vector set $C_m$. Those SVMs were then used for classification of the full field of view of all images $\{F_n\}$ in the transect.
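As a small worked example, sensitivity and positive predictive value can be computed from predicted and true class memberships using the standard definitions SE = TP / (TP + FN) and PPV = TP / (TP + FP). The function names and the toy data are illustrative only.

```python
def confusion_counts(predicted, actual):
    """Count TP, FP and FN from two equally long lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    return tp, fp, fn

def sensitivity(tp, fn):
    """Fraction of true objects that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def positive_predictive_value(tp, fp):
    """Fraction of detections that are true objects."""
    return tp / (tp + fp) if (tp + fp) else 0.0

predicted = [True, True, False, True, False, False]
actual = [True, False, True, True, False, True]
tp, fp, fn = confusion_counts(predicted, actual)
print(tp, fp, fn)  # -> 2 1 2
print(sensitivity(tp, fn), positive_predictive_value(tp, fp))
```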
Step 3: Post-processing. All SVMs were forced to create a normalized output between 0 and 1 [97], such that a confidence map is created for each species. To quantify the species at hand, a post-processing was applied to the confidence maps, which consisted of two steps for each taxon $T_m$. First, the confidence maps were binarized with thresholds $t_m$. Connected regions (i.e. blobs) in the resulting binary image were then compared to a taxon-specific minimum blob size $s_m$. The trained SVMs were organized in a tree (Figure 5) to avoid a time-consuming classification of each pixel by all nine SVMs. Because the average taxon occurrence per image was sparse (i.e. ranging from 0.5 for Kolga hyalina up to 16.3 for ''burrow hole''), a 5 pixel margin around classified pixels was not used for further classification, to prevent false positives in unusually short distances around detected objects. For the background confidence map, the binarization threshold was set to $t_0 = 0.6$ and no blob detection was performed. The other taxa confidence thresholds $t_m$, the minimum blob size thresholds $s_m$ and the order of the SVMs in the classification tree were automatically tuned, analogous to the pre-processing. Similar to the fusion of human label cliques to gold standard labels, a distance threshold was used for each taxon $T_m$ to match the gold standard labels $g_m^{(k)}$ with the detected blob centroids. These assignments were evaluated to classify each blob centroid and each gold standard position as TP, FP or FN. From these quantities, the SE and PPV were computed.
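The binarization and blob filtering of a confidence map can be sketched with a plain flood fill. This is a minimal illustration, not the iSIS implementation: 4-connectivity, the ''greater than or equal'' threshold comparison, the function name `detect_blobs` and the example values are all assumptions.

```python
def detect_blobs(confidence, threshold, min_size):
    """Return (centroid_x, centroid_y, size) for each sufficiently large blob.

    confidence: 2D list (rows of floats in [0, 1]) for one taxon.
    """
    height = len(confidence)
    width = len(confidence[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or confidence[y][x] < threshold:
                continue
            # Flood fill over 4-connected pixels above the threshold.
            stack = [(x, y)]
            seen[y][x] = True
            pixels = []
            while stack:
                px, py = stack.pop()
                pixels.append((px, py))
                for nx, ny in ((px - 1, py), (px + 1, py), (px, py - 1), (px, py + 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and confidence[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            if len(pixels) >= min_size:  # discard blobs below the minimum size
                cx = sum(p[0] for p in pixels) / len(pixels)
                cy = sum(p[1] for p in pixels) / len(pixels)
                blobs.append((cx, cy, len(pixels)))
    return blobs

conf_map = [
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.8, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.9, 0.1, 0.1, 0.7],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
]
print(detect_blobs(conf_map, threshold=0.5, min_size=3))  # -> [(1.5, 1.5, 4)]
```

The resulting centroids are the positions that would then be matched against the gold standard labels with a taxon-specific distance threshold.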

Results
The human experts showed varying degrees of inter-observer agreement across different taxa, a phenomenon well known from similar visual diagnosis and assessment tasks. An agreement of 97% was found only for the conspicuous sea cucumber Kolga hyalina, whereas the human detection performance was only 70% for a small white sea anemone and as low as 35% for the sea cucumber Elpidia heckeri and 32% for a small white sponge. The semi-automatic approach performed at least similarly accurately for the ''easy'' species (see details below) and produced good and, above all, reproducible detections (Table 2). For taxa with morphological characteristics close to the resolution limit, successful identification was not possible for either humans or iSIS. In the following we summarize the results obtained using the iSIS approach.

Pre-processing
For the tuning of the Gaussian kernel size $M$, all cluster indices showed similar results: features extracted from unfiltered images were sub-optimal in their ability to create clusters in feature space, while pre-processing with small kernels ($M < 20$) reduced the performance even further. The same occurred when using larger kernels ($M > 1100$), where the cluster measures tended to show poorer results as well. Interestingly, all cluster measures remained relatively stable for kernel sizes between 50 and 1000 with no obvious peak, i.e. no optimum kernel size could be derived. Although the color distributions of differently pre-processed images diverge (small kernels lead to grayish images with high contrast, large kernels to smoother colors with less local color deformation), the utilized features do not seem to be affected in their capability to form taxon clusters. A kernel size of $M = 701$ was chosen, as the resulting images usually showed good lighting correction and contained only moderate color distortion. Some examples of the pre-processing and the normalized output of the cluster indices are given in Figure 6. The pre-processing takes ca. three minutes per image.

Machine Learning and Post-Processing
The normalization of the feature vectors and the construction of the training and test sets took less than one minute. The SVM parameters differed for different species. The performances for training and test data are given in Table 2. The first training step to determine the optimal C and s took about five minutes per SVM, which is the same as the time needed for the final SVM trainings together. The post-processing takes less than one minute for all nine SVM outputs combined.
The performance measures for iSIS are shown in Table 2. The classification performance on the training data is displayed as training-and test-error, showing a satisfying learning result for all taxa. In the validation experiment, we applied the trained SVMs to every pixel within the full field of view for a pixel classification and to be able to detect the taxa. We evaluated the classification result using our gold standard g (k) m , computing SE and PPV for each taxon class individually. The total counts for the gold standard and the machine results are given for each image in Figure 7, the correlation values of these are also given in Table 2. These correlation values give a different view of the results. While the SE and PPV values show the detailed performance at single object level, the correlation measure averages out some mistakes if false positives and false negatives occur within the same field of view. Two examples of the detection result are given in Figure 8.
The final results, as given in Table 2, look unsatisfying at first sight, especially the PPV values for the detection experiment in the entire images, i. e. the validation. A closer look at single FPs leads to the assumption that the false positive counts based on the reference gold standard were incorrect, i. e. many positives found by iSIS, which were not included in the gold standard were actually true positives. All false positives were thus re-analyzed by two of the authors (the authors M. B., T. S. ) to determine, what kind of mistakes happened during the detection. The results of this re-evaluation are given in Table 3. The last row of Table 2 incorporates these numbers and indicates a much better performance (SE: 87%, PPV: 67%). Approximately one third of the false positives were indeed true positives that were not labelled by the experts at all (k = 0) or were not included in the gold standard due to a low supporter count (kv3).
To study the effect of a higher or lower value of k, i.e. the effect of a more or less conservative gold standard setting, iSIS was run with values of k = 1, 2, ..., 5. This was done only in the post-processing, so the SVMs were not retrained, which would have affected the detection process. While a low value of k resulted in a higher PPV, a high value of k resulted in a higher SE. The performance values for different k are given in Table 4. The results show that the performance of a semi-automated detection approach is significantly affected by the initial training gold standard. If the main goal is to miss as few as possible of the objects that are prototypical for their species, only gold standard objects with a high supporter value should be used in the analysis.
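The trade-off between SE and PPV for different k can be sketched as follows. The object list is invented, and the rule that an object enters the gold standard when at least k experts marked it is an assumption made for this illustration; the detections themselves are held fixed, mirroring the fact that the SVMs were not retrained.

```python
# Hypothetical objects: (number of experts who marked it, detected by the machine?).
# A supporter count of 0 stands for a detection that no expert marked.
objects = [
    (5, True), (5, True), (5, True),
    (4, True), (4, True), (4, False),
    (3, True), (3, True), (3, False),
    (2, True), (2, False), (2, False),
    (1, True), (1, False), (1, False),
    (0, True), (0, True),
]

def evaluate(objects, k):
    """SE and PPV when the gold standard keeps objects with >= k supporters."""
    tp = sum(1 for s, d in objects if s >= k and d)
    fn = sum(1 for s, d in objects if s >= k and not d)
    # Detections of objects below the threshold count as false positives.
    fp = sum(1 for s, d in objects if s < k and d)
    se = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return se, ppv

results = {k: evaluate(objects, k) for k in range(1, 6)}
for k, (se, ppv) in results.items():
    print(f"k={k}: SE={se:.2f} PPV={ppv:.2f}")
```

With these made-up data, raising k shrinks the gold standard to objects that are easier to find, so SE rises, while detections of weakly supported objects are increasingly counted as false positives, so PPV falls.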
One particular taxon (Elpidia heckeri) could not be detected reliably, since its features (color and morphology) could not be sufficiently discerned from the sediment background. Samples of this species cover only a small number of pixels (< 50) and resemble stones in their structural appearance. While the SE of 0.91 is satisfying, the PPV of 0.04 shows that a vast number of false positives are detected by the SVM trained for this taxon. The challenges in detecting Elpidia heckeri with iSIS reflect the low inter- and intra-observer agreements for this species. Omission of Elpidia heckeri from the detection process led to a removal of about half of the total false positives (see 13th row in Table 2).

Discussion
In our work we have addressed the question of how the concepts of pattern recognition and machine learning can be applied to design a data-driven approach to the automation of taxa detection and megafauna quantification in large underwater image collections from camera transects. The most important design principle of this study was to develop a system which would enable a non-computer expert, such as a skilled taxonomist or other user, to adapt the system to new transects and/or to the detection of further megafauna species (for instance starfish, which have not been considered here but occur in HAUSGARTEN transect data). Our results show how a gold standard of human-labelled taxon positions in a training subset of images can be used to tune preprocessing steps and to train supervised machine learning algorithms (such as SVMs) in pixel classification tasks. Our results for training, test and validation errors show that the biggest remaining challenge is to improve the training step, reducing the FPs in the validation, and to improve the estimates for the errors on new data, since the contrast between test and validation error is considerable. We also found two factors to have a negative influence on the PPV estimates. First, the re-evaluation of the FPs showed that about one third of the false positives were indeed true positives. Second, another 30-40% of the FPs were species that were not included in the SVM training (e.g. Caulophacus arcticus, Caulophacus debris, Mohnia spp., etc.). Thus, including these species in the training data could reduce the FPs. We estimate the remaining true FPs to be approx. 30% of the original number. These FPs are misclassifications between different taxa or background pixels classified as taxa. Incorporating the additional true positives, the total SE value rises to 0.87, which is only a minor gain, whereas the total PPV rises to 0.67, which is a major improvement.
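Taking the proportions above at face value, one can ask what PPV would result if the detections of so far untrained species were also counted as correct. The following back-of-the-envelope sketch is not the authors' calculation; it assumes that the number of detections stays fixed and that every detection of an untrained species would become a true positive, so it gives an upper bound.

```python
def ppv_after_reassignment(ppv, fp_fraction_resolved):
    """PPV once a fraction of the remaining FPs is counted as TPs.

    The number of detections stays fixed, so each reassigned FP
    moves from the FP share (1 - ppv) to the TP share.
    """
    return ppv + (1.0 - ppv) * fp_fraction_resolved

ppv_reevaluated = 0.67          # total PPV after the re-evaluation (from the text)
already_reassigned = 1.0 / 3.0  # share of original FPs found to be TPs (from the text)

projected = {}
for untrained_share in (0.30, 0.40):  # share of original FPs from untrained species
    # Express the untrained-species share relative to the FPs that remain.
    of_remaining = untrained_share / (1.0 - already_reassigned)
    projected[untrained_share] = ppv_after_reassignment(ppv_reevaluated, of_remaining)

for share, value in projected.items():
    print(f"untrained share {share:.2f}: projected PPV {value:.2f}")
```

Under these assumptions the projected PPV lies between roughly 0.82 and 0.87, depending on which end of the 30-40% range is used.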
Assuming optimistically that the species for which iSIS has not yet been trained can be identified with SE and PPV values similar to those of the species studied so far, the PPV for the dataset as a whole could potentially increase to 0.83. Another strategy to further improve the results would be to omit regions within images from classification, based on the density of detected objects. Parts of images covered by fauna such as Caulophacus arcticus (Figure 8, bottom) create several FPs that lie close together and are hence distinguishable from other regions, where detection results are usually sparse.
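One way such a density-based omission could look is sketched below. This is an illustrative assumption rather than part of iSIS: the function name, the neighbourhood radius and the neighbour threshold are hypothetical, and the detection coordinates are invented.

```python
from math import hypot

def drop_dense_detections(points, radius, max_neighbours):
    """Keep only detections with at most max_neighbours others within radius.

    Simple O(n^2) neighbour count; adequate for the few detections
    expected per image.
    """
    kept = []
    for i, (xi, yi) in enumerate(points):
        neighbours = sum(
            1
            for j, (xj, yj) in enumerate(points)
            if j != i and hypot(xi - xj, yi - yj) <= radius
        )
        if neighbours <= max_neighbours:
            kept.append((xi, yi))
    return kept

# Hypothetical pixel positions: three isolated detections and a tight
# 3 x 3 clump, as might arise on a large sponge.
sparse = [(100, 120), (900, 300), (450, 800)]
clump = [(600 + dx, 600 + dy) for dx in (0, 15, 30) for dy in (0, 15, 30)]

kept = drop_dense_detections(sparse + clump, radius=50, max_neighbours=2)
print(kept)
```

The isolated detections survive, while every point of the clump has eight neighbours within the radius and is discarded. Taxa that naturally occur in dense aggregations would be suppressed by such a filter as well, so the threshold would have to be chosen per taxon.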
Although false positives remain in the iSIS-analyzed data, the system can be applied to speed up the quantification of megafauna taxa substantially. A full manual evaluation takes approx. 30-60 minutes per image (and is error-prone for many taxa). One way to massively reduce this time would be to first apply iSIS to mark all potential positions of taxa of interest and then let a user review the positions (for instance in a guided zoom-in mode) and accept or reject each iSIS-produced detection. Such a posterior evaluation of iSIS-detected taxa in an image takes about 1 minute, estimated from our own experience using the BIIGLE system in similar contexts. Without such a re-evaluation, the detection results may overestimate the occurrences of taxa and thus cannot yet be used for quantitative investigations of transects.
If the system is not trained with key species of the ecosystem, the detection counts may lead to incomplete or incorrect assumptions about habitat processes. Careful consideration of relevant species and suitably large sample sizes of those species are therefore vital for successful application of the iSIS approach.
The number of individual examples of a species required for successful detection may vary across species. An approach to estimating this number could utilize cluster indices, as used for the optimization of the kernel size M in the image preprocessing. The iSIS system could thus request further labels from the expert if the cluster indices indicate that too few individuals of a taxon have been labelled (i.e. the feature representations of this taxon do not yet form clusters in the feature space).
Keeping those prerequisites in mind, iSIS currently allows the collection of taxa positions with reduced effort, which enables researchers to carry out investigations of taxa densities, their dynamics over time and species co-occurrences more efficiently.
This could potentially open the large data archives created by farsighted seafloor observation programmes and give deeper insights into distributions and dynamics of communities of benthic megafauna. The use of iSIS with re-evaluation allowed us to quantify megafaunal densities over the whole HG IV transect for the first time (Fig. 4 bottom right). From this analysis, certain conclusions on species distribution are immediately apparent, such as a patchy occurrence of the small white sea anemone (possibly Bathyphellia margaritacea) along the HG IV transect, which is corroborated by [43]. Although the sea lily Bathycrinus carpenterii was present throughout the whole transect, iSIS detected higher densities of it towards the last two thirds of the transect, whereas the opposite was true for the sea cucumber Kolga hyalina. This could be a result of species interactions and/or differences in the spatial distribution of resources. Since megafaunal organisms affect the distribution of smaller-sized biota and shape benthic food webs through predation, such findings are important in understanding ecosystem functioning. Furthermore, the envisaged application of iSIS to HG IV footage from different years will enable us to assess changes in the distribution of key megafaunal species over time in an area particularly vulnerable to the effects of climate change. We will also apply iSIS to images from other HAUSGARTEN stations and possibly other benthic locations. Advances in camera technology, associated with higher image resolution, will allow improved detection performances in the near future.
iSIS shows how computerized image analysis can assist in the inspection and monitoring of deep-sea benthos. The results resemble those produced manually by human experts, while the approach greatly reduces the human time commitment and removes the negative effects of observer fatigue. Further studies on automated detection approaches for benthic images are worthwhile, since such approaches allow marine areas that are not easily accessible to be investigated without concurrent disturbance of the benthic system by sampling gear. The development of such automated systems is a new field of marine research and allows the creation of new tools to improve the ongoing efforts to explore and understand the vast uncharted regions of the seafloor.

Acknowledgments
Special thanks go to Muhammet Bastan for providing us with the source code of the MPEG7 feature extractors. Thomas Soltwedel provided the map of HAUSGARTEN. We thank Antje Boetius for comments on the manuscript and for financial support through the DFG Leibniz program. Three anonymous reviewers improved an earlier draft of this paper. This is publication 10013/epic.39189 of the Alfred Wegener Institute for Polar and Marine Research. MB was funded by KongHAU (The Kongsfjord-HAUSGARTEN transect case study: Impact of climate change on Arctic marine community structures and food webs). KongHAU is closely linked to the EU project HERMIONE. AP was funded by the European Community's Seventh Framework programme (FP7/2007-2013) under the HERMIONE project (grant agreement no. 226354).