Repeatability of flatfish reflex impairment assessments based on video recordings

The use of reflex impairment and injury measures to quantify an aquatic organism’s vitality has gained popularity for predicting the survival of discarded non-target fisheries catch. To evaluate the robustness of this method with respect to ‘rater’ subjectivity, we tested inter- and intra-rater repeatability and the role of ‘expectation bias’. From video clips, multiple raters determined impairment levels of four reflexes of beam-trawled common sole (Solea solea) intended for discard. Raters had a range of technical experience and included veterinary students, practicing veterinarians, and fisheries scientists. Expectation bias was evaluated by first assessing a rater’s assumption about the effect of air exposure on vitality, then comparing their reflex ratings of the same fish, once when the true air exposure duration was indicated and once when the time was exaggerated (by either 15 or 30 min). Inter-rater repeatability was assessed by having multiple raters evaluate those clips with true air exposure information, and intra- and inter-rater repeatability were determined by having individual raters evaluate a series of duplicated clips, all with true air exposure. Results indicate that inter- and intra-rater repeatability were high (intra-class correlation coefficients of 74% for both) and were not significantly affected by background type or by expectation bias related to the assumed impact of prolonged air exposure. This suggests that reflex impairment as a metric for predicting fish survival is robust to the involvement of multiple raters with diverse backgrounds. Bias is potentially more likely to be introduced through subjective reflexes than through raters, given that consistency in scoring differed for some reflexes based on rater experience type. This study highlights the need to provide ample training for raters and shows that no prior experience is needed to become a reliable rater. Moreover, before implementing reflexes in a vitality study, it is important to evaluate whether the determination of presence/absence is subjective.


Introduction
To address concerns over discard practices and animal welfare in commercial fisheries, methods are needed to reliably profile fish condition onboard vessels to describe fishing impacts on both individuals and populations [1][2][3][4]. To allow assessments in adverse and remote conditions, responsiveness of a fish to induced stimuli expressed as a binary presence-absence score [...] beam-trawler, the R/V Simon Stevin and at a research laboratory in Ostend, Belgium. All research-related handling was designed to minimize any stress additional to that of being captured by beam trawls and sorted on deck. For example, any air exposure during the reflex tests was kept to a minimum and was well within exposure times during conventional, commercial sorting practices. If fish were held captive, housing mimicked natural conditions. The filming did not require any extra handling procedures. Animal ethics approval was granted by the Animal Care and Ethics Committee of the Flanders Research Institute for Agriculture, Fisheries and Food (ILVO) (EC2016/264).

Equipment and treatments
To evaluate inter- and intra-rater repeatability, we conducted a series of seven workshops in which third-year veterinary medicine students, practicing veterinarians/food safety inspectors, or fisheries scientists, each in separate sessions, scored four reflexes of common sole from short (<30 s) video sequences (or 'clips'; Fig 1). The first scoring session, conducted in April 2015, was attended by third-year veterinary medicine students from the University of Ghent (N = 120 female and N = 35 male students; N = 2 male non-student experts). The second session, in May 2015, occurred during a lunchtime seminar with fisheries research scientists with diverse expertise (N = 7 female and N = 11 male). The third session, in December 2015, was held during an international workshop of fisheries research scientists with specialist expertise in discard survival studies (N = 5 female, N = 8 male). The fourth session, in January 2016, included seagoing fisheries observers and fisheries technicians (N = 6 female, N = 13 male). The final sessions (5-7) were held: (5) in April 2016 with third-year veterinary medicine students from the University of Ghent (N = 140 female, N = 39 male students; N = 2 genderless; N = 1 male expert); (6) in December 2016 with practicing veterinarians/food inspectors (N = 12 female, N = 20 male; N = 1 male expert); and (7) in December 2017 with fisheries research scientists with diverse expertise (N = 6 female; N = 7 male; N = 1 genderless).
The reflexes that were selected (body flex, righting, head, and tail grab; Table 1) were those used to assess common sole [4] and that were clearly visible in a video clip. Each workshop began with a 15-min lecture with visual aids detailing the utility of the reflex scoring method as an animal welfare indicator and predictor for discard survival (Supporting Video 1, https://doi.org/10.14284/399). Participants were also informed about relevant factors in the catch-and-discarding process that potentially stress fish and result in weaker reflex responses, namely prolonged periods of air exposure on deck, among others. Following the lecture, participants were trained on example video clips showing, for each of the four reflexes, an 'absent', 'weak', 'moderate', or 'strong' reflex response (Table 1; Supporting Video 1, https://doi.org/10.14284/399). The key criterion associated with each response category was read out loud and given to each participant in the form of a pictogram handout (Fig 2A). These training clips were unique and not used again within the video clips that were scored by the participants.
Video clips were used to test whether the same rater ('intra-rater repeatability') or different raters ('inter-rater repeatability') were able to repeat the same score for the same fish, and whether a rater's score could have been influenced by knowing how long a given fish had been exposed to air prior to its reflex test (intra-rater repeatability combined with a test for an expectation bias effect). A total of 36 video clips of the four reflex responses of common sole were picked out of a reference library of video clips, representing a range of impairment states filmed inside a laboratory (N = 5 fish) or on-board a commercial beam trawler (N = 5 fish; Table 2). Overall, clips included reflexes across the categorized spectrum of responses (ranging from absent to strong). Three experienced expert raters who were involved in the development of the reflex scoring methodology scored the 12 unmodified, original clips that were used to create the scoring video used during sessions 2-4 (Fig 3). This showed that the selected clips did not bias reflexes towards either weak or strong responses.
To address intra-rater repeatability and bias related to expectation of an effect from prolonged air exposure, between 12 and 16 of the 36 clips were duplicated (Table 2). All duplicates were mirrored or at least slightly modified in Adobe Photoshop by increasing their brightness levels to mislead the viewers into assuming that all clips were unique. Onto each clip, the true or falsified number of minutes the fish spent on deck exposed to air prior to the reflex test ('air exposure') was labelled, together with a date and time stamp. To falsify air exposure times, an arbitrary 15 min or 30 min was added to the true value. These air exposure periods were chosen i) to represent conventional commercial catch sorting times and ii) to increase the expectation bias potential for each rater. The greater value (i.e., longer air exposure time) was chosen to increase the likelihood of the rater being influenced by this information if the rater had a preconceived idea about the effect of air exposure on vitality.
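The labelling scheme for duplicated clips can be sketched in a few lines of Python. This is only an illustration of the bookkeeping described above, not the procedure used to produce the scoring videos: the class name, the function name, the "a"/"b" suffix convention, and the 20-min example value are all hypothetical.

```python
from dataclasses import dataclass

# Offsets (in minutes) that were added to the true air exposure
# to falsify a label, as described in the text.
FALSE_OFFSETS_MIN = (15, 30)


@dataclass(frozen=True)
class LabelledClip:
    clip_id: str        # hypothetical identifier
    reflex: str         # "body flex", "righting", "head" or "tail grab"
    true_air_min: int   # minutes the fish was actually exposed to air
    shown_air_min: int  # minutes printed onto the clip

    @property
    def falsified(self) -> bool:
        return self.shown_air_min != self.true_air_min


def make_duplicate_pair(base_id, reflex, true_air_min, offset_min):
    """Return (original, duplicate); the duplicate shows an exaggerated time."""
    if offset_min not in FALSE_OFFSETS_MIN:
        raise ValueError("offset must be 15 or 30 min")
    original = LabelledClip(base_id + "a", reflex, true_air_min, true_air_min)
    duplicate = LabelledClip(
        base_id + "b", reflex, true_air_min, true_air_min + offset_min
    )
    return original, duplicate


# Illustrative values only: a fish exposed to air for 20 min.
original, duplicate = make_duplicate_pair("clip_01", "head", 20, 15)
```

Keeping the true value alongside the displayed value makes it straightforward to pair each falsified clip with its original at the analysis stage.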

Data and analyses
To analyse whether a reflex response was biased toward a lower or higher tVAS score when an elevated air exposure was falsely indicated on the clip, each rater's scores of duplicate clips were compared with a linear mixed model (LMM; lme4 package in R [28]). The fixed effects were i) a rater's expectation of an effect of prolonged air exposure on reflex responsiveness, ii) the experience level of each rater, and iii) the reflex type, together with all possible interactions. Random effects were included for the ID of a given clip and for a rater's ID. The Tukey method was used to compare each reflex for corresponding pairs of duplicate clips shown with either false or true air exposure. These were evaluated by each rater's expectation group (expecting a positive impact, or no or a negative impact, from air exposure) and experience level (no, <100, or ≥100 vertebrate animals previously assessed for reflex responsiveness). A significance level of 0.05 was applied.
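Before fitting a mixed model, the paired structure of the data can be inspected descriptively. The Python sketch below is not the LMM described above and does not replace it. It only groups each rater's paired scores by expectation group and reflex and reports the mean difference between the falsely and truly labelled duplicate. The cut-off of 30 on the 0 to 100 expectation scale is the one reported in the Results; the record keys and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Expectation scores below this cut-off count as a "positive" expectation
# (air exposure expected to weaken the reflex).
EXPECTATION_CUTOFF = 30


def expectation_group(expectation_score):
    """Map a 0-100 expectation score to 1 (positive) or 0 (no/negative)."""
    return 1 if expectation_score < EXPECTATION_CUTOFF else 0


def mean_paired_difference(records):
    """Mean (false-label score minus true-label score) per (group, reflex).

    Each record is a dict with hypothetical keys 'expectation' (0-100),
    'reflex', 'score_true' and 'score_false', describing one rater's
    scores for one pair of duplicated clips.
    """
    diffs = defaultdict(list)
    for rec in records:
        key = (expectation_group(rec["expectation"]), rec["reflex"])
        diffs[key].append(rec["score_false"] - rec["score_true"])
    return {key: mean(values) for key, values in diffs.items()}
```

A negative mean difference for a group would be in the direction expected under expectation bias (lower scores when a longer air exposure is displayed). Whether such a difference is significant still has to be judged with a model that accounts for clip and rater effects.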
Inter- and intra-rater repeatability were estimated based on inter- and intra-rater reliability coefficients [29][30], which were implemented using the irr package [31]. To estimate inter-rater repeatability, all clips with true air exposure information were included in the dataset (scoring sessions 1-7; Table 2); raters were also arbitrarily stratified by their reflex rating experience, and coefficients were calculated across all included clips or per reflex type. To estimate both inter- and intra-rater repeatability on the same dataset, only duplicated clips with true air exposure information were included (scoring sessions 5-7; Table 2). The intraclass correlation coefficient (ICC) is based on the ratio of the variability among raters' reflex scores over the sum of this variance plus error, thus ranging between 0 and 100%. A higher ICC reflects higher agreement among the raters for a given clip or per reflex type. The ICC was estimated using the psych package in R [32]. In this study, we report the ICC for a single random rater [29].
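To make the ICC concrete, the following Python sketch computes a single-rater, two-way random-effects ICC from mean squares. It assumes that "ICC for a single random rater" corresponds to the form commonly written as ICC(2,1) and that the score table is complete (every rater scored every clip), which was not the case for the full dataset in this study. It is not the irr or psych implementation, and the function name is hypothetical.

```python
def icc_single_random_rater(scores):
    """ICC(2,1) for a complete clips x raters table (list of equal-length rows)."""
    n = len(scores)     # number of clips (rows)
    k = len(scores[0])  # number of raters (columns)
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with >= 2 clips and >= 2 raters")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares: clips, raters, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Small worked table: six clips (rows) scored by four raters (columns).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_single_random_rater(example), 2))  # 0.29
```

Multiplying the result by 100 gives the percentage scale used in this study. Because systematic differences between raters enter the denominator, this form penalises raters who are consistently stricter or more lenient than others.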

Results
In total, 436 participants scored video clips during the seven dedicated scoring workshops and produced 13,676 scores, because not all participants were equally able to score each of the 36 clips (Table 3). The majority (N = 401) were inexperienced raters (i.e., had never previously scored animals for reflex responsiveness). Of these, the majority were students, but some were scientists, technicians, observers, and practicing veterinarians/food safety inspectors. Fourteen participants had previously scored the reflexes of some animals (<100), and 21 had scored those of ≥100 fish (i.e., 'experienced'). One of the raters with some experience had observed behavioural responses among seabirds and seals, but not fish.

Table 1. List of scoring criteria for categorical reflex responses (i.e., absent, weak, moderate, and strong) of common sole (Solea solea) in the order tested within 5 s of observation after stimulus (based on [4,9]).

Reflex: Body flex
Stimulus: The fish is held outside the water on the palms of two hands (touching each other) with its belly facing up and its head and tail unsupported.
Absent: No active movement, the body rests limp on the hand.
Weak: Tail is moving slightly, but not beyond the plane of the hand.
Moderate: Tail is flexing beyond the plane of the hand. Body may move (spastic flexion), or is slowly slipping off the hand.
Strong: The fish is actively trying to move head and tail towards each other, or is quickly slipping off the hand.

Reflex: Righting
Stimulus: The fish is held underwater at the surface on the palms of two hands (touching each other) with its belly facing up and then is slowly released.
Absent: Fish drifts and sinks passively to the bottom of the container.
Weak: Fish appears stunned, but rights itself very slowly.
Moderate: Fish appears stunned, but starts to turn after a delay. The rotation can be swift.
Strong: Fish actively and quickly turns underwater.

Reflex: Head
Stimulus: The fish's head is held between thumb and index finger, with either belly or dorsal side facing up.
Absent: No movement. The body dangles motionless.
Weak: The fish may move its tail slightly.
Moderate: The fish may exhibit a cramp-like flexion, but no clear curling, nor repeated bending.
Strong: Fish immediately and repeatedly curls around fingers.

Reflex: Tail grab
Stimulus: The fish's tail is held between thumb and index finger.
Absent: Fish does not struggle free; it remains motionless upon release.
Weak: Fish does not struggle free; no swimming movement, but swims away upon release.
Moderate: Fish does not struggle free, but moves its body as if it attempts to swim away.
Strong: The fish actively struggles free and swims away.

Intensity of a response increases from absent to strong. The speed of a response for weak and moderate categories may be delayed; for strong it should be immediate.

Expectation bias
The dataset that included scores of duplicated clips with either true or falsified air exposure information comprised 3,525 scores, which were assigned in workshop sessions 2-7 to duplicate clips by those participants who indicated an expectation about the effect of air exposure on reflex responsiveness (Table 2). Scores by participants from session 1 were not included, because not all duplicated clips were paired by true/false air exposure information (Table 2). Based on histogram data indicating a clear distinction between values greater and less than 30, a positive expectation (i.e., air exposure would exacerbate reflex impairment) was set at <30, and a negative expectation (i.e., air exposure would reduce reflex impairment) at ≥30 (Fig 4). Of these scores, 70% were assigned by participants with a positive expectation (Fig 4). An expectation of the effect of air exposure on reflex responsiveness did not bias the scoring of reflex clips. The null hypothesis (i.e., no difference in scores due to air exposure information) was not rejected for raters who expected air exposure to increase reflex impairment (positive expectation; N = 128; Table 3). Overall, these raters were not more likely to assign a lower score to a duplicated clip that showed falsified air exposure (extra 15 or 30 min) than to the original, which was stamped with the true air exposure time (Table 4). Generally, where available, the median scores followed what the three expert raters had assigned to each clip ('silver standard' score), although for some clips scored by raters with some or substantial experience, the median values deviated from the silver standard (Fig 5). For some duplicated clips that were scored by raters with some prior reflex scoring experience, lower scores were assigned to the falsified clips, in line with our hypothesis (i.e., duplicates with IDs 3_10a & b; Fig 5C), but this difference was not significant (Table 4).
In advance of scoring, some raters expected that the reflex would not be affected by air exposure or would even become stronger (N = 85; decreased impairment = negative expectation; Table 3). This aligned with clips of the body flex reflex, for which raters with no reflex assessment experience consistently scored clips with falsified air exposure higher than those with true air exposure (Table 4; Fig 5B). This contrasted with our null hypothesis.

Intra- and inter-rater repeatability
When quantifying inter-rater repeatability (the dataset included scores of all clips with true air exposure information, some of which were duplicated; 6,664 observations), raters with different experience levels in scoring reflex impairment were able to reproduce the same score for a given clip when scoring independently in different sessions, with an intra-class correlation coefficient (ICC) of 76% (68% to 84%, lower and upper confidence interval, CI). Participants who had no prior scoring experience produced a lower ICC (76%, 68% to 84% CI) than participants who had scored at least some fish throughout their career (ICC = 79%, 71% to 87% CI). However, the latter sample was rather small (N = 29) compared with the 396 raters with no experience who were considered in this analysis. A similar pattern emerged when comparing ICC values per reflex type. For example, for the tail grab reflex, raters with at least some experience scored more consistently than raters with no experience (ICC = 86%, 76% to 94% CI vs ICC = 81%, 69% to 92% CI, respectively). Similarly, but with a less prominent difference, for the head reflex, raters with at least some experience had an ICC of 79% (63% to 93% CI) compared with 78% (61% to 93% CI) for raters with no experience. Including only seagoing observers and those experts who developed this methodology increased the ICC (ICC = 83%, 67% to 95% CI).

Least consistent were the scores for the body flex, regardless of experience (ICC = 15%, 0.07% to 46% CI versus ICC = <1%, -11% to 37% CI, for raters with no experience and with at least some experience, respectively). The dataset used to calculate the ICC of both intra- and inter-rater reliability, which included only duplicated clips with true air exposure information, comprised 3,664 observations. Across all reflexes, relatively high ICC values of 74% were achieved for both inter- and intra-rater reliability. For individual reflexes, the ICC values for intra- and inter-rater reliability were nearly identical (differing only beyond the third decimal); the highest was achieved for head (92%), followed by tail grab (78%) and righting (45%), and by far the lowest for body flex (<1%).

Fig 4. [...] N of unique raters is indicated above each bar, as marked by raters on their scoresheets in response to a question on whether a reflex response would weaken or strengthen when the animal was knowingly exposed to air for a prolonged period (15-30 min).
https://doi.org/10.1371/journal.pone.0229456.g004

Table 4. Tukey comparisons of the least-square mean (lsmean) ± SE reflex score of a given reflex type which was scored by a rater with a certain experience and a positive (1) or negative (0) expectation. Clips were duplicated within a scoring video and imprinted onto the screened clip with either false (an added 15 or 30 min to the true value) or true air exposure information. A rater's expectation (scored on a scale of 0 to 100) of the effect of prolonged, onboard air exposure on a fish's reflex responsiveness was categorized as to whether it would result in either a weaker reflex response (<30; positive expectation; 1) or no effect (≥30; no or negative/wrong expectation; 0). Our hypothesis was that clips imprinted with false air exposure information would receive a lower score than their duplicate shown with the true value, as the fish would have been weakened from additional air exposure (positive expectation). Groups with the same letter were not significantly different at p = 0.05.

Fig 5. [...], which refers to duplicated clips of the same fish and reflex with either true or false air exposure information; or 'Intra-rater reliability' (IOR), which refers to duplicated clips of the same fish and reflex and always true air exposure information. Where available, dots indicate the 'silver standard' scores, which were averaged across three experienced, expert raters who scored 12 unmodified, original clips.
https://doi.org/10.1371/journal.pone.0229456.g005

Discussion
There is a global effort to determine the limitations and strengths of methods that profile fish condition related to fishing impacts and survival prediction [21,33]. This study examined whether vitality information is reliable based on the involvement of multiple raters and/or on their experience level, and whether scoring repeatability can be influenced by knowing the treatment a fish has received. Results suggest that vitality assessments using reflex responsiveness are robust. In regard to expectation bias, there was no evidence that the exaggerated air exposure information influenced intra-rater repeatability. Regardless of their experience, raters were not misled to assign lower reflex scores to fish which they believed were exposed to air for a prolonged period of time, even when they expected air exposure to positively impact reflex impairment. This does not mean that other variables cannot invite expectation bias; however, it does suggest that perhaps when focusing on a specific metric over a short time frame, the rater does not subconsciously bias their assessment, especially when the scoring criteria (here between absent and present) are unambiguous.
These results are promising if reflexes are to be used in settings with multiple, independent raters and/or with raters who do not have a strong background in reflex assessment. We do however acknowledge that this study was done through video clip analysis rather than having participants handle fish. There is the possibility that tactile experience in fish handling or reflex scoring could result in inter-rater variability among scores. However, [9] found no inter-rater differences when multiple participants scored the same live fish for reflex impairment.
While rater experience in conducting reflex assessments did not bias the scoring outcomes (similar to results from [9]), results suggest that bias is potentially more likely to be introduced through subjective reflexes than through raters, especially when reflexes are presented as short (<30 s) video clips. This includes reflexes that are difficult to assess or that elicit responses that are difficult to classify as present or absent, as well as reflexes such as body flex and righting, which were tested in rapid succession during evaluation. This supports the need for researchers to scrutinize the reflexes selected for a vitality study in advance of data collection, based on a screening for consistent and unambiguous candidate reflexes among unstressed fish [4]; to be deliberate about scoring metrics (i.e., binary vs. continuous scoring); and, ideally, to establish a concrete physiological link between a stressor and reflex impairment to validate underlying hypotheses that such links exist [20][21]. Ideally, during data collection, each rater should be blinded to any prior treatments a study animal may have received; likewise, an analyst should be unaware of who did the scoring [15]. Video clips need to be edited with this in mind. Experience may contribute to a subjective interpretation of scoring criteria when previously acquired routines and self-made 'rules' bias an assessment.
This study also supports the use of video-taped reflex assessments that can be reviewed at a later time. This allows multiple assessments of the same video and the inclusion of reviewers who are unable to go to sea for each field trial. It is also beneficial for training purposes, because it minimizes handling of fish. While there is evidence that untrained raters are capable of rating as accurately as, or even more accurately than, experienced raters, for future studies using reflex impairment as a vitality metric we recommend a substantial training programme for raters that includes protocols with clear and meaningful definitions, scoring of videos with pictogram-based handouts, repeated training sessions, and continued repeatability checks [34]. In addition, if video assessments are performed, it is helpful to place sheets describing the reflexes in front of the raters, to allow a fixed amount of time to observe each clip, and to ensure that only one reflex is shown per video clip. A video could also be shown on a touch screen so that the rater has more control over viewing; however, the time to review should be limited.
Blinding and intra- and inter-rater reliability analyses are relevant concepts that should be considered for robust inference within experimental fisheries science, especially where many independent raters are involved, for example, when fish otoliths are read to determine age (e.g., [35]) or when vitality indices are used to evaluate the welfare and/or freshness of catches either on-board vessels or at fish auctions. Among domestic farm animals, such assessments are routinely done (e.g., [36]).