Development of a fixed list of terms for the Qualitative Behavioural Assessment of shelter dogs

The shelter environment may have a severe impact on the dogs’ quality of life, and there is thus a need to develop valid tools to assess their welfare. These tools should be sensitive not only to the animals’ physical health but also to their mental health, including the assessment of positive and negative emotions. Qualitative Behaviour Assessment (QBA) is a ‘whole animal’ measure that captures the expressive quality of an animal’s demeanour, using descriptive terms such as ‘relaxed’, ‘anxious’, and ‘playful’. In this study, for the first time, we developed and tested a fixed-list of qualitative QBA terms for application to kennelled dogs. A list of 20 QBA terms was developed based on literature search and an expert opinion survey. Inter-observer reliability was investigated by asking 11 observers to use these terms to score 13 video clips of kennelled dogs. Principal Component Analysis (PCA) was used to extract four main dimensions explaining 70.9% of the total variation between clips. PC1 characterised curious/playful/excitable/sociable demeanour, PC2 ranged from comfortable/relaxed to anxious/nervous/stressed expression, PC3 described fearful demeanour, and PC4 characterised bored/depressed demeanour. Observers’ agreement on the ranking of video clips on these four expressive dimensions was good (Kendall’s W: 0.60–0.80). ANOVA showed a significant effect of observer on mean clip score on all PCs (p<0.05), due to few observers scoring differently from the rest of the group. Results indicate the potential of the proposed list of QBA terms for sheltered dogs to serve, in alignment with other measures, as a non-invasive assessment tool. However, the observer effect on mean PC scores points towards the need for adequate observer training, particularly in live scoring conditions. The QBA scoring tool can be integrated with existing welfare assessment protocols for shelter dogs and strengthen the power of those protocols to evaluate the animals’ experience in shelters.


Introduction
Thousands of dogs around the world are kept in temporary or permanent confinement for a variety of reasons and in a way that may affect their welfare [1,2]. There is evidence that shelter environments may have a severe impact on the quality of life of dogs [3,4]. This is likely due to factors such as social isolation and novel surroundings [5], especially if these are protracted over long periods of time [6]. For this reason, there has been an increasing interest by the scientific community to develop validated tools to assess the welfare of sheltered dogs [3,7].
Over the last decades, the concept of animal welfare has evolved from focusing primarily on the animal's physical health and ability to cope with its environment [8], to recognising that animals are sentient beings capable of experiencing positive and negative emotions [9]. It is now accepted that animals, though healthy, can nevertheless experience negative emotions when housed in unsuitable environments [10]. In addition the concept of animal welfare has shifted in recent times from focusing on the reduction of negative emotions (e.g. fear, pain) to ensuring that captive/domestic animals are also experiencing positive emotions (e.g. pleasure, happiness) [11,12]. It is therefore essential that animal welfare assessment tools include measures of positive welfare.
The potential of qualitative methods for assessing behaviour and welfare to play a role in such developments, and, alongside quantitative methods, to contribute to useful measures of positive animal welfare, has been a subject of review and discussion [13,14]. Qualitative Behavioural Assessment (QBA) is one such method which to date has generated a substantial body of research assessing emotional expressivity in a range of animal species [15][16][17][18]. QBA is a 'whole-animal' approach measuring not so much what the animal is doing physically, as how it is performing these behaviours, i.e. the expressive style or demeanour with which the animal is moving [19]. QBA uses a range of positive and negative qualitative terms such as 'relaxed', 'sociable', 'anxious' or 'fearful' in order to quantify an animal's experience in different situations and conditions [19].
The descriptive terms used in QBA can either be developed through an experimental procedure known as Free-Choice Profiling (FCP), in which each assessor generates his/her own terms based on the observation of animals in different situations [20,21], or are provided in the form of a pre-determined list of QBA terms given to observers to assess animals. When standardisation of measurement tools is required, for example for the purpose of on-farm welfare inspections, the use of a fixed-list designed for the purpose of welfare assessment in a particular species is more feasible than FCP. Pre-fixed QBA protocols are for example included in Welfare Quality 1 protocols for cattle, poultry and pigs [22] and in AWIN protocols for sheep, goats, horses, and donkeys [23][24][25], where they serve as a measure for positive emotional state.
Initial validation of the application of QBA to dogs was done by Walker et al. [26,27]. In these two studies Walker and colleagues applied FCP for the first time to assessing emotional expressivity in working dogs (all Beagles), and in shelter dogs in home pen and novel test pen environments. Outcomes showed, in the different dog groups, overlapping dimensions of emotional expressivity such as confidence/contentment, anxiousness/unsureness, alertness/ attentiveness or curiousness/inquisitiveness. Walker et al. [27] also showed significant and meaningful correlations between QBA dimensions and quantitative ethogram-based behavioural measures, suggesting the potential value of QBA as a welfare assessment tool for dogs. Arena and colleagues [28] further investigated this potential by applying FCP to shelter dogs in a wide variety of shelter environments and social contexts including outdoor/indoor pens, single/pair/group housing, and the presence/absence of human activity. Results showed that by presenting more complex contexts and giving the animals more opportunities to express a wider repertoire of emotions, the observers generated 'richer' expressive dimensions than was the case in Walker et al.'s [26,27] studies which used relatively standardised experimental settings.
The present study aimed to build on these outcomes by developing a fixed-term QBA terms list that might be integrated with existing on-shelter welfare assessment protocols. Adding QBA could provide information complementary to that provided by quantitative measures, extending a protocol's power to identify and detect emotional shifts in dogs across the positive and negative emotional spectrum [7,27,28]. Recognising that QBA relies on context-specific qualitative judgment of behavioural expression [19], terms included in a pre-fixed list should be representative of the large range of behavioural expressions that dogs could potentially show in variable kennel conditions. To achieve this goal, we generated a list of suitable terms based on the available dog behaviour and welfare literature, and then used an expert opinion survey to refine this list into a final set of 20 QBA terms. The inter-observer reliability of these terms was subsequently investigated by instructing eleven observers to score the emotional expressivity of dogs viewed in a sample of videos reflecting a wide range of shelter conditions.

Ethics statement
No special permission for use of dogs in such behavioural studies is required in Italy, since dogs were observed during their daily life and within their familiar shelter pens. When dogs were exposed to people, no physical interaction was required. We sent a formal request to carry out the study to all shelters managers and received permission from all of them.
All procedures were performed in full accordance with Italian legal regulations and the guidelines for the treatments of animals in behavioural research and teaching of the Association for the Study of Animal Behavior (ASAB).
No IRB approval was sought for observers participation as observers, but they provided informed signed consent to participate in the study and they were fully informed about the purpose and background of the study. Students ranged in age from 25 and 34 years.

Selection of terms
Literature review. In a previous QBA study by this research group [28], 13 observers generated a number of terms to describe the emotional expression of shelter dogs from videos, using a Free Choice Profiling (FCP) methodology [15]. The FCP in Arena et al. [28] generated over 50 terms and the analysis extracted three main consensus dimensions. From that work, we selected 16 high-loading terms characterising each of the three main dimensions (Table 1). These 16 terms were used as a starting-point for the current study. Additional qualitative terms were selected from the relevant scientific literature on dogs' emotions [26,27,29], behaviour [30], personality and temperament [31][32][33][34], and from other QBA research [13,35,36]. Thus, nine new terms were added (affectionate, aloof, attention-seeking, boisterous, excited, explorative, happy, self-confident and serene), creating a preliminary list of 25 terms. Expert opinion. To check for the appropriateness, relevance and ease of understanding of the 25 terms on the preliminary list, we performed an expert opinion survey. For this purpose, we selected a panel of 15 international experts in the field of dog personality and behaviour, shelter dog welfare and QBA methodology. The panel was composed of 12 females and 3 males. A one-round on-line survey was arranged using the SurveyMonkey 1 platform. In an introductory letter, experts were told that the goal of the project was to develop a comprehensive list of terms for qualitative assessment of the emotional state and welfare of dogs housed in different shelter environments. Eight out of 15 experts answered the survey anonymously.
For each of the 25 terms on the list, the authors provided a brief semantic characterisation, similarly as done previously for donkeys, horses and goats [25,37,38]. Such characterisations cannot be regarded as sharply delineated definitions similar to those provided for conventional ethograms, but rather are brief depictions containing both qualitative and quantitative elements that together should illustrate the meaning of a term, as appropriate for particular species and contexts. In the present study the Oxford English Dictionary (online version) was used as a starting-point, and terms were then further adapted using information from previous QBA studies on dogs [26][27][28], and previous QBA studies using fixed lists [25,37,38]. The experts were asked to score each term on the basis of four brief statements using a likert-scale from 1 to 5, where 1 corresponded to 'not at all', and 5 to 'completely' ( Table 2). Experts also had the possibility to add comments in a free space if wanted. Furthermore, at the end of the survey the experts could suggest a maximum of three terms that, in their opinion, were missing but that they considered important to describe the emotional state of sheltered dogs. This expert opinion was used to refine our original list by deleting and/or adding terms, in order to eliminate synonyms, select the terms most relevant to describing shelter dog emotional state, and ensure that terms were easy to understand.
Generating the final list. The SurveyMonkey 1 software automatically generated a matrix with the average scores assigned by the experts to each term. We then calculated the mean scores of each of the four statements for all terms ( Table 2). To decide whether to keep or delete a term, we used those total mean scores to establish thresholds for each statement: to be considered for inclusion, a term had to receive a mean score of 3 or more in statements #1, #3 and #4, and less than 3 in statement #2. Furthermore, we summarised and classified the negative comments provided by the experts as 'prone to anthropomorphism'; 'too generic'; 'too similar to other terms'; 'is not an emotional state'. We put aside inapplicable comments such as those repeating the same concept expressed in the statement (e.g. "not useful to assess animal welfare"), those referring to the definition given by the authors (e.g. "I would include 'interest' in the definition") and comments without justification (e.g. "I suggest to delete this term from the list").
Finally, we defined inclusion/exclusion criteria in order to refine our list. As exclusion criteria, we established that a term would be excluded from the list if it had at least: • two insufficient scores • one insufficient score and one negative comment • no insufficient scores, but three or more negative comments.

Inter-observer reliability
The final list of 20 terms resulting from the previous steps was subsequently used to test for inter-observer reliability. Animals and video-clips. A convenience sample of dogs in 13 kennels were videorecorded in seven different Italian shelters: three in Northern Italy (Emilia-Romagna and Piemonte Regions), two in the Centre (Abruzzo Region) and two in the South (Puglia Region). The dogs had been housed in the kennels on a long-term basis (> 6 months since admittance), either singly, in pairs, or in groups of 3 dogs or more. In Italy, group housing is a common practice. The aim was to record a sample of these dogs in different scenarios to capture a variety of dog behavioural expressions. The dogs were video-recorded in their kennels, with sound, using a smartphone (Samsung GT-I9100P) that had been mounted on a tripod positioned a few meters from the fence. Thus, a set of 13 video clips was obtained, with 4 clips showing singly housed dogs, 4 clips showing dogs housed in pairs, and 5 clips showing dogs in groups of 3 or more.
To generate a variety of behavioural expressions in the dogs, three scenarios (common in a shelter environment) were created, and the dog/s in each kennel was/were recorded for 2 minutes during one of these scenarios: 1. under normal conditions with no external intervention, 2. in the presence of an unknown person, or 3. in the presence of a familiar person. The unfamiliar person was one of two researchers (one female and one male), while the familiar person was a shelter operator available at the time of recording. The unfamiliar person followed a simple protocol, approaching the fence of the kennel and standing one metre from the fence ignoring the dog/s for 1 minute, and subsequently talking gently to the dog/s moving a hand slowly along the fence for another minute. Shelter operators were asked to enter the kennel and interact with the dog/s for 2 minutes. For the scenario of normal conditions, the dog/s were filmed for 2 minutes without any external intervention. In the two intervention scenarios the persons were visible in the videos. Scenarios were assigned to kennels so as to obtain a balanced distribution of scenarios across kennels and dog groupings.
The 2 minute clips thus obtained were edited (using the Avidemux 2.6.8 programme) so as to ensure that all dogs in a kennel (whether housed singly, in pairs or in groups) were visible throughout a clip, resulting in a set of 13 clips with an average length of 1.5 minutes. This final set of clips consisted of 5 clips in normal conditions with no external intervention (2 single dogs, 1 pair, and 2 groups), 4 clips with an unknown person (1 single, 2 pairs, and 1 group), and 4 clips with a familiar person (1 single, 1 pair, and 2 groups).
Observers and observation session. The observation session was carried out in Italy. Eleven participants to a course for dog trainers were recruited as observers (8 females and 3 males). All observers had previous experience with dogs and owned at least one dog. Five of them were dog trainers, four were volunteers at dog shelters, two had graduated at the Faculty of Animal Welfare and Protection, and one was a veterinarian. None of the observers had previous experience with QBA.
Before starting the assessment of the video clips, approximately 1 hour was dedicated to an introduction with the goal of explaining the aim of the study and the operative procedures. Observers were given a list of the QBA terms to be used for scoring with a brief characterisation of the meaning of each term (see paragraph above for a description of how these characterisations were generated). These characterisations were translated onto Italian to facilitate scoring by the observers. Observers had the opportunity to discuss and align their understanding of these terms.
Observers were told that the aim of the study was to investigate whether observers can agree in using a fixed list of QBA terms to assess emotional expressivity in shelter dogs. Emotional expressivity was defined as an animal's dynamic style of interaction with its environment, conspecifics and humans (how the animal behaves as opposed to what it does). Observers were told that when scoring a pair or group of dogs housed together in a kennel, this was to be done not by assessing individual dogs and then adding them up, but by assessing the overall level of intensity at which a certain expressive quality was present in the interactions of the whole group. Each observer was provided with scoring sheets (one for each video clip) on which Visual Analogue Scales (VAS) of 125 mm of length were placed next to each term. Participants were told they must score the dog/s in each clip on every qualitative term on the list. They were instructed on how to use the VAS for scoring: the left end of the scale corresponded to the minimum score (0 mm), meaning the expressive quality indicated by the term was entirely absent in that dog or group of dogs, whereas the right end represented the maximum score (125 mm), meaning that the expressive quality indicated by the term was strongly dominant in that dog or group of dogs. Observers were asked to avoid talking during the session.
Once instructions had ended, observers watched the 13 clips projected onto a lecture hall screen and, after each clip, had approximately 2 mins to score the dogs' expressions on the rating scales, by drawing a vertical line across the VAS at the point they felt was appropriate. A score was assigned to each term for each clip by measuring with a ruler the distance in millimetres between the minimum point of the VAS and the point where the observer marked the line.
Statistical analysis. IBM SPSS Statistics 21 software (IBM Corp., 2012) was used for statistical analysis. The QBA scores gathered by the 11 observers for the 13 video clips on 20 QBA terms were all analysed together as part of one Principal Component Analysis (PCA, correlation matrix, varimax rotation). Before running the PCA, we explored the sampling adequacy by checking Kaiser-Meyer-Olkin (KMO) and anti-correlation image matrix values. PCA is a form of multi-variate analysis that can be used to reduce a large set of possibly correlated variables (QBA terms) to a small set of linearly uncorrelated variables (principle components or PCs, also referred to as dimensions), that still reflects most of the information contained in the large set. In other words, PCA serves to identify patterns of difference (dimensions) that explain most of the variation between units (video clips) as measured by a large set of variables (terms). The PCA assigned scores to each video clip for each of the 11 observers on each of the main extracted PCs/dimensions, and these scores were used as input for the analysis of interobserver reliability in 2 different ways. First, how well the ranking of video clip scores from low to high on each of the main PCs was aligned between different observers was determined using Kendall's W coefficient of concordance [37]. Kendall's W values can vary from 0 (no agreement at all) to 1 (complete agreement), with values higher than 0.6 showing substantial agreement. Secondly, whether observers differed in the quantitative range occupied by clip scores on each PC, and the mean value resulting from that range, was investigated using a oneway ANOVA with observer as fixed factor and video clip as random factor. Post-hoc multiple comparisons (using Tukey HSD test) and homogeneous subsets were inspected to check which and how many observers were in disagreement. Finally, we also assessed alignment between observers in their ranking of clips on each separate QBA term, using the Kendall's W test.

Generation of the list of terms
Mean scores attributed to each statement and for each term by the panel of responding experts are summarised in Table 3.
On the basis of the inclusion/exclusion criteria, nine terms were deleted from the list: boisterous, aloof, timorous, tranquil, serene, apathetic (all had at least two insufficient scores, and Table 3. Mean scores attributed by the experts to each term for each statement. #Q: key words "comprehensible", "ambiguous", "useful" and "relevant" correspond to the statements as listed in Table 2. some also had negative comments), affectionate, self-confident (one insufficient score and at least one negative comment), happy (at least three negative comments). We evaluated the additional terms suggested by the panel of experts, and we added interested, depressed and aggressive (each one, suggested three times) and reactive (suggested twice). Finally, the term uncomfortable was replaced by its positive form 'comfortable', because this avoids having to score a double negative (i.e. lower scores implying an animal to be not at all uncomfortable) and poses less risk of confusion for observers. As a result, the final list was composed of 20 terms. (Table 4). Overall this list was balanced in term of their semantic meaning (positive versus negative emotional state) and level of arousal they expressed (high versus low arousal). As before, brief characterisations of each QBA term were produced using procedures described in the methods above.

Terms
Inter-observer reliability of terms KMO was well above the threshold of 0.50 (0.83) and the anti-image correlations were also high for all individual variables (> 0.55). After visual inspection of the scree plot, four components were extracted which together explained 70.9% of the total variance and all with Eigenvalues >1 ( Table 5). The observers' agreement in the ranking of video clips on each of these four dimensions ranged from 0.60 to 0.80, all significant at p < 0.001 (Table 5). Table 6 shows the loadings of each term on the four principal components (PC). PC1 was characterized by positive terms curious/attention-seeking/playful/excited/sociable/interested and explorative; PC2 characterized dogs as ranging from comfortable/relaxed to anxious/nervous/stressed. Fig 1 shows the 20 QBA descriptive terms plotted along these first two PCs. PC3 was characterized by the terms fearful/hesitant/wary, and finally, PC4 by the terms depressed/ bored. Fig 2 shows the 20 QBA terms plotted along PC3 and PC4. ANOVA showed that there was a significant effect of observer on mean clip scores on all 4 PCs (PC1: F = 13.7, p < 0.001; PC2: F = 3.5, p < 0.001; PC3: F = 15.6, p<0.001; PC4: F = 6.4, p<0.001). This means that although observers were in agreement on the ranking of clips on the 4 PCs (see above), they differed with respect to the mean value generated by PCA for clip scores on the 4 PCs. Inspection of post-hoc multiple comparisons indicated that for PCs 1, 3 and 4 most observers (>7) presented homogeneous means, with the remaining 3-4 observers scoring differently from the rest of the group. The observers scoring differently were often the same across all PCs. Post-hoc analysis for PC2 highlighted one clear outlier (observer 10), with the remaining observers not differing from each other in mean clip score. When running the ANOVA again, without observer 10, the observer effect on PC2 was not significant (F = 1.62, p = 0.12).
The Kendall W values for separate QBA terms were all significantly different from chance (p<0.001), but only 8 terms reached values higher than 0.60, while the term 'depressed' showed a particularly low value (Table 7).

Discussion
The aim of this research was to develop and test the inter-observer reliability of a fixed Qualitative Behaviour Assessment (QBA) rating scale for the purpose of field assessment of emotional expressivity in dogs in a shelter environment. Attempts had previously been made to develop and validate welfare assessment protocols for kennelled dogs [4,7], however, these protocols did not include measures addressing animals' emotions. To our knowledge, our study is the first to generate a fixed QBA rating scale for the assessment of shelter dogs' emotion. A previous Free Choice Profiling study served as a basis for generating an initial list of terms [28], and subsequently we searched the literature and consulted expert opinion to create a final list of 20 Qualitative Behavioural Assessment for shelter dogs terms covering as much as possible the relevant behavioural expressions appropriate for the assessment of dogs in kennels. Eleven observers then used this list to score kennelled dogs in 13 video clips, and PCA of these scores generated 4 Principal Components (PCs) explaining 70.9% of the total variance. PC1 was characterised by the terms 'curious/attention-seeking/ playful/excited/sociable', describing the animals' interest in their environment and their engagement with people and pen-mates. PC2 ranged from 'comfortable/relaxed' to 'anxious/ nervous/stressed', a demeanour indicating a state of comfort versus anxiety of dogs in the kennel environment. PC3 describes fearful demeanour summarised by the terms 'fearful/hesitant/ wary', whilst PC4 is characterised by the terms 'depressed/bored', which indicates a negative mood of lower energy than fear, anxious and stress. Reference to emotional states with positive and negative valence has previously been made in assessments of shelter dog welfare: Titulaer et al. [39] for example used cognitive measures to assess the effect of long versus short term Qualitative Behavioural Assessment for shelter dogs permanence in kennel, whereas Mendl et al. [40] demonstrated how shelter dogs suffering from separation anxiety were more likely to be experiencing a negative affective state. As discussed in the introduction, Walker et al. [27] used comparable QBA dimensions to describe the emotional state of dogs in kennel versus home environment and compare short versus long term housing. Nevertheless, most of the scientific literature on shelter dogs' welfare has so far mainly relied on quantitative ethogram-based behavioural measures (e.g. time spent walking, sleeping etc.) and physiological indicators [41][42][43]. These four dimensions of dog emotional expressivity address dynamic aspects of welfare including important subtle differentiations, such as that between relaxation and depression, or between emotionally positive and negative excitement (excited vs nervous). From a 'wholeanimal' perspective, the aim of QBA is not to identify a minimal set of core descriptive terms, but to capture wider patterns of expression and their context through a larger range of terms. Three terms (i.e. reactive, alert and aggressive) did not load highly on any of the four extracted PCs. Looking at Figs 1 and 2 it appears these terms co-load with stress and anxiety on PC2, and group together on PC4 opposite to 'bored/depressed', thus appearing to mostly reflect a tense reactivity associated with a negative emotional state in dogs struggling to cope with the environment. However, terms such as 'alert' and 'reactive' do not in themselves have a strong negative connotation-animals in positive playful mood can also be alert and reactive, or even mildly aggressive, and this likely explains why these terms often do not necessarily load highly on either positive or negative ends of emotional dimensions. This, however, does not make them superfluous, they can still serve to support and specify patterns of positive and negative mood in dogs assessed in different situations. Overall, the four PCs identified by this study cover the four quadrants of emotional expression defined by valence and arousal axes which tend to be typical of dimensional models of affect [44], and as such should be expected to offer a comprehensive assessment tool of dog emotional expression.
An important element of any measurement tool is its reliability, i.e. the extent to which measures provide consistent results when applied by different assessors [45,46]. To reflect QBA's whole-animal character, reliability is primarily investigated at Principle Component (PC) level, using the PC scores generated through PCA, rather than scores on separate QBA terms. Each of the 4 PCs provides scores for the 13 video clips for each of the 11 observers, and so these scores could be compared. As described in Methods (line 246, this was done in two ways: 1. comparing the ranking of the 13 clips on each PC, i.e. how clip scores were ordered from low to high on each PC, and 2. comparing the quantitative range occupied by clip scores Qualitative Behavioural Assessment for shelter dogs on each PC, and the mean value resulting from that range. In the present study, agreement on ranking the 13 clips was good for all 4 PCs, (W = 0.60-0.80), which concurs with previous studies testing the reliability of QBA fixed lists in a range of species [25,47,48], though some recent studies indicate that in field conditions as opposed to video-based conditions, achieving good agreement on ranking of animals on QBA dimensions can be more difficult [48,49]. There was however a significant effect of observer on the mean value of clip scores on all 4 PCs, suggesting consistent differences in how observers quantified the various QBA terms using VAS scales. Post-hoc analysis, however, revealed that the majority of observers (60-70%) did not differ significantly in mean clip score on the 4 PCs, with fewer than 4 observers diverging significantly from this mean. Some of these outliers tended to be the same, suggesting these observers had generally different scoring habits across all terms. For PC2 mean scores were quite homogeneous, in fact, there was only one outlier. When running the analysis without that observer, no significant difference between the remaining observers was found for PC2. The higher homogeneity between scores associated with PC2 could be due to PC2 being the only dimension for which high loading terms described a clear contrast between positive and negative expressions. For the other three dimensions the emphasis was on either a positive or a negative expression, providing less conceptual anchoring for observers to place their scores on the visual analogue scale, making variation between observer scores more likely.
Other studies have reported variation in 'scoring styles' between observers (e.g. [45]), creating observer effects on mean animal scores on some PCs.
As noted above, observer effects at PC level will mostly be due to differences in how observers use VAS scales to quantify animal expressions on separate QBA terms, and it is therefore useful to also inspect reliability at the level of separate terms. Most QBA studies report low agreement for at least a few terms, even when overall agreement at PC level is good (e.g. [25]). In the present study, 12 out of 20 individual terms showed moderate agreement between 0.50 and 0.60, while 3 terms (depressed, explorative, aggressive) fell below 0.50. The lack in agreement for these 3 terms may be due to the low prevalence of these expressions in the video clips, or, more generally, may indicate that observers found these expressions difficult to assess. Recognising aggression for example has also been difficult for observers in other studies of dog expression [30,50]. However low initial reliability does not necessarily imply a term should be removed from the list, as it may reflect an important aspect of an animal's experience and wellbeing. Instead efforts could be made to improve a term's descriptive characterisation and provide better training materials.
Generally, the issue of validity, i.e. the question to which extent observers are capable of correctly identifying dog emotional expressions, remains open to debate [29,46]. Some studies found most observers to be reasonably skilled while others did not (e.g. [30,50,51]. One relevant factor might be the method in which observers are presented with information. Some studies instruct observers to focus on particular details of a dog's expression (e.g. facial expression, play), and may present these in relatively fixed format (e.g. photographs or verbal descriptions), whereas other studies ask observers to rate the whole moving animal in different contexts shown on video [45]. QBA falls in the latter category, hypothesizing that observing the dynamic whole animal in context provides more expressive clues for interpretation than focusing on isolated behaviours and signals (e.g. [35]). The current study supports the viability of this approach, but also indicates the need for attention to specific terms and aspects of expression that may detract from the overall reliability of the QBA tool. This is particularly true when scoring animals live, in the far more variable conditions of practical applied settings, where observers are more likely to be distracted and influenced by different aspects of the physical and social environment [48,49]. However, research on QBA assessment in donkeys and goats under field conditions has shown that training observers, i.e. taking time to discuss the meaning of individual terms and to compare how these terms are used for scoring, while located in the field conditions, can significantly improve the reliability of such terms [25]. Furthermore, as dogs are known to respond differently to male and female observers [52], it is important to adjust any live training exercises for such differences, so that an observer's gender does not unduly bias a dog's assessment. Thus, it is clear that any future practical application of the QBA term list for shelter dog welfare proposed in this study should provide ample training (including field assessments), until consistently high agreement between observers can be reached.
In recent efforts to promote the experience of positive emotions in captive/domestic animals (e.g. through appropriate environment enrichment), welfare scientists have directed a lot of attention toward the validation of outcome (i.e. animal-based) measures that could be integrated into welfare assessment tools to ensure that standards of positive welfare are maximised [53]. QBA has been successfully applied to farm animal welfare assessment protocols [22] and it has the potential to provide a valuable measure of dogs' behavioural expression in confinement. Effective interventions to minimise poor welfare in shelter dogs should be based on the evaluation of dogs' individual ability to cope and adapt to confinement in kennels. QBA focuses on the dynamic expressivity of behavioural demeanour, describing and quantifying the emotional connotation of this expressivity that we naturally perceive. The comprehensive range of qualitative terms developed in this study may function as an integrative screening tool of kennelled dogs' welfare, with special attention to positive welfare. A low expression of certain terms might prompt the implementation of certain types of enrichment that may have been neglected. For example, it has been demonstrated that stress upon entering a rehoming shelter can be mitigated by short sessions of positive human interaction [5]. Hence, low overall scores on the terms characterising PC1 may raise awareness for the need of targeted socialisation programs on animals struggling to adapt to the novel kennel environment. The QBA pre-fixed list developed in this study could be used as a monitoring system to detect early warning signs of reduced welfare state in kennelled dogs, as well as detect areas of improvement to ensure good quality of life.

Conclusions
The present study aligns with previous QBA studies on other animal species in finding mainly good inter-observer reliability for a fixed-list of QBA terms applied to video-based assessments of dogs living in a shelter environment. Agreement in ranking dogs on the four expressive dimensions was good, but a few observers produced significantly different mean scores for observed dogs on the four main PCs, indicating a need for training and alignment of observer 'scoring styles' [47]. The scientific community recognises that welfare is a complex multidimensional concept, and that no single indicator can be considered exhaustive to evaluate the welfare of animals. It will be preferable therefore to apply QBA in conjunction with other physiological, behavioural or health indicators for animal welfare [17]. The QBA scoring tool developed and tested in the present study can potentially be integrated into existing welfare assessment protocols for shelter dogs, and strengthen the power of those protocols to assess and evaluate the animals' experience [7]. However, to ensure acceptable inter-observer reliability of these protocols, on-going training in the practical application of the QBA tool should be provided until consistently high agreement between observers can be reached. Further research should clarify whether and how reliable use of this tool for can be achieved.