Repeatability, Reproducibility, Separative Power and Subjectivity of Different Fish Morphometric Analysis Methods

We compared the repeatability, reproducibility (intra- and inter-measurer similarity), separative power and subjectivity (measurer effect on results) of four morphometric methods frequently used in ichthyological research: the “traditional” caliper-based (TRA) and truss-network (TRU) distance methods, and two geometric methods that compare landmark coordinates on the body (GMB) and scales (GMS). In each case, measurements were performed three times by three measurers on the same specimens of three common cyprinid species (roach Rutilus rutilus (Linnaeus, 1758), bleak Alburnus alburnus (Linnaeus, 1758) and Prussian carp Carassius gibelio (Bloch, 1782)) collected from three closely-situated sites in the Lake Balaton catchment (Hungary) in 2014. TRA measurements were made on conserved specimens using a digital caliper, while TRU, GMB and GMS measurements were undertaken on digital images of the bodies and scales. In most cases, intra-measurer repeatability was similar among measurers. While all four methods were able to differentiate the source populations, significant differences were observed in their repeatability, reproducibility and subjectivity. GMB displayed the highest overall repeatability and reproducibility and was least burdened by measurer effect. While GMS showed similar repeatability to GMB when fish scales had a characteristic shape, it showed significantly lower reproducibility (compared with its repeatability) for each species than the other methods. TRU showed similar repeatability to GMS. TRA was the least applicable method as measurements were obtained from the fish itself, resulting in poor repeatability and reproducibility. Although all four methods showed some degree of subjectivity, TRA was the only method where population-level detachment was entirely overwritten by measurer effect.
Based on these results, we recommend a) avoidance of aggregating different measurers’ datasets when using TRA and GMS methods; and b) use of image-based methods for morphometric surveys. Automation of the morphometric workflow would also reduce any measurer effect and eliminate measurement and data-input errors.


Introduction
Morphological characteristics have been of fundamental importance in biology since the beginnings of the discipline. Indeed, the taxonomic classification of organisms [1] and the first steps in understanding the evolution of life [2] both came about through morphological descriptions of different forms. Morphologic investigations compare and analyse "meristic" and/or continuous "measurable" morphometric variables [3]. In the latter case, the morphometric characteristics selected are translated into numeric values so they can be analysed using appropriate statistical methods [4,5]. Morphologic investigations can be applied at various levels. Until recently, for example, morphological methods were generally used to differentiate species [6,7,8] or to describe intraspecific differences, such as sexual dimorphism [9,10] and/or population level detachments [11,12]. Moreover, morphometric surveys can be applied to the entire body [13,14] or to individual body parts, e.g. a fish scale, vertebra or otolith [15,16,17,18], depending on the goal of the survey. Over the last century, however, new morphometric methods have been developed. The oldest of these, the "traditional" distance-based method (hereafter TRA), measures the size (e.g. minimum body height, eye diameter, head length) and/or distance between specific body parts (e.g. prepelvic distance) of an individual [19] (e.g. see Fig 1A) and analyses these data further. The instruments used to measure distance will vary (e.g. tape measure or calipers) and the variables selected for measurement will vary depending on the size and the shape of the individual surveyed.
The TRA method was used exclusively until the introduction of the box-truss network (hereafter TRU) method at the beginning of the 1980s [20]. While the TRU method also uses distance data, it relies on specifically identifiable, homologous points in order to eliminate many of the uncertainties inherent in the TRA method. For example, in ichthyological studies, individual variation in body shape may result in a shift of maximum and minimum body height (Fig 1A) along the fish's body. The distance between the base of the dorsal and anal fin (see Fig 1C) can be more precisely defined, however, as the distance is homologous in each individual measured. For more details see [21].
A new family of morphometric methodologies was developed at the end of the 20th century [22]. These 'landmark-based geometric methods' use coordinates of homologous points taken from digital images of the study objects, and the data transformations and standardisations they require demand a strong computational background. As the use of personal computers has become widespread, however, such methods have quickly become the most widely used [23], though distance-based methods are also still applied [24,25].
Despite the limited information available on the suitability and/or sensitivity of the different methods available, it is a generally accepted 'fact' that geometric methods display higher separative power than distance-based methods [26,27]. Moreover, geometric methods are also generally considered less destructive, faster and cheaper than TRA methods [28,29,30]. While there have been some methodological studies assessing the quality of data and results obtained using these methods [31,32,33], very few have compared and quantified the applicability of the different methods now used in ichthyological studies (see: [34,35]); and those that have done so have usually been unable to provide any statistical confirmation. It is a generally accepted tenet of science that unrepeatable or unreproducible measurements have no validity [36]; yet there is still very little information available regarding the repeatability (i.e. closeness of agreement between independent results obtained on identical subjects using the same method under the same conditions) or reproducibility (i.e. closeness of agreement between independent results obtained using the same method on identical subjects but under different conditions [37,38]) of the different morphometric methods used today in ichthyology. Moreover, numerous studies compare or combine the datasets of different measurers [35], meaning that, in most cases, the results will be more or less biased by a 'measurer effect', despite the use of well-defined protocols [39,40,41]. As this phenomenon is generally well-recognised, it is frequently advised that the data of just one measurer be used, especially in the case of population-level studies [42,43,44,45]. Less well-known, however, is how measurer variability affects the results of different morphometric methods and how this can change based on a species' physical characteristics.
In this study, our aim is to survey the applicability of four different morphometric methods based on five aspects: 1) repeatability and 2) reproducibility of measurement, 3) separative power, 4) degree of measurer effect and 5) how such features change depending on the physical characteristics of the different species analysed.
Fig 1. Morphometric landmarks and distances recorded by the three measurers. A: Distances measured when using the "traditional" (TRA) method (blue lines): 1-height of head, 2-preorbital distance, 3-postorbital distance, 4-head length, 5-prepectoral distance, 6-length of pectoral fin, 7-prepelvic distance, 8-length of pelvic fin, 9-predorsal distance, 10-length of dorsal fin, 11-preanal distance, 12-length of anal fin, 13-length of caudal peduncle, 14-minimum body depth, 15-maximum body depth, SL-standard length. The oval and arrow indicate the scale sampling area. B: codes for the seven landmarks recorded on scales (orange): 1-left cranial edge, 2-cranial end, 3-right cranial edge, 4-left caudal edge, 5-focus, 6-right caudal edge, 7-caudal peak. C: codes for the 11 landmarks recorded on the body (red): 1-tip of snout, 2-occiput, 3-base of dorsal fin, 4-upper base of caudal fin, 5-lower base of caudal fin, 6-base of anal fin, 7-base of pelvic fin, 8-base of pectoral fin, 9-lower part of head, 10-posterior point of opercule, 11-middle point of eye. Fifteen between-landmark distances were used for truss-network (TRU) analysis, indicated by green letters and lines. Color codes correspond with Figs 2 and 3. (For the raw datasets see S1-S4 Tables).

Ethics statement
This study was undertaken following all relevant national and international guidelines pertaining to the care and welfare of fish. Fish collection was authorised by the Ministry of Rural Development (Permit no.: EHVF/188-1/2014). All procedures used in this study were approved by the Committee on the Ethics of Animal Experiments of the Hungarian Academy of Sciences' Centre for Ecological Research (Permit no.: VE-I-001/01890-3/2013). During sampling, every effort was made to minimise the suffering of fish and all fish were anaesthetised with a lethal dose of clove oil prior to analysis. No endangered species (according to the IUCN Red List of Threatened Species v. 2015-4 [www.iucnredlist.org]) were caught during this study.

Sample collection and data management
In this study we applied two distance-based morphometric methods (TRA, TRU) and two geometric methods using body (GMB) and scale (GMS) landmarks to three common and widespread cyprinid fish species. Thirty specimens each of roach Rutilus rutilus (Linnaeus, 1758), bleak Alburnus alburnus (Linnaeus, 1758) and Prussian (gibel) carp Carassius gibelio (Bloch, 1782) were collected by electrofishing from three sampling sites (Site1: N46.63474 E17.17433, Site2: N46.79983 E17.38822, Site3: N46.75362 E17.56720) in the Lake Balaton catchment area (Hungary) in 2014. The sampling sites were situated relatively close to each other, with a straight-line distance of 24 km between sites 1 and 2, 33 km between sites 1 and 3, and 15 km between sites 2 and 3. All fish were euthanised with a lethal dose of clove oil before fixing in 4% formaldehyde [46], whereupon the samples were moved to the laboratory and tagged with a unique identification code for further analysis. After tagging, each fish was placed flat on a table surface and the left side was photographed from a perpendicular angle using a tripod-mounted Nikon D50 digital camera for GMB and TRU analysis. A single scale was then removed from the area anterior to the dorsal fin of each specimen for GMS analysis (see Fig 1). All scales were placed between two glass slides and scanned with a Hewlett Packard ScanJet 5300C XPA scanner at 2400 dpi.
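The inter-site distances quoted above can be checked from the published coordinates. The following is a minimal sketch, assuming a spherical Earth with a mean radius of 6371 km and the haversine formula; the names `SITES` and `haversine_km` are illustrative and do not come from the study. Under these assumptions the computed distances should fall close to the reported 24, 33 and 15 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

# Sampling site coordinates as given in the text (decimal degrees N, E).
SITES = {
    "Site1": (46.63474, 17.17433),
    "Site2": (46.79983, 17.38822),
    "Site3": (46.75362, 17.56720),
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


if __name__ == "__main__":
    for first, second in (("Site1", "Site2"), ("Site1", "Site3"), ("Site2", "Site3")):
        distance = haversine_km(SITES[first], SITES[second])
        print(f"{first}-{second}: {distance:.1f} km")
```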
Eleven easily defined landmarks were recorded on the body images for GMB, and seven on the scale images for GMS (Fig 1C and 1B), using tpsUtil and tpsDig2 digital imaging software, both of which were specifically developed for digitising landmarks and outlines for geometric morphometric analysis (for more details see: [47,48]). Sixteen inter-landmark distances were recorded on the digital images for TRU analysis, the distances being measured using the freeware ImageJ (for more details see: [49]). For TRA analysis, 16 commonly-used taxonomic measurements [50,51,35] were taken from the left side of each individual (i.e. not from a photographic image) using a digital caliper, all data being recorded to the nearest 0.01 mm. In total, 26,730 measurements (three species x three populations x 30 individuals x three measurers x three repeats x 11 variables) were undertaken during GMB analysis, 17,010 measurements during GMS analysis (three species x three populations x 30 individuals x three measurers x three repeats x seven variables), and 38,880 measurements each during TRU and TRA analysis (three species x three populations x 30 individuals x three measurers x three repeats x 16 variables). For definitions of the measured characteristics, and for the most important features of the methods tested and compared in this study, see Table 1.
All measurements were undertaken by the same three measurers (M1, M2, M3), the whole sample set being measured three times. Before the survey started, the three measurers discussed the actual methods of measurement (TRA, TRU) and image marking (GMB, GMS) in order to anticipate any discrepancies originating from different conventions and/or personal methods (see: [52]). All three measurers were right-handed, thereby preventing bias in the results from left- or right-handedness (see: [53]). During TRA measurement, the measurers paid special attention to avoiding discrepancies arising from degradation of individual fish. Since all measurements were made within a relatively short period (two months) after sample collection, the chances of preservative-caused morphometric differences were considered negligible [46]. All specimens measured have been deposited within the institution's fish collection and all fish and images therefrom are accessible from the corresponding author.
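The measurement totals follow directly from the sampling design. As a short worked check (the constant and function names are illustrative only), the product of the design factors and the number of variables per method reproduces the figures given in the text:

```python
# Design factors as stated in the text.
SPECIES = 3
POPULATIONS = 3
INDIVIDUALS = 30   # per species and population
MEASURERS = 3
REPEATS = 3

# Number of variables (landmarks or distances) recorded per method.
VARIABLES = {"GMB": 11, "GMS": 7, "TRU": 16, "TRA": 16}


def total_measurements(n_variables):
    """Total number of recorded values for a method with n_variables variables."""
    return SPECIES * POPULATIONS * INDIVIDUALS * MEASURERS * REPEATS * n_variables


totals = {method: total_measurements(n) for method, n in VARIABLES.items()}

if __name__ == "__main__":
    for method, total in totals.items():
        print(f"{method}: {total:,} measurements")
```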
To eliminate any size effect in the TRA and TRU datasets [54], we used the allometric formula of Elliott et al. [55], i.e. $M_{adj} = M (L_s / L_0)^b$, where $M$ is the original measurement, $M_{adj}$ is the size-adjusted measurement, $L_0$ is the standard length (SL) of the fish, $L_s$ is the overall mean SL for all fish from all samples in each analysis, and $b$ is the allometric exponent, estimated for each character as the slope of the regression of $\log M$ on $\log L_0$. The standardised data were rechecked by correlating against the original SL values. For GMB and GMS, a full Procrustes fit was undertaken on the landmark data, followed by multivariate regression analysis on the logarithm of Centroid Size (logCS) [56]. Statistical analysis was performed on the residuals of the regression analysis in order to eliminate any size effect.
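The allometric size adjustment for the distance data can be illustrated as follows. This is a minimal sketch rather than the authors' implementation: it assumes that the exponent b is estimated for a single character by ordinary least squares on log10-transformed values, and the function names `allometric_slope` and `size_adjust` are hypothetical.

```python
import math


def allometric_slope(measurements, standard_lengths):
    """Least-squares slope b of log10(M) regressed on log10(L0)."""
    xs = [math.log10(length) for length in standard_lengths]
    ys = [math.log10(value) for value in measurements]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def size_adjust(measurements, standard_lengths):
    """Return (adjusted values, b) using M_adj = M * (Ls / L0) ** b."""
    b = allometric_slope(measurements, standard_lengths)
    mean_sl = sum(standard_lengths) / len(standard_lengths)  # Ls
    adjusted = [m * (mean_sl / l0) ** b
                for m, l0 in zip(measurements, standard_lengths)]
    return adjusted, b


if __name__ == "__main__":
    # Synthetic example: a character that scales as 0.25 * SL ** 1.1.
    sl = [60.0, 80.0, 100.0, 120.0, 140.0]
    head_length = [0.25 * length ** 1.1 for length in sl]
    adjusted, b = size_adjust(head_length, sl)
    print(f"b = {b:.3f}")
    print([round(value, 3) for value in adjusted])
```

For data that follow a pure power law, as in the synthetic example, every adjusted value collapses to the value expected at the mean SL, which is why the adjusted data should no longer correlate with SL.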
There is no generally accepted protocol for testing repeatability and reproducibility [38]. In some cases, each measured morphometric variable is tested independently on a selection of specimens [57,58,59]; alternatively, the repeatedly recorded entire datasets of the whole analysed stocks can be compared [60,61]. In our case, we chose to follow the latter method, with repeatability calculated as 'intra-measurer similarity of three independently recorded datasets obtained from the same population by the same method'. Intra-measurer similarity was computed as follows: each of the three datasets derived from the same population was converted into a distance matrix using Euclidean distance and compared by pairwise Mantel-tests [62]. Pairwise comparison correlation coefficients (R) and significance (p) values were used to characterise the level of similarity between the three repeated measurements. Values for R ranged between 0 and 1, with 0 representing complete differentiation and 1 representing complete agreement (repeatability) between the two datasets. The R values were arranged into groups in order to assess repeatability at three different consecutive levels: I-measurer, II-species and III-method. Reproducibility was computed in a similar manner, but with inter-measurer similarity of independent datasets compared using the same method from the same population. The R values of pairwise Mantel-tests were then arranged into groups at two consecutive levels: I-species, and II-method. Differences found between groups at consecutive levels were tested using the non-parametric Kruskal-Wallis test.
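The similarity measure described above can be sketched as a permutation-based Mantel test. The study itself used PAST; the code below is only an illustrative standard-library version, and it assumes a Pearson correlation between the upper triangles of the two Euclidean distance matrices with a one-sided permutation p-value. All function names are hypothetical.

```python
import math
import random


def distance_matrix(rows):
    """Euclidean distance matrix between individuals (rows of variables)."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Upper-triangle entries after reordering rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(d1, d2, permutations=999, seed=0):
    """Return (R, p) for two distance matrices over the same individuals."""
    identity = list(range(len(d1)))
    v1 = upper_triangle(d1, identity)
    r_obs = pearson(v1, upper_triangle(d2, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute individuals in the second matrix only
        if pearson(v1, upper_triangle(d2, order)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic "first repeat" and a noisy "second repeat" of the same fish.
    first = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(12)]
    second = [[v + rng.gauss(0, 0.05) for v in row] for row in first]
    r, p = mantel(distance_matrix(first), distance_matrix(second))
    print(f"R = {r:.3f}, p = {p:.3f}")
```

In this scheme, three repeats by one measurer yield three pairwise R values per population, which matches the nine values per box reported for the three populations of each species.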
Separative power and subjectivity (i.e. measurer's influence on the results) were assessed via a data matrix containing a randomly chosen dataset from the three repeats on the same individual. In each case, the results were analysed using Canonical Variate Analysis (CVA) and two-way permutational ANOVA (PERMANOVA) [63] of Euclidean distance with 9 999 permutations. The analysis was performed independently for each method and for each species. All statistical analyses were carried out using PAST v.2.17c software [64].
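To illustrate the permutational logic, the sketch below computes a pseudo-F statistic from a Euclidean distance matrix and obtains its p-value by shuffling group labels. Note that this is a simplified one-way version with a single factor (e.g. site), not the two-way site x measurer design analysed in PAST for this study; the function names and the synthetic data are hypothetical.

```python
import math
import random


def distance_matrix(rows):
    """Euclidean distance matrix between observations."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def pseudo_f(dist, labels):
    """One-way PERMANOVA pseudo-F from squared pairwise distances."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for group in groups:
        idx = [i for i, label in enumerate(labels) if label == group]
        ss_within += sum(dist[i][j] ** 2
                         for pos, i in enumerate(idx)
                         for j in idx[pos + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(dist, labels, permutations=9999, seed=0):
    """Return (pseudo-F, permutation p-value) for one grouping factor."""
    rng = random.Random(seed)
    f_obs = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # reassign group labels at random
        if pseudo_f(dist, shuffled) >= f_obs:
            hits += 1
    return f_obs, (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    # Two clearly separated synthetic "populations" of six fish each.
    site_a = [[0.1 * k, 0.0] for k in range(6)]
    site_b = [[5.0 + 0.1 * k, 5.0] for k in range(6)]
    labels = ["A"] * 6 + ["B"] * 6
    f_value, p_value = permanova(distance_matrix(site_a + site_b), labels, 999)
    print(f"pseudo-F = {f_value:.1f}, p = {p_value:.3f}")
```

A two-way analysis would partition the among-group sum of squares into site, measurer and interaction terms, each with its own pseudo-F; comparing the F values for site and measurer is what allows their relative importance to be judged.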

Results
Raw data and standardisation process
TRA measurement indicated SLs ranging from 61.1 to 130.9 mm for bleak, 64.5 to 135.7 mm for roach and 43.7 to 157.7 mm for Prussian carp. All raw data for GMB and GMS landmark analysis and the variables measured for TRU and TRA analysis are presented in supplementary S1-S4 Tables. None of the variables measured showed any significant correlation with SL data after standardisation; hence, all variables were used for further analysis.

Repeatability and reproducibility
Mean repeatability values indicated only slight differences between most comparisons at level I (measurer), with only one of 36 crosschecks (2.8%) showing significant inter-measurer differentiation (Fig 2, level I). Moreover, most R values derived from different measurer datasets showed a similar range-spread for each species.
In comparison, major differences were found in repeatability at the two higher levels (Fig 2, levels II and III). At level II (species), mean repeatability (mean ± SD) of GMB measurements on Prussian carp was significantly lower (0.820 ± 0.13) than that of the other two species (roach 0.943 ± 0.02, bleak 0.934 ± 0.05; no significant difference between roach and bleak). On the other hand, both bleak and roach displayed significantly lower GMS repeatability (bleak 0.658 ± 0.20, roach 0.528 ± 0.20) than Prussian carp (0.854 ± 0.06). Using TRA, significant differences were observed between repeatability values for bleak (0.226 ± 0.11) and Prussian carp (0.363 ± 0.16), but not between roach (0.292 ± 0.17) and either bleak or Prussian carp. No significant species-level differences were observed in mean repeatability using TRU (bleak 0.508 ± 0.14, roach 0.482 ± 0.14, Prussian carp 0.493 ± 0.11). At level III (method), measurement repeatability improved from TRA (0.294 ± 0.16) through TRU (0.439 ± 0.13) and GMS (0.680 ± 0.20) to GMB (0.899 ± 0.10), with crosschecks between each method being significant (p < 0.05; Fig 2, level III).
Fig 2. Nine pairwise R values, obtained from the same measurer's data, were used for each box. Each box represents the 25% and 75% quartiles while the line in the box represents the median. The whiskers show the highest and lowest values within the dataset. In rows indicated by grey Roman numerals, the datasets were analysed at different levels, i.e. I (measurer; n = 9), II (species; n = 27) and III (method; n = 81). Groups with the same letter did not differ significantly (p < 0.05) using the Kruskal-Wallis test. Color codes correspond with Figs 1 and 3.
Measurement reproducibility ranged between 0.01 and 0.99, with 882 of 972 pairwise comparisons (90.7%) exhibiting significant correlations. All pairwise comparisons for GMB were significant for all three species investigated, while 239 of 243 (98.4%) were significant for GMS and 230 of 243 (94.7%) for TRU. TRA displayed the lowest number of significant pairwise comparisons, with 170 of 243 (70.0%) (S6 Table). At the species level (I), mean (± SD) reproducibility of GMB measurements on Prussian carp was significantly lower (0.774 ± 0.14) than that for roach and bleak (roach 0.935 ± 0.02, bleak 0.926 ± 0.05), with no significant difference between the latter two species (Fig 3).
A comparison of repeatability and reproducibility data indicated lower mean values for reproducibility, both at the species and method level, though significant differences were only observed for GMS in all three species investigated (Kruskal-Wallis test, p < 0.05).

Separative power and subjectivity
In almost all cases, CVA analysis indicated significant differentiation of the three study populations. For GMB, 26 out of 27 pairwise population comparisons showed significant isolation (Table 2), with all three populations differing from each other significantly in eight cases out of nine. For GMS, 21 of 27 comparisons were significantly different, and 23 of 27 comparisons for TRU, with all three populations differing significantly from each other in five cases each. For TRA, 21 pairwise comparisons were significantly isolated, and in six cases all populations were significantly isolated from each other. In just two cases, the populations showed no significant detachment (roach-GMS-M1 and Prussian carp-TRA-M2).
Analysis of subjectivity indicated strong differences between the methods studied. Using the GMB method, CVA scatter plots indicated that all three study populations were separated from each other in the same manner by all three measurers, and that this pattern was detected for all three species (Fig 4). Note, however, that the relative positions of the different measurers' group centroids were shifted slightly (ghosting) along the y and/or x axes in all cases. A similar effect was noted for both TRU and GMS, though the study populations were much less separated. This was especially true in the case of GMS, where differentiation was much weaker and the datasets overlapped much more than with GMB for all three species. In the case of TRA, a very different pattern was detected, the group centroids being aggregated according to measurer rather than sampling site (Fig 4).
Overall, therefore, CVA scatterplots indicated measurer impact on differentiation with all methods tested, though measurer role in separation was only important for all three species in TRA analysis. These findings were supported by the results of two-way PERMANOVA analysis, with both site and measurer having a significant effect on population differentiation in most cases, independent of method used (Table 3). In the case of GMB, higher F values were calculated for site for all three species, whereas only Prussian carp showed this pattern using GMS and TRU. Using GMS, the role of measurer was higher than that for sampling site in the differentiation of roach and bleak populations, while the effect of measurer was notably higher than site for all three species using TRA.

Discussion
Our results indicate that all four methods tested were able to detect morphometric differences between the different fish populations, despite the relatively narrow geographic scale. Nevertheless, the features examined showed considerable differences in some cases, with mean repeatability using GMB, for example, three times higher than that for TRA. Our results correspond with those of Parsons et al. [29], who showed that geometric morphometric methods had higher separative power, making them more applicable than traditional, distance-based methods. At the same time, all four methods were more-or-less burdened by different negative effects; hence, all the methods studied had weaknesses and strengths, affecting their applicability.
Fig 3. Each box presents 81 pairwise R values obtained from a comparison of datasets derived from the same subjects by different measurers. The box represents the 25% and 75% quartiles, with the line in the box representing the median. The whiskers show the highest and lowest values within the dataset. In rows indicated by grey Roman numerals, datasets were analysed at different levels, i.e. I (species; n = 81) and II (method; n = 243). Groups with the same letter did not differ significantly (p < 0.05) using the Kruskal-Wallis test. Color codes correspond with Figs 1 and 2. doi:10.1371/journal.pone.0157890.g003
In most cases, measurement repeatability did not differ between measurers. Hence, as long as measurements are carried out by competent analysts, all methods are equally usable as regards repeatability. At the species level, repeatability and reproducibility showed similar trends, though reproducibility was lower in each case. In roach and bleak, both repeatability and reproducibility decreased from GMB through GMS and TRU to TRA. For both species, GMB measurements showed > 90% repeatability, which corresponds well with the literature (e.g. [65]). There was no significant difference in repeatability using GMB and GMS for Prussian carp, presumably due to its more characteristic scale shape [66]. GMS therefore appears to be as applicable as GMB, as long as the species examined has a characteristic scale shape. Moreover, as Staszny et al. [67] discussed, body shape is more influenced by the conditional status of a fish than scale shape; hence, scale shape is less sensitive to short-term environmental effects (e.g. starvation). On the other hand, unlike the other three methods, GMS measurements showed significantly lower reproducibility (compared with its repeatability) for each species. In this case, therefore, it would appear important that a single measurer's dataset is used. When using TRU and GMB, however, datasets of different measurers may be combined if the methodology and other influencing factors are the same (e.g. for species-level differentiation or supraspecific taxonomic research using a large number of individuals).
Analysis of subjectivity indicated that all the morphometric methods were influenced by measurer effect to a greater or lesser degree. Even in the case of GMB, which is generally less burdened by measurer effect, the relative positions of the different measurers' group centroids were shifted along the y and/or x axes in all cases. At the same time, TRA was the only method where population-level detachment was entirely overwritten by measurer effect. Despite the lack of any significant difference between repeatability and reproducibility using TRA, therefore, analysis of subjectivity indicated that different measurers could have a crucial effect on analysis results; hence, we recommend that datasets from different measurers are not aggregated in this case. Very low levels of repeatability and reproducibility were detected for TRA in some cases (approximately 1%), possibly due to errors in measurement or during the data entry process (clerical errors). Direct data entry [68] (rather than transfer from paper to computer, as in our case) can reduce the number of data entry errors during fish morphometric surveys. TRA repeatability did not correspond absolutely with that of the other three methods tested, possibly because the fish were handled during measurement, whereas the other three methods used static images for marking and measurement. In this case, slight differences in the positioning of the fish between measurements may have affected measurement repeatability (for more details see [31]). The use of image analysis techniques instead of actual body measurement, therefore, clearly improved the applicability of distance-based methods.
Our findings (with distance-based TRA and TRU methods showing similar separative power to GMS) partially contradict those of Medebacher [26], who stated that "traditional morphometrics is often at its limit when closely related entities are analysed". On the other ... [72] shows that GMS separative power could be strengthened with the use of form (shape and size data) instead of shape alone.
Our results draw attention to the importance of measurer skill and expertise, especially when planning morphometric studies. Furthermore, when deciding on the morphometric method to be used, factors such as the selection of variables should be considered alongside the function of specific physical characteristics (e.g. scale shape) for each species examined. Our data showed that, whereas the same set of variables appeared appropriate for differentiating roach and bleak populations, they were less suitable for discerning Prussian carp populations (see Table 2). For best results, therefore, the most appropriate method and morphometric variables to be used will depend on the species studied.
Our study showed that all the morphometric methods tested are appropriate for detecting even population-level differences. Although the methods differed considerably in their sensitivity, separative power and subjectivity, the final results were strongly influenced by attributes of the species investigated and by the measurer's skill and expertise. The considerable impact of measurer effect on the results lends weight to the need for greater automation of morphometric analysis, including distance measurement and landmark processing, data standardisation and statistical analysis. This would also help reduce the level of measurement and data input errors. The methodologies of other disciplines (e.g. medicine and astronomy) that are already largely automated could prove useful in the automation of morphometric assessment [73,74,75], though further methodological studies are needed in order to identify the most appropriate methods (or combination of methods) and measurement variables for individual species and for the goals of individual studies.
Supporting Information
S1 Table. Raw dataset of the GMB analyses, for codes see text. (DOCX)