Towards a new osteometric method for sexing ancient cremated human remains. Analysis of Late Bronze Age and Iron Age samples from Italy with gendered grave goods

Sex estimation of human remains is one of the most important research steps for physical anthropologists and archaeologists dealing with funerary contexts and trying to reconstruct the demographic structure of ancient societies. However, it is well known that in the case of cremations sex assessment might be complicated by the destructive/transformative effect of the fire on bones. Osteometric standards built on unburned human remains and contemporary cremated series are often inadequate for the analysis of ancient cremations, and frequently result in a significant number of misclassifications. This work is an attempt to overcome the scarcity of methods that could be applied to pre- and protohistoric Italy and serve as a methodological comparison for other European contexts. A set of 24 anatomical traits was measured on 124 Bronze Age and Iron Age cremated individuals with clearly gendered grave goods. Assuming that gender largely correlates with sex, male and female distributions of each individual trait measured were compared to evaluate sexual dimorphism through inferential statistics and Chakraborty and Majumder's index. The discriminatory power of each variable was evaluated by cross-validation tests. Eight variables yielded an accuracy equal to or greater than 80%. Four of these variables also show a similar degree of precision for both sexes. The most diagnostic measurements are from radius, patella, mandible, talus, femur, first metatarsal, lunate and humerus. Overall, the degree of sexual dimorphism and the reliability of estimates obtained from our series are similar to those of a modern cremated sample recorded by Gonçalves and collaborators. Nevertheless, mean values of the male and female distributions in our case study are lower, and the application of the cut-off point calculated from the modern sample to our ancient individuals produces a considerable number of misclassifications.
This result confirms the need to build population-specific methods for sexing the cremated remains of ancient individuals.


Introduction
The practice of cremation emerged in Europe over an extended period, starting from at least the Mesolithic [1,2] and stabilizing in the Neolithic [3], but during the Copper Age, and even more so during the Bell Beaker period, there was a rapid acceleration in its uptake [4]. Cremation cemeteries appear in the Danubian-Carpathian basin and in the Central Mediterranean from the 3rd millennium BC, but the archaeological record shows relatively small funerary areas [5]. During the Bronze Age, the transition from inhumation to cremation permeates many areas of continental Europe [6,7]; by the end of the second millennium BC, large "urnfields", including hundreds (and sometimes thousands) of graves, widely representative of the living community, have become predominant [8][9][10][11][12].
In Northern and peninsular Italy, the transition from inhumation to cremation, with some rare exceptions, unfolds over roughly five centuries, from the Middle/Late Bronze Age to the Early Iron Age [8]. In general, the ritual starts from extremely austere practices that excluded most grave goods (particularly weapons) in order to hide the social status of the deceased (i.e. the gender, age, role or rank), and becomes progressively more elaborate [13]. From the final phases of the Bronze Age (ca. 1000 BC), cremation burials include a wide range of offerings and grave goods that emphasize the identity and status of the deceased [8,14-16]. Three major obstacles have long inhibited the socio-biological analysis of the urnfields: the overwhelming number of burials, which necessarily requires huge analytical efforts; the fragmentary nature of the human remains; and the ritual dissimulation manifest in most Bronze Age cremation burials, especially those prior to 1000 BC.
Given the difficulty of assessing sex from fragmentary, burnt skeletal remains, for which sexually dimorphic morphological features are often inaccessible, archaeologists often assign gender, which is a proxy of sex, based on grave goods.
Obviously, gender estimation based on grave goods may not correspond to sex, since the former is a social construct and the latter is a biological feature. Moreover, burials without grave goods frequently remain undetermined for sex, so that the social structures and dynamics of the related populations remain, to an extent, unclear.
Our aim is to test new, more objective and reproducible osteometric strategies to facilitate sex estimations in ancient human calcined remains.
Further incentive for the development of new sexing methods comes from recent advances in cremation studies, especially in relation to the determination of demography and human mobility through strontium isotope analysis [17,18]. This field of research urgently requires the creation of a more solid framework within which to explore sexually differentiated patterns.

Sex estimation of cremated remains
Heat-induced modifications (fragmentation, warping and dimensional changes) of bones strongly affect the applicability of sexing techniques (both morphological and metrical) developed from/for unburned samples. Few experimental studies on contemporary cremations have attained an acceptable degree of reliability (up to 88% of correct diagnoses) for morphological sex assessment based on skull and pelvis features [19]. However, the most informative traits for sexing skeletons can be lost or significantly modified by heat [20][21][22]. In archaeological practice, the rate of underrepresentation is usually high: in the Iron Age Pontecagnano-Colucci sample (N = 40), only 20 individuals presented at least one skull indicator and only seven had at least three; the mastoid was observable in eight cases, the glabella in three. For the pelvis, frequency was even lower, with only 10 individuals presenting at least one diagnostic element [23]. Since the morphoscopic methods require the observation of several skeletal features, their reliability is strongly reduced in cremation contexts.
Cremation affects not only morphologies but also has a significant impact on bone size, as registered by experimental studies, each with different outcomes [24][25][26][27][28][29][30][31][32][33]. Buikstra and Swegle [27] reported less than 6% shrinkage at temperatures higher than 800˚C, whereas most studies demonstrate that a variable reduction of up to 25% of the original bone size takes place on exposure to 700˚C and above, depending on the anatomical area [34].
This evidence prompts two important questions: first, whether the effects of heat can affect intra-sample sexual dimorphism; second, the degree to which the application of conventional metric standards to cremated remains is reliable. Most of the studies indicate that when all the bones are burned under the same conditions (such as duration of the process and temperature) the intra-sample sexual dimorphism is maintained [35][36][37][38][39][40][41][42][43][44]. Gonçalves and colleagues [32,33,45] observed significant levels of sexual dimorphism for several postcranial variables in a sample of modern cremated individuals from Portugal. Nevertheless, on the same sample, they recorded very low values of correct sex classifications when using the standard cut-off points recommended by Wasterlain and Cunha and by Silva for unburnt skeletons [46,47]. As might be expected, as a consequence of the dimensional changes (mostly referable to the shrinking effect), the misclassification of males exceeds that of females. Misclassifications range from 30.4% using the maximum length of the calcaneus, up to 77.3% using the transverse diameter of the femoral head. Furthermore, these strong differences between variables may be responsible for intra-individual inconsistency in the sex diagnosis. This phenomenon is mostly attributable to differential effects of the cremation process on various skeletal parts, likely linked to their relative position with respect to the heat source, their specific bone structure and anatomy, amount of soft tissues, and presence of personal items [48].
The first attempt to create specific standards for 126 cremated individuals dates to the experimental study of Gejvall [35], who analyzed the degree of sexual dimorphism of seven cranial and infracranial metric variables in a contemporary, cremated sample of known sex. While the approach has proved to be valuable, the Author himself expressed reservations about the extensive use of his data for sexing unknown individuals. Indeed, the inadequacy of Gejvall's values in relation to the Italian protohistoric series has been recently demonstrated [49].
Other scholars have investigated the potential of osteometry for sex determination of burnt skeletons. Van Vark [37] and Van Vark et al. [42] successfully tested the discriminatory power of several cranial and post-cranial features on a sample of 251 modern North European individuals. Other studies were performed on different samples, taking into account specific sets of variables, with different outcomes [40,43,50-52]. Promising results were recently obtained on modern Portuguese cremations [32,53]. The Authors developed a set of univariate osteometric standards (on humerus, femur, talus, and calcaneus), achieving successful results in the application of cut-off points and logistic regression equations [53].
The main aim of our own analysis was to test the potential of a large set of metric variables and evaluate their discriminatory power for sex estimation. Samples of cremated remains from five protohistoric Italian necropolises were considered, using the "known gender", as inferred from the archaeological record, as the proxy for sex. In the absence of census data and written sources for the period, this represents the only viable strategy for building population-specific reference distributions. The results provide a baseline for further analyses on new and old osteological collections.
We are aware of the distinction between sex and gender, whereby the former is defined as a universal biological category and the latter is a cultural/social construction that varies in time and space [54][55][56]. Although the classifying variable is therefore not totally independent and varies with context, it seems likely that, at least in this chronological and geographical framework, sexual identity will coincide with gender in the vast majority of the cases, as testified by a significant correlation between archaeological materials and osteological data, both for cremations and inhumations [57-59].

Materials
Osteometric data were collected for a total of 124 adult individuals from the Final Bronze to the Iron Ages (50 males and 74 females; Table 1; S1 Table), whose remains are held at the Service of Bioarchaeology at the Museo delle Civiltà (P. le Marconi 14, 00144, Rome), where documentation about burials and skeletal materials is also available. According to Italian legislation, no permits were required for the described study, which complied with all relevant regulations. The archaeological sites included in the study were: Narde di Frattesina (burial groups Narde 1 and Narde 2) [16,60], Chiavari [61], Narce [62], Castenaso, and Pontecagnano [63] (see Fig 1 and Table 1 for geographical and chronological details respectively). Narde di Frattesina represents the most ancient necropolis (12th-9th century BC), while Chiavari is the most recent (7th-6th century BC).
The individuals were selected using the following criteria: (1) burials with one single individual, to avoid misleading determinations; (2) only adult individuals (older than 20 years), with fully developed skeletons and all the epiphyses completely fused [64]; (3) bone chromatism ranging from white calcined, to gray, typical of a complete cremation (above 700˚C), to restrict the variability of dimensional changes [65,66]; (4) bones free from osteoarthritic alterations or other visible skeletal pathologies; (5) burials including a substantial number and/or quality of gender-specific shaped urns and/or grave goods (weapons and razors for men; spindle whorls, simple-arch or "leech" fibulas, faïence or glass beads for women), providing a strong indication of gender (Fig 2).

Methods
The osteometric analysis considered 24 variables (Table 2; Figs 3 and 4). Selection of variables was based on four criteria: (1) they are from skeletal elements that show a high rate of preservation in cremains; (2) they are characterized by easily detectable landmarks; (3) they show a good degree of sexual dimorphism in unburnt skeletons; (4) they were considered in previous studies.
Measurements were taken in mm by two independent observers using a digital caliper; the technical error of measurement (TEM) and the relative technical error of measurement (RTEM) were calculated. F and Bartlett's tests were used to evaluate the null hypothesis of no difference between the variances of the traits in males and females [69]. For each variable, mean and standard deviation were calculated and a t-test for independent samples was run to verify the statistical difference between the means for archaeological males and females. The estimation of sexual dimorphism was carried out using the approach of Chakraborty and Majumder [70], which calculates the area of non-overlap (D) of the two normal distributions, derived from the means and standard deviations by sex for each trait, and the cut-off point (x0) for sex estimation (S1 Appendix).
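The overlap-based quantities described above can be illustrated with a short Python sketch (standard library only, rather than the R environment used for the analyses). It rests on two assumptions that should be checked against S1 Appendix: D is taken as one minus the overlapping area of the two fitted normal curves, and x0 as the point where the two densities cross. The function names are illustrative and do not come from the original study.

```python
import math
from statistics import NormalDist


def non_overlap_d(mean_f, sd_f, mean_m, sd_m):
    """Assumed definition: 1 minus the overlapping area of the two normal curves."""
    females = NormalDist(mean_f, sd_f)
    males = NormalDist(mean_m, sd_m)
    return 1.0 - females.overlap(males)


def cutoff_point(mean_f, sd_f, mean_m, sd_m):
    """Point x0 where the two normal densities are equal.

    When the standard deviations differ there are two crossings; the one
    nearest to the midpoint of the means is returned (an assumption).
    """
    midpoint = (mean_f + mean_m) / 2
    if math.isclose(sd_f, sd_m):
        return midpoint  # equal spreads: the curves cross halfway between the means
    # Equating the two log-densities gives a*x^2 + b*x + c = 0.
    a = 1 / (2 * sd_f**2) - 1 / (2 * sd_m**2)
    b = mean_m / sd_m**2 - mean_f / sd_f**2
    c = (mean_f**2 / (2 * sd_f**2) - mean_m**2 / (2 * sd_m**2)
         + math.log(sd_f / sd_m))
    root = math.sqrt(b * b - 4 * a * c)
    candidates = ((-b + root) / (2 * a), (-b - root) / (2 * a))
    return min(candidates, key=lambda x: abs(x - midpoint))
```

With hypothetical values such as a female mean of 17.0 mm and a male mean of 20.0 mm, both with a standard deviation of 1.0 mm, `cutoff_point` returns the midpoint between the two means; with unequal spreads the cut-off shifts towards the mean of the less variable sex.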
A cross-validation approach was adopted in order to validate the discriminatory power of each trait [71]. Through this resampling method it was possible to estimate, in a consistent way, the accuracy and the precision of each metric trait in the sex assessment.
The cross-validation analysis was run for each trait by: (1) random selection of a training set of individuals, corresponding to around 70% of the whole sample, with both genders equally represented; the remaining 30% of the individuals formed the test set, to be classified; (2) calculation, for the training set, of the Chakraborty and Majumder D and the cut-off point for the anatomical trait; (3) sex classification of the test set, according to the cut-off point determined for the training set; (4) creation of the 2x2 confusion matrix as follows: the cell in row 1 and column 1 (CA) represents the number of archaeological females correctly classified as females; the cell in row 2 and column 2 (CD) represents the number of archaeological males who are correctly classified as males; the cell in row 1 and column 2 (CB) represents the number of archaeological females who are incorrectly classified as males; and the cell in row 2 and column 1 (CC) represents the number of archaeological males who are incorrectly classified as females; (5) calculation, from the confusion matrix, of the accuracy of sex determination as the sum of CA plus CD divided by the total number of individuals in the test set; (6) calculation, from the confusion matrix, of the precision of sex determination for females as CA divided by the number of females in the test set; (7) calculation, from the confusion matrix, of the precision of sex determination for males as CD divided by the number of males in the test set; (8) repetition of steps 1 to 7 1000 times, and calculation of the mean and the standard deviation for accuracy, precisions and cut-off points.
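The resampling steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the R/caret code used for the study: the split takes 70% of each sex separately (one reading of "both genders equally represented"), the cut-off is found numerically as the crossing of the two fitted normal densities between the training means, and values at or above the cut-off are classified as male. All function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def density_cutoff(females, males, iters=60):
    """Bisection for the crossing of two fitted normal densities between their means."""
    lo, hi = females.mean, males.mean

    def diff(x):
        return females.pdf(x) - males.pdf(x)

    d_lo = diff(lo)
    if (d_lo > 0) == (diff(hi) > 0):
        return (lo + hi) / 2  # no crossing between the means: fall back to the midpoint
    for _ in range(iters):
        mid = (lo + hi) / 2
        d_mid = diff(mid)
        if (d_lo > 0) == (d_mid > 0):
            lo, d_lo = mid, d_mid
        else:
            hi = mid
    return (lo + hi) / 2


def cross_validate(females, males, reps=1000, train_frac=0.7, seed=0):
    """Repeat the train/test split and summarise accuracy, precisions and cut-off."""
    rng = random.Random(seed)
    out = {"accuracy": [], "precision_females": [], "precision_males": [], "cutoff": []}
    for _ in range(reps):
        f, m = females[:], males[:]
        rng.shuffle(f)
        rng.shuffle(m)
        kf, km = round(train_frac * len(f)), round(train_frac * len(m))
        x0 = density_cutoff(NormalDist.from_samples(f[:kf]),
                            NormalDist.from_samples(m[:km]))
        test_f, test_m = f[kf:], m[km:]
        ca = sum(x < x0 for x in test_f)    # females classified as females (CA)
        cd = sum(x >= x0 for x in test_m)   # males classified as males (CD)
        out["accuracy"].append((ca + cd) / (len(test_f) + len(test_m)))
        out["precision_females"].append(ca / len(test_f))
        out["precision_males"].append(cd / len(test_m))
        out["cutoff"].append(x0)
    # Mean and standard deviation over all repetitions, as in step 8.
    return {name: (mean(values), stdev(values)) for name, values in out.items()}
```

Calling `cross_validate(female_values, male_values)` on the measurements of one trait returns a (mean, standard deviation) pair for each quantity. The sketch assumes that each sex contributes at least two training values and at least one test value, and that the training values within each sex are not all identical.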
The analyses were performed with the R language and environment for statistical computing version 3.4.2 [72] and the caret package [73] for the test/training partitions of the dataset.

Results
Results for the inter-observer error estimate are reported in Table 2. All variables, with three exceptions, show an RTEM (relative index of inter-observer differences) below the acceptance threshold of 0.05. The three variables showing an unacceptable level of error are the height of the dens of the axis (AX-D-H), the humeral head transverse diameter (HU-H-TD), and the mandible condyle thickness (MAN-C-TH). These were therefore excluded from the subsequent analyses. Table 3 presents the descriptive statistics for each metric variable, together with the results of the Chakraborty and Majumder test (D-value [70]), the cut-off points and the t-test.
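For readers unfamiliar with the inter-observer indices, the following Python sketch shows the conventional two-observer formulation, in which TEM is the square root of the summed squared differences divided by twice the number of measured specimens, and RTEM is TEM divided by the grand mean. Expressing RTEM as a fraction (so that it can be compared with the 0.05 threshold) is an assumption here, as is the supposition that the study used this standard formulation; the function names are illustrative.

```python
import math


def tem(observer_1, observer_2):
    """Technical error of measurement for paired readings by two observers."""
    if len(observer_1) != len(observer_2) or not observer_1:
        raise ValueError("both observers must measure the same, non-empty set")
    squared_diffs = sum((a - b) ** 2 for a, b in zip(observer_1, observer_2))
    return math.sqrt(squared_diffs / (2 * len(observer_1)))


def rtem(observer_1, observer_2):
    """TEM relative to the grand mean of all readings, as a fraction."""
    grand_mean = (sum(observer_1) + sum(observer_2)) / (2 * len(observer_1))
    return tem(observer_1, observer_2) / grand_mean
```

A trait would then be retained when `rtem(...)` falls below 0.05 and excluded otherwise.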
Sample size is highly variable across the metrical traits, with differences between sexes: the average number of observations is 15 for males and 19.4 for females.
Results of the F and Bartlett's tests show that for all the variables the differences between the male and female variance are not statistically significant (p>0.05).
Student's t-test displays high significance for all traits (p<0.05), except the dens transverse diameter of axis and head-neck length of talus, with a p-value of 0.24 and 0.40 respectively.
Table 4 reports the accuracy and precision for the traits as derived from the cross-validation analysis. Eight variables out of 21 show an accuracy (concordance between estimated skeletal sex and archaeological gender) that exceeds or equals 80%, so that, using the calculated cut-off points, a mean of 8 out of 10 individuals is expected to be accurately classified as male or female. These variables are: radius head diameter (88.3% of accurate determinations), patella maximum width (86.0%), mandibular condyle width (83.5%), talus trochlea length (83.2%), femoral head vertical diameter (81.3%), dorso-plantar width of the head of the first metatarsal (80.6%), lunate length (80.2%), humeral head vertical diameter (80.0%). Radius head diameter is certainly the most dimorphic trait, as the potential of misclassification is below 12% for both sexes with a cut-off point of 18.32 mm. A further nine variables show an acceptable percentage of accuracy (higher than 70%), while the remaining four traits are unreliable (values between 48.5% and 69.2%). Some measurements taken on the same bone show distinct levels of discriminatory power. Patella width and thickness seem to be reliable discriminants (PA-MXTH accuracy = 74.0% and PA-MXW accuracy = 86.0%), while height is less so (PA-MXH accuracy = 69.2%); the dens anteroposterior diameter of axis shows a higher accuracy (67.8%) than the transverse diameter (56.5%).
The precision in sex estimation for each single variable is generally greater for females (14 traits) than for males (7 traits).
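As an illustration of how a univariate cut-off is applied in practice, the Python sketch below classifies radius head diameters against the 18.32 mm cut-off reported above and computes the concordance with archaeological gender. The measurements in the usage note are hypothetical, and assigning values exactly equal to the cut-off to the male group is an assumption; the text does not specify a tie rule.

```python
RADIUS_HEAD_CUTOFF_MM = 18.32  # cut-off reported in the text for radius head diameter


def classify_by_cutoff(value_mm, cutoff_mm=RADIUS_HEAD_CUTOFF_MM):
    """Values at or above the cut-off are read as male, smaller ones as female."""
    return "male" if value_mm >= cutoff_mm else "female"


def concordance(records, cutoff_mm=RADIUS_HEAD_CUTOFF_MM):
    """Share of (value_mm, archaeological_gender) pairs where the two agree."""
    matches = sum(classify_by_cutoff(value, cutoff_mm) == gender
                  for value, gender in records)
    return matches / len(records)
```

For a hypothetical set such as `[(19.4, "male"), (17.1, "female"), (18.0, "male")]`, the third individual is classified as female by the metric criterion although the grave goods indicate a man, which lowers the concordance and flags the burial for closer inspection.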

Discussion
The aim of this study was to test the applicability of univariate metric techniques for sex diagnosis of cremated individuals. From the initial set of 24 variables selected, three were subsequently excluded from the analysis due to an unacceptable level of error in the TEM calculation. Overall, the results demonstrated that ancient calcined bones can preserve a good degree of sexual dimorphism that is not biased (or only minimally) by heat-induced expansion or shrinkage, as already reported by previous studies [35-42,44,53].
Eight of the 21 analyzed variables showed a degree of accuracy in the sex assessment that was equal to or greater than 80%, a value generally considered a benchmark for evaluating the utility of a determination method [74,75]. The most discriminatory measurements are located on the radius, patella, mandible, talus, femur, first metatarsal, lunate and humerus. Radius head diameter is the most dimorphic trait, as the potential of correct classification, with a cut-off point of 18.32 mm, is 88.1% for the males, 88.5% for the females and 88.3% for pooled sexes.
When comparing the discriminating power of the present analysis with results offered for the same measurements on unburned modern series of known sex, the estimates are broadly of the same order (Table 5) with few exceptions. Of the 13 variables compared, 10 present an accuracy level very close to or even exceeding those reported by other studies. This is the case for the radius head. A study by Barrier and L'Abbé [76] on a reference collection of 400 unburned individuals of known sex obtained an accuracy of 80.7%. In the study by Berrizbeitia [77], for the same measurement, sex is correctly identified for 83% of the sample, but this method included a 3 mm non-diagnosis interval (from 21 to 23 mm). The patella maximum width (the second most dimorphic variable in our series) yields 86.0% of correct diagnoses, a value far exceeding other estimates [78][79][80]. Three variables perform worse than the unburnt comparative samples: the humeral head vertical diameter, the lunate maximum width and the talus head-neck length.
For comparison with cremated samples, we must currently rely on only the study of Gonçalves et al. [53] on contemporary Portuguese cremated individuals and the study of Van Vark on a contemporary Swedish sample [38].
The three variables used by Gonçalves (and by our own study) are the vertical head diameters of humerus and femur and the maximum length of the talus. As shown in Table 6, the Late Bronze Age/Iron Age Italian series regularly show lower mean values for both sexes; the differences range from -0.64 mm (maximum length of the talus in the females) to -3.14 mm (humerus vertical head diameter in males), plausibly a consequence of a difference in body mass between diverse populations. An even greater difference exists with the two traits measured by Van Vark. The indexes of sexual dimorphism (D values) and the reliability of sexual estimate are slightly lower for the protohistoric sample. Nevertheless, the application of the modern series cut-off points (recalculated through Chakraborty and Majumder's method) to our ancient samples (Table 7) clearly yields lower percentages of correct classifications for the humerus head diameter (50% of misclassification in the males using Gonçalves et al.'s method and even 90% with Van Vark's) and for the femur head diameter (30% of misclassification in the males using Gonçalves et al.'s method and 60% with Van Vark's). By contrast, the 33.3% level of incorrect diagnosis for the talus applying Gonçalves et al.'s method is very close to the result obtained by the cross-validation analysis of the ancient sample (see also Table 6). Overall, the results suggest caution in applying standards based on contemporary specimens (whose value is nonetheless unquestioned for forensic cases and relatively modern populations) to chronologically and geographically distinct samples, and reinforce the need to build population-specific standards.
This study contributes to the debate on the extraction of demographic profiles from cremated human remains [81][82][83]. The unpredictable and extremely variable number and nature of observable traits in the cremains has been seen as a limitation in standardizing analytical procedures, minimizing the reproducibility and comparability between different contexts and researchers. Indeed, the majority of anthropological contributions to archaeological research are insufficiently detailed to provide a clear understanding of the methods applied in sex assessment (see for instance [84]), and are frequently relegated to a brief "osteological appendix". In this respect, Gonçalves and Pires [85] conducted a survey on the consistency of approaches and methodologies used among researchers in the analysis of cremation contexts. On a sample of 84 published papers, 95% reported individual sex assessments, but fewer than 30% had applied metric methods. The Authors claim that osteometry on cremains is still fundamentally mistrusted by researchers, given the fact that metrical criteria specifically developed from/for cremated remains are scant [35,42,43,53]. Our study has shown that the osteometric approach is indeed feasible and should be further investigated and applied in archaeological contexts. Moreover, this method can be used on very fragmented individuals, when morphologies are not easily readable. Its application can yield better results when considered together with other sex indicators, such as the morphological traits of skull and pelvis. The initial assumption that gender is the only viable proxy for sex (necessarily unknown in prehistoric cremated populations) should not be seen as a limitation but, instead, as an opportunity.
From a purely interdisciplinary perspective, this method (or any further development of it), would enable the detection of outliers, namely unusually robust individuals with typical feminine grave good assemblages or, vice-versa, unusually gracile individuals archaeologically characterized as men. For these individuals estimated sex and gender might not coincide and may need further discussion. Without any common, replicable metrical base, researchers may abandon ambitions to explore the relationships between sex and gender and the socio-demographic dynamics of Bronze Age and Iron Age populations.
Although our cremated collection comprises different populations, both in terms of chronology and geography, this diversity does not seem to negatively affect our results, and the good degree of sexual dimorphism encourages further development of the osteometric technique. We might assume that if larger homogeneous samples were available, results could have been even more significant.

Conclusions
Gonçalves and Pires assert that, "if bioarchaeologists hope to approach broad crosscultural themes and simultaneously understand the chronological and geographical diversity of cremation-related funerary practices [. . .], they need to rethink and standardize their procedures" [85]. In this vein, the present study begins the task of building new osteometric methods for the sex estimation of ancient human cremated remains, using 124 Late Bronze Age and Iron Age adult individuals from Italy, whose gender is clearly indicated by grave goods.
Our results demonstrate that the most dimorphic traits are located on the epiphysis of the long bones, carpal and tarsal bones, the first metatarsal, patella, and mandibular condyle, and that the accuracy of diagnoses is broadly similar to those obtained on unburnt series of known sex, with few exceptions. The mean values and cut-off points are significantly lower than those for modern cremated samples. Nevertheless, indexes of sexual dimorphism show the same degree of male/female dimensional difference, making us more confident about estimating the sex from human cremated remains. As long as sample sizes allow it, the use of population-specific metric references appears to be a useful procedure to provide more objective and reproducible sex attributions, including in cases where grave goods are completely absent. This approach cannot be used in small samples though. For such cases, data obtained by this research may eventually be used as reference. The next step is to validate these references on analyses involving other pre- and protohistoric European skeletal series. In the future, we intend to enlarge our sample size and the number of metric traits (especially on carpals and tarsals), and to develop multivariable approaches that could reinforce the accuracy of sex estimation and clarify the relationship between sex and gender on a more objective basis.

Supporting information
S1 Table. List of burials and measurements. Cells marked in blue indicate that the trait is "masculine" compared to the cut-off point (x0); cells marked in pink indicate that the trait is "feminine" compared to the cut-off point.