## Abstract

The lifestyle of spinosaurid dinosaurs has been a topic of lively debate ever since the unveiling of important new skeletal parts for *Spinosaurus aegyptiacus* in 2014 and 2020. Disparate lifestyles for this taxon have been proposed in the literature; some have argued that it was semiaquatic to varying degrees, hunting fish from the margins of water bodies, or perhaps while wading or swimming on the surface; others suggest that it was a fully aquatic underwater pursuit predator. The various proposals are based on equally disparate lines of evidence. A recent study by Fabbri and coworkers sought to resolve this matter by applying the statistical method of phylogenetic flexible discriminant analysis to femur and rib bone diameters and a bone microanatomy metric called global bone compactness. From their statistical analyses of datasets based on a wide range of extant and extinct taxa, they concluded that two spinosaurid dinosaurs (*S*. *aegyptiacus*, *Baryonyx walkeri*) were fully submerged “subaqueous foragers,” whereas a third spinosaurid (*Suchomimus tenerensis*) remained a terrestrial predator. We performed a thorough reexamination of the datasets, analyses, and methodological assumptions on which those conclusions were based, which reveals substantial problems in each of these areas. In the datasets of exemplar taxa, we found unsupported categorization of taxon lifestyle, inconsistent inclusion and exclusion of taxa, and inappropriate choice of taxa and independent variables. We also explored the effects of uncontrolled sources of variation in estimates of bone compactness that arise from biological factors and measurement error. We found that the ability to draw quantitative conclusions is limited when taxa are represented by single data points with potentially large intrinsic variability. 
The results of our analysis of the statistical method show that it has low accuracy when applied to these datasets and that the data distributions do not meet fundamental assumptions of the method. These findings not only invalidate the conclusions of the particular analysis of Fabbri *et al*. but also have important implications for future quantitative uses of bone compactness and discriminant analysis in paleontology.

**Citation:** Myhrvold NP, Baumgart SL, Vidal D, Fish FE, Henderson DM, Saitta ET, et al. (2024) Diving dinosaurs? Caveats on the use of bone compactness and pFDA for inferring lifestyle. PLoS ONE 19(3): e0298957. https://doi.org/10.1371/journal.pone.0298957

**Editor:** Jun Liu, Chinese Academy of Sciences, CHINA

**Received:** June 18, 2023; **Accepted:** January 31, 2024; **Published:** March 6, 2024

**Copyright:** © 2024 Myhrvold et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** All relevant data and code not published in the manuscript and its Supporting Information files have been deposited in this GitHub repository, which is cited in the manuscript: https://github.com/intvenlab/Diving-dinosaurs.

**Funding:** The author(s) received no specific funding for this work.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

### Spinosaurids discovery

Spinosaurids are Cretaceous-era theropods known for their enormous size, their long, narrow skulls, and the dorsal sails that exemplify *Spinosaurus* and some other genera. When Stromer first described *Spinosaurus aegyptiacus* in 1915 from Upper Cretaceous outcrops in Egypt’s Western Desert, he highlighted the spaced, conical teeth and elongate jaws as crocodile-like adaptations for a piscivorous diet [1]. Similar dietary inferences were made some 70 years later in initial descriptions of two older, closely related spinosaurids, *Baryonyx walkeri* [2] and *Suchomimus tenerensis* [3], from Lower Cretaceous outcrops in England and Niger, respectively. Although in recent years remains of other spinosaurids have been discovered, they do not impact the arguments we address regarding the three aforementioned well-known spinosaurids. The following overview of *Spinosaurus* lifestyle inference is not intended to be a complete or thorough review of all arguments or scientific contributions to the topic but instead focuses on key points relevant to the current study.

### Spinosaurus lifestyle inference

In describing *Baryonyx* in more detail, Charig and Milner outlined what may be termed a shallow-water opportunist lifestyle [4]. Although they considered fish an important dietary component, they found no skeletal features plausibly related to function in water that would warrant semiaquatic status:

On balance, we still envisage *Baryonyx* as mainly a fish-eater. It probably crouched on the banks of lakes, creeks and rivers or waded in the shallows (Frontispiece), and it secured its prey by direct seizure with the jaws and perhaps also by ‘gaffing’. Small fishes would have been swallowed whole, larger ones broken up by the powerful fore-limbs with their huge claws. Fishing, however, was not the only source of food…. If we accept that fish formed a significant part of the diet of *Baryonyx*, then we must consider the possibility that the animal led an aquatic or semi-aquatic existence. Nevertheless, its anatomy gives no indication of any modifications towards that mode of life.

In 2014, Ibrahim *et al*. introduced the notion of a “semiaquatic” *Spinosaurus aegyptiacus*, with a lifestyle tied more closely to the water’s edge, on the basis of a partial skeleton from Upper Cretaceous rocks in Morocco [5]:

We describe adaptations for a semiaquatic lifestyle in the dinosaur *Spinosaurus aegyptiacus*…. These adaptations suggest that *Spinosaurus* was primarily a piscivore, subsisting on sharks, sawfish, coelacanths, lungfish, and actinopterygians that were common in the Kem Kem river system.

These authors noted the downsized, retracted external nares and several unusual postcranial features, which they viewed as enhancing predation while wading and surface swimming using “foot-powered paddling” and “lateral undulation of the tail.” These features, which included reduced pelvic girdle and hind limbs, solid long bones, long pedal digit I, flattened pedal unguals, and reduced caudal articulations, would have limited terrestrial agility. These authors asserted that *Spinosaurus* “must have been an obligate quadruped on land,” based on their calculation of the location of the center of mass anterior to the hips. However, that calculation (made by PCS, one of the current authors) has since been recognized to have erroneously shifted the center of mass forward from the hip by the addition, rather than the subtraction, of the estimated volume of internal air space.

The suite of postcranial features in *Spinosaurus* alluded to above nonetheless clearly distinguish it from the baryonychines *Baryonyx* and *Suchomimus* and from other terrestrial nonavian theropods [5,6]. Ecologically, *Spinosaurus* was envisioned as a semiaquatic piscivorous predator frequenting both land and water, capable of both walking and surface swimming, based on anatomical features in functional analogy to crocodiles, shore birds, and semiaquatic mammals, with the dorsal sail functioning as a “display structure that would have remained visible while swimming” [5].

This more complete view of the skeleton renewed interest in *Spinosaurus*, prompting a series of papers on its lifestyle and functional capacities in water. In 2016, Gimsa *et al*. argued that the dorsal sail may have played an important role in fully submerged underwater swimming and active pursuit of prey [7]. In 2017, Hone and Holtz summarized various viewpoints on spinosaurid diet, function, and habitat preference, arguing for a predominantly piscivorous diet and semiaquatic lifestyle while not ruling out scavenging on land or the use of their forelimbs to “dig for buried” prey [6].

In 2018, Henderson created 3-D models of *Suchomimus* and *Spinosaurus* to examine both terrestrial locomotion and buoyancy [8]. For *Spinosaurus* he estimated that the terrestrial center of mass would have been located in the hip region over the hind limbs, suggesting a bipedal stance similar to *Suchomimus*. With respect to swimming, Henderson’s calculations led him to conclude that *Spinosaurus* could float but was laterally unstable and tended to tip over. Buoyancy from air sacs within the axial column rendered *Spinosaurus* “unsinkable”: the force necessary to submerge the body fully in a dive would be more than the animal could reasonably generate. Henderson concluded that these factors make the swimming locomotion proposed in 2014 by Ibrahim *et al*. [5] implausible. He instead proposed that *Spinosaurus* more likely followed the model put forth by Charig and Milner for *Baryonyx*, summarized in the quote above. In particular, Henderson proposed that *Spinosaurus* could “procure aquatic prey without having to become fully immersed” [8], akin to “gaffing” with forelimb claws as proposed by Charig and Milner [4], in analogy to the ambush predation of fish by grizzly bears [8].

In 2019, Arden *et al*. [9] proposed that spinosaurids were “highly specialized semiaquatic predators” on the basis of cranial features, in particular the position of external nares and orbits. Narial retraction in spinosaurids, however, provides no clear association with an open water aquatic lifestyle, when examined in the light of comparative cranial measurements [10]. Orbital position in spinosaurids, likewise, is similar to that seen in terrestrial nonavian theropods [10].

Then in 2020, the discovery of the high-spined tail of the Moroccan skeleton inspired the aquatic pursuit predator hypothesis, described by Ibrahim *et al*. [11]:

Contrary to recent suggestions^{10} that *Spinosaurus* was confined to wading and the apprehension of prey from around the edges of bodies of water, the morphology and function of its tail—along with its other adaptations for life in water^{7}—point to *Spinosaurus* having been an active and highly specialized aquatic predator that pursued and caught its prey in the water column (S7 Fig).

Here Ibrahim *et al*. cite Henderson’s 2018 paper ([8], their ref. [10]), which they reject without directly challenging its methodology or results. Instead, Ibrahim *et al*. rested their conclusions entirely on (1) qualitative anatomical analysis of the new tail specimen and (2) a series of experiments in which 2-D plastic models of several different tail shapes were moved robotically in a water tank. In figures and video of the experiments, the *Spinosaurus* tail-shape model is shown submerged with its sagittal plane vertical and long axis parallel to the water surface and tank bottom [11, Fig 3A, Supplementary Information]. Although Ibrahim *et al*. did not directly specify swimming depth or diving behavior, their experimental tail model was submerged to a depth equal to or greater than half the length of the tail, which they reconstructed with a length of approximately 10 m. *Spinosaurus* thus would have been swimming with its center line at least 5 m below the surface—deep enough for full submergence, including its dorsal sail. In their S7 Fig, they depicted a “swimming pose” of *Spinosaurus* inclined upward at approximately 45 degrees, as if it were swimming toward the surface after a deep dive “in the water column,” as they described in the above quotation. *Spinosaurus*, in their view, was not limited to surface swimming, which they never modeled. Although they never used the words “dive” or “diving” in the paper, there is little other means to fully submerge and pursue prey in the water column.

Ibrahim *et al*. distinguished their 2020 findings [11], which emphasized active pursuit predation in the water column, from the earlier studies [5,6, and others] that proposed what they termed a “partially aquatic, piscivorous mode of life” [11]. In their view, the fully aquatic pursuit predator hypothesis ranked as novel, a hypothesis we agree is distinctively aquatic in interpretation. More recently, Gimsa and Gimsa interpreted their small-scale model results similarly, as supporting their previous hypothesis that “*Spinosaurus* was a capable swimmer with the dorsal sail serving hydrodynamic purposes during submerged swimming” [12]. The fully aquatic pursuit predator hypothesis, nonetheless, was challenged in 2021 by Hone and Holtz on the basis of a range of qualitative comparisons and a quantitative comparison of overall skull shape [13]. They suggested that drag would have limited the swimming speed of *Spinosaurus* at the surface or underwater, concluding that the fully aquatic pursuit predator hypothesis is unlikely for a number of reasons [13]:

As a putative aquatic pursuit predator, *Spinosaurus* has issues with instability in water, high drag, the position of the eyes and nostrils, low swimming efficiency, strong neck ventriflexion, and isotopic signatures showing extended periods in terrestrial conditions and feeding on terrestrial animals, and there remain questions about its ability to swim and submerge effectively as a whole.

Their conclusion regarding *Spinosaurus* lifestyle was this [13]:

*Spinosaurus* is therefore best interpreted as shoreline generalist based on the available information. Capable of capturing both aquatic and terrestrial prey, and perhaps an opportunistic scavenger, adult *Spinosaurus* likely took aquatic prey by standing in shallow water or at the margins of water bodies.

That description of *Spinosaurus* echoed that of Charig and Milner regarding the lifestyle of *Baryonyx* quoted above [4]. Indeed, the “generalist” designation might apply equally to many large theropods. Finally, we note here that the terms “shore” and “coast” (or “shoreline” and “coastline”) usually connote land adjacent to an ocean or sea, whereas we do know from recent finds that *Spinosaurus* roamed far inland [14].

In 2022, Sereno and a group of coauthors (including most authors of the present study) published a study that began with accurate 3-D skeletal models of both *Spinosaurus* and *Suchomimus*, based on all available fossil materials [14]. After building a flesh model of *Spinosaurus* over its skeletal model, with body parts adjusted to estimated densities, they performed biomechanical tests for the various proposed functional hypotheses. The center of mass in their flesh model was found to be located above the acetabulum, supporting bipedal stance and bipedal locomotion on land.

Drag experienced during swimming was calculated using an estimate of body surface area. The surface area of the tail, feet, and hands was analyzed to estimate quantitatively the maximum propulsive thrust that *Spinosaurus* could generate, which was found to be quite modest compared to drag. They concluded that if *Spinosaurus* swam, it would have done so very slowly, achieving a maximum velocity of ~0.8 m/s in surface swimming and ~1.4 m/s if swimming at a depth of 10 m or more to avoid wave drag. They contrasted these with typical velocities of extant pursuit predators such as dolphins and orcas, which range from 10 to 33 m/s, and concluded that *Spinosaurus* was far too slow a swimmer to have relied primarily on pursuit predation of fish.

Sereno *et al*. also performed a stability analysis, which showed that *Spinosaurus* could wade if supported by its feet [14]. However, if it waded deep enough that it started to float (>2.6 m), torque of the dorsal sail would cause it to tip sideways, leaving it floating on its side with the waterline roughly parallel to the dorsoventral plane (Fig 3B of [14]). Righting itself would have been impossible, due to the severe limitations on transverse thrust generated by its limbs and tail, their inadequate propulsive force, and their location along the body. They concluded that *Spinosaurus* was at best ineffective as a surface swimmer (free of the substrate), but more likely could not swim at all.

With regard to diving, Sereno *et al*. [14] replicated the “unsinkable” finding of Henderson [8]. Their buoyancy analysis showed that *Spinosaurus* could not generate the thrust needed to counter buoyant forces and fully submerge its body; the estimated thrust from the limbs and tail was too small by a factor of 15 to 25, depending on the buoyancy model used. Nor could *Spinosaurus* remain submerged, even if it were positioned underwater. They concluded that *Spinosaurus* could not fully submerge to accomplish a dive.

These findings provide direct quantitative refutation of the aquatic pursuit predator hypothesis that describes active swimming, diving, and pursuit predation “in the water column.” They do not falsify arguments for predation while wading into water over 2 m in depth or hunting for, or scavenging, terrestrial prey. Sereno *et al*. described the lifestyle of *Spinosaurus* as a “semiaquatic bipedal ambush piscivore that frequented the margins of coastal and inland waterways” [14]. This description of *Spinosaurus* ecology overlaps with many, although perhaps not all, of the conclusions of previous studies [8,13].

### Bone compactness statistics as lifestyle arbiter

In 2022, Fabbri *et al*. used statistical analysis of bone compactness to classify spinosaurids and other dinosaurs with respect to underwater foraging habits [15]. They paired global bone compactness (*Cg*), defined as cross-sectional area covered by bone divided by total cross-sectional area, and maximum bone diameter (*MD*) with a relatively new statistical method called phylogenetic flexible discriminant analysis (pFDA). This method is described below in the statistical method implementations section of Materials and methods.

The goal was to classify carnivorous dinosaur taxa as either “subaqueous foragers” or not. Bypassing detailed studies of anatomy and biomechanics, their method offered the tantalizing possibility that a broad database including many taxa, each represented by a single (*MD*, *Cg*) datapoint, could yield statistical evidence that would directly reveal where dinosaurs foraged. From this analysis, they concluded that spinosaurids were “aquatic specialists” but with “surprising ecological disparity.” *Spinosaurus* and *Baryonyx*, they argued, made regular use of “subaqueous foraging” with “fully submerged behavior,” whereas *Suchomimus*, a close relative of *Baryonyx*, was a nondiving terrestrial predator restricted to wading in the shallows [15: 852].

The importance of this study is twofold. First, they outlined what appeared to be definitive evidence and a novel approach to determine lifestyle or habitat questions based on bone cross sections, with implications for interpreting the fossil record. Second, if this approach could successfully resolve such a thorny issue, then perhaps it, or approaches inspired by it, could be applied to other lifestyle or habitat questions in the fossil record.

### Study goals

The purpose of the present study is to reexamine the datasets and analytical techniques employed by Fabbri *et al*. to elucidate the foraging habits of spinosaurids, with an aim toward testing the validity for lifestyle inference of bone microanatomy metrics, such as *Cg*, and the use of discriminant analysis, in particular pFDA, as an appropriate statistical method for such inference.

Fabbri *et al*. compiled datasets of (*MD*, *Cg*) points from femoral and rib cross sections representing exemplar taxa from many disparate clades of reptiles, mammals, and birds. They manually coded each taxon with two lifestyle attributes, *F* (for flying ability) and *D* (for diving), using a three-value scale: 0, absent; 1, rarer; 2, habitual (see Materials and methods for details). The attribute combinations were then used to divide the taxa into functional groups; for example, *F0D0* taxa are terrestrial, whereas *F0D2* taxa are nonflying divers. In Fig 1, convex hull polygons plot the extent of the datapoints for each group.

Fig 1. Femoral (A, C) and rib (B, D) plots of maximum bone diameter (*MD*) versus bone compactness (*Cg*). (A, B) Convex hull polygons colored by functional group, as defined by Fabbri *et al*. Groups with four or fewer datapoints not shown. (C, D) Points and corresponding convex hull polygons for terrestrial groups (*F0D0*) and groups that include nonflying divers (*F0D2*). Abbreviations: 0, absent; 1, rarer; 2, habitual; Ba, *Baryonyx*; D, diving; F, flying; Sp, *Spinosaurus*; Su, *Suchomimus*; u, unknown.

The femoral data for all classes show extensive overlap among the polygons defined by the (*MD*, *Cg*) points (Fig 1A). The flying classes (*F1D0*, *F2D0*, *F2D1*, *F2D2*) all overlap each other, and most also overlap the *F0D0* class of terrestrial animals that cannot fly and seldom if ever dive. The rib data, though visually different, exhibit no less overlap (Fig 1B).

To classify the spinosaurid dinosaurs, the most relevant comparison is group *F0D0* to *F0D2*, *F0D2* being the nonflying taxa that are habitual “subaqueous foragers,” in the terminology of Fabbri *et al*. [15]. These are shown in Fig 1C and 1D. The pFDA statistical method employed in their study seeks a straight line, called the decision boundary, that cleanly separates datapoints by class. In this case, that means finding a straight line that has the blue points on one side and the brown points on the other side in Fig 1C and 1D. As these plots clearly demonstrate, this is not possible—the blue and brown points are too intermingled. Any line that one did draw would misclassify many of the known taxa by putting them on the wrong side of the line. A method that cannot accurately classify known taxa is suspect when applied to classifying unknown taxa such as the spinosaurids.
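The separability argument can be illustrated with a minimal Python sketch. It is not pFDA and does not use the data of Fabbri *et al*.; it is a brute-force search, over a grid of line orientations and every threshold along each orientation, for the straight-line boundary that misclassifies the fewest points in a two-class, two-dimensional dataset. The function name `min_linear_errors` and both toy datasets are hypothetical and serve only to show that intermingled classes force errors no matter where the line is drawn.

```python
import math

def min_linear_errors(points, labels, n_angles=180):
    """Fewest misclassifications achievable by a straight-line boundary.

    points: list of (x, y) pairs; labels: list of 0/1 class labels.
    Searches n_angles line orientations and every threshold between
    consecutive projected points (a grid search, not an exact optimum).
    """
    n = len(points)
    total_ones = sum(labels)
    best = n
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        nx, ny = math.cos(theta), math.sin(theta)  # unit normal of the line
        proj = sorted((x * nx + y * ny, lab)
                      for (x, y), lab in zip(points, labels))
        ones_left = 0  # class-1 points among the first i projected points
        for i in range(n + 1):
            # A split is realizable only if it does not separate tied values.
            realizable = i == 0 or i == n or proj[i - 1][0] < proj[i][0]
            if realizable:
                # Orientation A: left side -> class 0, right side -> class 1.
                err_a = ones_left + (n - i) - (total_ones - ones_left)
                # Orientation B is the mirror image: it errs where A is right.
                best = min(best, err_a, n - err_a)
            if i < n:
                ones_left += proj[i][1]
    return best

# Hypothetical toy data: two well-separated clusters.
separable_pts = [(0, 0), (1, 0), (0, 1), (3, 3), (4, 3), (3, 4)]
separable_lab = [0, 0, 0, 1, 1, 1]

# Hypothetical toy data: intermingled classes (an XOR-like layout).
mixed_pts = [(0, 0), (1, 1), (0, 1), (1, 0)]
mixed_lab = [0, 0, 1, 1]

print(min_linear_errors(separable_pts, separable_lab))
print(min_linear_errors(mixed_pts, mixed_lab))
```

For the separated clusters a line with no errors exists, whereas for the intermingled layout every line misclassifies at least one point. The same kind of check could be applied to any pair of classes before trusting a linear classifier trained on them.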

The implicit assumption made by Fabbri *et al*. is that a complex statistical method (pFDA) can somehow draw that decision boundary to make an accurate classification. While it is true that in some cases statistical methods can achieve surprising results, we show below that is not the case here. Instead, the complex and opaque statistical methods used by Fabbri *et al*. obscured a fundamental difficulty with classification that arises from the use of these training datasets.

Moreover, statistical analysis is built upon a chain of steps. Performed properly, such analysis can allow surprising conclusions to be drawn with great scientific rigor. However, we find critical problems with many of the steps Fabbri *et al*. conducted in their analysis. Any one of these flaws is sufficient to greatly diminish the probative value of their conclusions; some are sufficient to refute those conclusions altogether. Our study examines each of the problems in detail to elucidate the issues that future research must address when using pFDA, bone compactness metrics, and related methods.

We conclude that the number and nature of the problems with the results reported by Fabbri *et al*. render the approach used in that study largely invalid and of little evidentiary value. The most generous interpretation of those results is that *Spinosaurus* and *Baryonyx* (but not *Suchomimus*) have a slight statistical affinity with animals that have a range of semiaquatic adaptations. A result of this kind would not be helpful in choosing among the conflicting hypotheses for spinosaurid ecology.

As we demonstrate in this study, the pFDA method must be used with some caution because it neither includes tests of its distributional assumptions on the datasets nor natively provides estimates of uncertainty in its classifications that arise from the sample size or other properties of the dataset. Here, we supplement pFDA with explicit tests of the distribution assumptions and find that the dataset used by Fabbri *et al*. fails to meet those tests. Indeed, some portions of the dataset are statistically indistinguishable from uniform random distributions of points.

## Materials and methods

### Institutional Abbreviations

BSPG Bayerische Staatssammlung für Paläontologie und Geologie, Munich, Germany.

FSAC Faculté des Sciences Aïn Chock, University of Casablanca, Morocco.

MNBH Musée National Boubou Hama, Niamey, République de Niger.

CMN Canadian Museum of Nature, Ottawa, Canada.

UCRC University of Chicago Research Collection, Chicago, United States of America.

### Datasets and methods of Fabbri *et al*.

The materials and methods we used in our study are best understood in the context of those employed by Fabbri *et al*., so we first summarize the relevant datasets and methods reported in their paper and a subsequent preprint [15,16].

The pFDA method, described below, uses a training dataset comprising several exemplar datasets, each of which is divided into subsets known as classes. Data points in each class share a class property, such as a value or range of values of one or more categorical variables. The algorithm analyzes the training data, along with test data points. As implemented by Fabbri *et al*., each datapoint corresponds to a specimen of a particular taxon, and two categorical variables are used to assign taxa to classes.

Fabbri *et al*. represented each specimen in a sampled taxon with a two-dimensional data point (log_{10}(*MD*), *Cg*). The parameter *MD* is the maximum diameter of the sampled bone (either femur or rib). The parameter *Cg* can be calculated from an image of a bone cross section binarized based on presence or absence of bone. This image can be derived either from a thin-section photomicrograph or a CT radiograph, as detailed below. Fabbri *et al*. gathered the majority of *Cg* and *MD* measurements in their datasets from the literature; when they collected their own measurements, they used the BoneProfileR program [17,18] to calculate *Cg*.
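As a rough illustration of the quantities just described, the following Python sketch computes a pixel-count version of *Cg* from a binarized cross section and assembles the (log_{10}(*MD*), *Cg*) data point. It is a simplification under stated assumptions and is not the procedure implemented in BoneProfileR: the image is assumed to be a list of rows of 0/1 values (1 = bone), background pixels connected to the image border are treated as lying outside the section, and everything else (bone plus enclosed cavities) counts toward the total cross-sectional area. The function names, the toy image, and the *MD* value are hypothetical.

```python
import math
from collections import deque

def global_compactness(image):
    """Bone area divided by total cross-sectional area, by pixel counting.

    image: list of equal-length rows of 0/1 values, where 1 marks bone.
    """
    rows, cols = len(image), len(image[0])
    outside = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed a flood fill with every non-bone pixel on the image border.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and image[r][c] == 0:
                outside[r][c] = True
                queue.append((r, c))
    # Spread through 4-connected non-bone pixels to mark the exterior.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and image[nr][nc] == 0 and not outside[nr][nc]):
                outside[nr][nc] = True
                queue.append((nr, nc))
    bone = sum(sum(row) for row in image)
    total = sum(1 for r in range(rows) for c in range(cols)
                if not outside[r][c])
    return bone / total

def data_point(max_diameter, image):
    """Return the (log10(MD), Cg) pair for one specimen."""
    return math.log10(max_diameter), global_compactness(image)

# Hypothetical toy section: a one-pixel-thick ring of bone around a cavity.
toy_section = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
log_md, cg = data_point(100.0, toy_section)  # MD value is made up
print(log_md, cg)
```

In the toy section, 8 bone pixels surround 1 cavity pixel, so the sketch yields *Cg* = 8/9. Real sections with open or broken outlines would need a different treatment, because the flood fill would leak into the medullary cavity.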

The two primary datasets employed by Fabbri *et al*. are based on femoral cross sections from one set of taxa and dorsal rib cross sections from a different but largely overlapping set of taxa. Most taxa are represented by a single data point in the femoral dataset, the rib dataset, or both. Multiple datapoints were included for some taxa. Fabbri *et al*. constructed a phylogenetic consensus tree across the taxa. The tree is used in their statistical methods to correct for phylogenetic bias in either phylogenetic generalized least squares (PGLS) regression or pFDA analysis, using standard methods.

In the initial phase of their analysis, Fabbri *et al*. assigned two categorical variables to each taxon data point. For clarity and concision, we abbreviate the functional groups identified by Fabbri *et al*., using *F* and *D* to designate “flying” and “diving,” respectively, for the two lifestyle behaviors they identified, along with the values (0–2): “absent” (0), “present but infrequent” (1), and “frequent” (2). Here we abbreviate each taxon group as *FxDy*, where *x* and *y* denote the values of the flying and diving variables, respectively.

Habitual (frequent) divers—“subaqueous foragers” in the terminology of Fabbri *et al*.—include taxa such as the emperor penguin *Aptenodytes*, which was assigned categorical variables *F0D2*. As another example, the razorbill *Alca torda* is an extant seabird that frequently flies and dives, so it was classified as *F2D2*.
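The coding scheme can be written out as a short Python sketch. The helper names are hypothetical; the 0–2 scale, the *FxDy* labels, and the two example taxa come from the text above, and the two-class grouping (nonflying subaqueous foragers versus everything else) is the one Fabbri *et al*. used in the pFDA stage of their analysis.

```python
def functional_group(flying: int, diving: int) -> str:
    """Build the FxDy label from the two categorical lifestyle codes.

    Each code is 0 (absent), 1 (present but infrequent), or 2 (frequent).
    """
    for name, value in (("flying", flying), ("diving", diving)):
        if value not in (0, 1, 2):
            raise ValueError(f"{name} code must be 0, 1, or 2, got {value!r}")
    return f"F{flying}D{diving}"

def pfda_class(flying: int, diving: int) -> str:
    """Collapse the FxDy label into the two classes used for pFDA."""
    if functional_group(flying, diving) == "F0D2":
        return "subaqueous forager"
    return "non-subaqueous forager"

# Codes for the two examples given in the text.
examples = {
    "Aptenodytes": (0, 2),  # emperor penguin: nonflying, habitual diver
    "Alca torda": (2, 2),   # razorbill: frequent flier, habitual diver
}
for taxon, (f, d) in examples.items():
    print(taxon, functional_group(f, d), pfda_class(f, d))
```

Under this two-class grouping, the razorbill falls into the "non-subaqueous forager" class despite diving habitually, because only the *F0D2* combination defines the forager class.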

In summary, the datasets include two skeletal elements (femur and rib), two measured variables (*Cg* and *MD*), and two categorical variables (flying and diving), each of which can be assigned any of three values. Fabbri *et al*. make no distinction in their coding of variables between extinct and extant status, but that distinction is important in some of our analyses. Table 1 summarizes some of the important subgroups.

Fabbri *et al*. listed the taxon names, which are stated to be shared between the rib and femoral datasets, in their Supplementary Table 1 [15]. Tables of taxon data in a spreadsheet file, along with an R script computer code, were published in their Supplementary Dataset [19]. However, the code does not read the published spreadsheet and instead reads a set of four different spreadsheet data files (Table 2). These previously unpublished files, provided to us by Fabbri *et al*. via email, are in the S1–S4 Files accompanying this article.

Comparison of the data in the published spreadsheets with the files from Table 2, the Supplementary Table 1 of ref. [15], and the text of that paper reveals unexplained discrepancies. Two taxa present in the files from Table 2 (one each in the femur and rib files) do not appear in the associated phylogenetic tree. As a result, these two taxa are automatically discarded by the tree-matching routine in the pFDA code used by Fabbri *et al*. Whereas the body of the paper states that 83 taxa were shared between femur and rib datasets, we count 76 shared taxa. We were nevertheless able to replicate the published results of Fabbri *et al*. using the code and the previously unpublished data files (Table 2), which indicates that they contain the data used to generate the results of the paper.
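Consistency checks of this kind reduce to simple set operations. The Python sketch below shows one way such a comparison could be made; it is not the code used in this study, the function name is hypothetical, and the taxon names are placeholders rather than entries from the actual files.

```python
def compare_taxon_lists(femur_taxa, rib_taxa, tree_tips):
    """Report shared taxa and taxa that a tree-matching step would drop."""
    femur, rib, tips = set(femur_taxa), set(rib_taxa), set(tree_tips)
    return {
        "shared": sorted(femur & rib),              # in both datasets
        "femur_not_in_tree": sorted(femur - tips),  # dropped from femur data
        "rib_not_in_tree": sorted(rib - tips),      # dropped from rib data
    }

# Placeholder names only; these are not taxa from the real datasets.
femur_taxa = ["taxon_a", "taxon_b", "taxon_c", "taxon_x"]
rib_taxa = ["taxon_b", "taxon_c", "taxon_d", "taxon_y"]
tree_tips = ["taxon_a", "taxon_b", "taxon_c", "taxon_d"]

report = compare_taxon_lists(femur_taxa, rib_taxa, tree_tips)
print(len(report["shared"]), report["femur_not_in_tree"],
      report["rib_not_in_tree"])
```

In practice, taxon names would first need to be normalized (spelling, underscores versus spaces, synonyms) before the sets are compared, otherwise spurious mismatches would inflate the counts.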

Datasets ds1 and ds2 (Table 2) include all femoral and rib taxa, respectively. Dataset ds3 is a subset of ds1, and ds4 a subset of ds2, in which selected taxa were removed, as detailed below. Supplementary Table 6 of [15] lists all of the taxa removed from ds2 to form ds4. But Supplementary Table 5, the corresponding table for ds1 and ds3, omits without explanation the taxa *Choeropsis liberiensis* and *Desmostylus hesperus*, which are in ds1 but missing from ds3.

In their main text and Supplementary Tables 5 and 6, Fabbri *et al*. labeled the taxa to be removed as “deep diving.” Elsewhere in their Supplementary information, as well as in the file names, they instead used the term “pelagic.” These terms are not interchangeable, as they convey very distinctive—and sometimes nonoverlapping—lifestyles.

#### Analytical stages of Fabbri *et al*.

Fabbri *et al*. employed a three-stage analytical method. The first stage performed PGLS to regress *Cg* (as the dependent variable) against the categorical lifestyle variable *D* (results in Table 1 of ref. [15]), and then regressed *Cg* (as dependent variable) against all combinations of the categorical lifestyle variables and *MD*. ANOVA results were presented [15: Supplementary Tables 3 and 4]. The results show very weak but statistically significant correlations in some cases, with *P* values reported as low as 0 (presumably due to rounding) but *R*^{2} = 0.176 for femoral data ds1, and *R*^{2} = 0.108 for rib data ds2.

In the second stage of their analysis, datasets were prepared for pFDA. Datasets for each skeletal element (*i*.*e*., femur or rib) were sorted by the categorical variables to yield two classes: nonflying subaqueous foragers (*F0D2* using the abbreviations of this study) and everything else (*F0D0*, *F0D1*, *F1D0*, *F1D1*, *F1D2*, *F2D2*); the class assignments are listed in the spreadsheet files of Table 2. These two classes were subsequently used for the classification of test taxa. Fabbri *et al*. stated that “our inference has only two possible outcomes: subaqueous forager or non-subaqueous forager” [15].

The third and final stage of their analysis applied the pFDA algorithm to process the training datasets and then classify other data points representing test taxa, including spinosaurids. The algorithm is coded in an R script that builds on base-level pFDA code deposited by Motani and Schmitz in an online repository [20].

pFDA requires a phylogenetic tree across all taxa as input. The original pFDA papers [21,22], and the available code repository [20] used training datasets of entirely extant taxa. The same is true for all prior uses of pFDA that we could find via searches on Google Scholar for citations of the original papers or repository (representative examples include [23–26]). Fabbri *et al*. instead used training sets that mix extant and extinct taxa. To account for uncertainty in the timing of the phylogenetic tree nodes for extinct taxa, their method adopted an *ad hoc* approach, which is not referenced as occurring elsewhere: “We repeated analyses across 100 informal supertrees with varying branch lengths to account for stratigraphic uncertainty” [15]. This was done for both their PGLS and pFDA analyses. Their R code creates the random trees and loops over them.

This stage resulted in a set of assignments classifying the test taxa into the two classes in the training set. For each assignment, it generated a posterior probability of class membership in the *D =* 2 class, denoted *P*_{2} hereafter. Because each run of the pFDA produced 100 results (one for each of the 100 random phylogenetic trees), Fabbri *et al*. presented the median of the set of *P*_{2} values. The default classification criterion was to assign a test taxon to a class if the posterior probability was ≥0.5; the classifications across the 100 trials were reported as a count of the number of trials classified as belonging to the *D =* 2 class.
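The two summary quantities described above (the median *P*_{2} and the count of runs at or above the 0.5 criterion) can be sketched in a few lines of Python. This is an illustrative sketch only, not the R code of Fabbri *et al*.; the function name `summarize_p2` and the example values are hypothetical.

```python
from statistics import median

def summarize_p2(p2_values, threshold=0.5):
    """Summarize posterior probabilities from runs over many random trees.

    Returns the median P2 and the number of runs with P2 >= threshold,
    i.e., the number of runs that classify the taxon as D = 2.
    """
    count_d2 = sum(1 for p in p2_values if p >= threshold)
    return median(p2_values), count_d2

# Hypothetical values for one test taxon across 100 random trees
# (made up for illustration; not taken from any dataset).
p2_example = [0.62] * 60 + [0.45] * 40
median_p2, count_d2 = summarize_p2(p2_example)
print(median_p2, count_d2)
```

In this made-up case the taxon would be reported as *D =* 2 on the strength of its median, even though 40 of the 100 runs fall below the criterion.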

### Methodology for measurement of bone compactness

We performed new *Cg* measurements on specimens not included in the datasets of Fabbri *et al*., and we also attempted to replicate some of their measurements of *Cg*. We used Materialise Mimics Innovation Suite 23.0 to segment computed-tomographic (CT) scans of specimens new to this study. We positioned long bones for cross section perpendicular to the shaft axis. We used a threshold that highlighted bone and exported that highlighted image of the cross-sectional slice.

We used the BoneProfileR R package [17,18] and the binarized femoral slice images provided by Fabbri *et al*. in their Fig 1 and S1–S5 Figs [15] to measure *Cg*. To ensure that pixels were correctly read by BoneProfileR, Affinity Photo was used to binarize all new images. Because user-input parameters for the BoneProfileR program were not reported in Fabbri *et al*., some variance in our results is expected. For complete sections, we used the ontogenetic center (recommended by the authors of BoneProfileR [17]) in the BP_EstimateCompactness function and defaults of 60 angles and 100 distances. We collected bone compactness data from the flexit and flexit-with-pi rotation models. There were three partial cross sections, which were run using a user-defined center with setting partial = TRUE in the BP_EstimateCompactness function. A few of the cross sections published by Fabbri *et al*. are of low resolution, necessitating rebinarization.

### Computed tomography

In order to provide measurements of *Cg* for spinosaurids on specimens not considered in Fabbri *et al*., CT scans were acquired for femora of *Suchomimus tenerensis* (MNBH GAD500, MNBH GAD72) and *Spinosaurus aegyptiacus* (FSAC-KK 11888) at the University of Chicago Hospitals by Dr. Nicholas Gruszauskas and Dr. David Klein using a Philips Brilliance iCT 256-slice multi-detector CT scanner. CT scans for the additional *Spinosaurus sp*. femora (CMN 41869, CMN 50382) were generated by Vincent Bolduc at the Transportation Safety Board of Canada’s North Star Imaging CT scanner. Scan settings for each of the specimens are included in S1 Table.

### Statistical method implementations

All pFDA results in this paper were based on R scripts and associated data files obtained from the authors of Fabbri *et al*. [15] and on base-level pFDA code deposited by Motani and Schmitz in an online repository [20]. Bootstrap trials and related modifications were implemented in R, with minimal changes necessary to the base-level pFDA code for debugging.

Bootstrapping pFDA consists of randomly selecting with replacement a sample of the dataset taxa of the same length as the original dataset, and then running the analysis on each such trial set. The selection process results in bootstrap samples that may omit some specimens from the original dataset and may include other specimens more than once. As is typical in bootstrap analysis, 2000 trials were done for each bootstrap run [27–29]. Consistent with the approach of Fabbri *et al*., we created 100 random phylogenetic trees for each such trial, so a single bootstrap analysis of a dataset created 200,000 individual pFDA runs.
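The resampling step can be illustrated with a minimal Python sketch. It is not the R implementation used in this study; the taxon labels, the dataset size of 10, and the function name `bootstrap_resample` are hypothetical, and only the trial counts (2000 and 100) come from the text.

```python
import random

N_BOOTSTRAP_TRIALS = 2000   # bootstrap resamples per run (from the text)
N_RANDOM_TREES = 100        # random trees per resample (from the text)

def bootstrap_resample(taxa, rng):
    """Draw len(taxa) taxa with replacement from the original list."""
    return [rng.choice(taxa) for _ in taxa]

# Hypothetical taxon labels standing in for the rows of a training dataset.
taxa = [f"taxon_{i}" for i in range(10)]
rng = random.Random(1)      # fixed seed so the sketch is reproducible
resample = bootstrap_resample(taxa, rng)

omitted = sorted(set(taxa) - set(resample))        # taxa left out of this resample
total_runs = N_BOOTSTRAP_TRIALS * N_RANDOM_TREES   # individual pFDA runs
print(len(resample), len(omitted), total_runs)
```

Each resample has the same length as the original dataset, so any taxon drawn more than once necessarily displaces another.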

Phylogenetic trees must be pruned appropriately, which was accomplished in the same manner as pFDA, using the same R library functions that were employed by the pFDA code from Motani and Schmitz [20] that was used by Fabbri *et al*. As a verification step, the phylogenetic matrices and transformed datasets were independently calculated with Phylogenetics-for-Mathematica [30]; we found identical results within expected numerical precision. Output data from the pFDA functions, including the confusion matrix and posterior probabilities, were saved in files for later analysis and plotting. Confidence intervals on bootstrap output data were computed using the bias-corrected and accelerated (BCa) bootstrap algorithm [27–29], which is based on both bootstrap and jackknife trials. This was implemented by the authors in Mathematica. The R code, Mathematica bootstrap code, and other Mathematica code used in this study are available in an online repository [31].

In our checks for possible selection bias in the taxa included in the datasets, we performed permutation tests on the rank distribution of *Cg* between extinct and extant taxa, using Mathematica to implement standard methods [32]. Statistical analysis of the output of the trials gathered in R, along with the data tables and figures, were generated with code written in Mathematica 13.2 [33]. Statistical tests, such as Brown-Forsythe, Conover, and Levene variance equivalence tests, used standard library functions in Mathematica. Other library functions were used to fit distributions in the construction of smooth kernel distribution plots and quantile-quantile plots.
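A permutation test on ranks can be sketched as follows. This Python sketch is not the Mathematica code used in this study. The test statistic (absolute difference in mean rank), the function names, and the *Cg* values are assumptions chosen for illustration.

```python
import random

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def permutation_p_value(group_a, group_b, n_perm, rng):
    """Two-sided permutation P value for a difference in mean rank."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)

    def mean_rank_gap(r):
        return abs(sum(r[:n_a]) / n_a - sum(r[n_a:]) / n_b)

    observed = mean_rank_gap(ranks)
    shuffled = ranks[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)            # random reassignment of group labels
        if mean_rank_gap(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)     # add-one correction avoids P = 0

# Hypothetical Cg values (illustrative only, not measured data).
extant = [0.55, 0.60, 0.62, 0.58, 0.65, 0.57, 0.61, 0.59]
extinct = [0.80, 0.85, 0.90, 0.83, 0.88, 0.86, 0.82, 0.87]
p_value = permutation_p_value(extant, extinct, 2000, random.Random(0))
print(p_value)
```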

Code was written by the authors for simple LDA (linear discriminant analysis) and a Monte Carlo simulation using LDA, which are described in the next subsection. Eq (3) in the section below on the pFDA method was derived in Mathematica.

Statistical distributions were fit to data using standard library functions in Mathematica. Code written by the authors in Mathematica calculated AIC and AIC_{c} values and Akaike weights for distribution fits using standard methods [34]. Mathematica was also used to generate all the graphs and plots in the paper.

To test whether data points generated by pFDA exhibit genuine clustering, we used the Hopkins statistic. Code implementing Hopkins statistic tests was written by the authors in Mathematica, using the published algorithms [35,36]. Under the null hypothesis, the Hopkins statistic is expected to approximate a beta distribution, Beta(*m*, *m*), where *m* is the number of points sampled. As recommended in the literature [35,36], a random sample of 20% of the points in a test set was used, and the *H* statistic was calculated as the mean of 100 random trials. As an additional verification, a Monte Carlo suite of 10,000 pseudorandom examples of a uniformly random distribution was generated and tested to build an empirical sampling distribution for the null hypothesis. This was done separately for each of the variants of the Hopkins statistic test, as well as for each point count in a set being tested.
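One common form of the Hopkins statistic can be sketched in Python as follows. This is not the Mathematica implementation used in this study, and it covers only one variant. It assumes that distances are raised to the power of the dimension, that probe points are drawn uniformly from the bounding box of the data, and that values near 0.5 indicate spatial randomness while values near 1 indicate clustering. All function names and example data are hypothetical.

```python
import math
import random

def nearest_distance(point, points, skip_index=None):
    """Distance from `point` to its nearest neighbor in `points`."""
    best = math.inf
    for index, other in enumerate(points):
        if index != skip_index:
            best = min(best, math.dist(point, other))
    return best

def hopkins_statistic(points, rng, sample_fraction=0.2, n_trials=100):
    """Mean Hopkins statistic over n_trials random subsamples."""
    n_points = len(points)
    dim = len(points[0])
    m = max(1, int(sample_fraction * n_points))
    lows = [min(p[k] for p in points) for k in range(dim)]
    highs = [max(p[k] for p in points) for k in range(dim)]
    trial_values = []
    for _trial in range(n_trials):
        # W: nearest-neighbor distances for m randomly chosen data points.
        chosen = rng.sample(range(n_points), m)
        w_sum = sum(nearest_distance(points[i], points, skip_index=i) ** dim
                    for i in chosen)
        # U: nearest-data-point distances for m uniform random probe points.
        u_sum = 0.0
        for _probe in range(m):
            probe = [rng.uniform(lows[k], highs[k]) for k in range(dim)]
            u_sum += nearest_distance(probe, points) ** dim
        trial_values.append(u_sum / (u_sum + w_sum))
    return sum(trial_values) / n_trials

rng = random.Random(0)
# Hypothetical test sets: two tight clusters versus uniform scatter.
clustered = ([(rng.gauss(0.2, 0.02), rng.gauss(0.2, 0.02)) for _ in range(50)] +
             [(rng.gauss(0.8, 0.02), rng.gauss(0.8, 0.02)) for _ in range(50)])
uniform = [(rng.random(), rng.random()) for _ in range(100)]
h_clustered = hopkins_statistic(clustered, rng)
h_uniform = hopkins_statistic(uniform, rng)
print(h_clustered, h_uniform)
```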

#### The pFDA method.

Fabbri *et al*. used pFDA to reach their conclusions regarding the identification of habitual behaviors in extinct tetrapods. pFDA, a phylogenetic adaptation of flexible discriminant analysis (FDA), was first applied to study nocturnality in dinosaurs via statistical analysis of eye and scleral ring shape [21,22]. FDA, in turn, is a generalization by Hastie *et al*. [37] of Fisher’s much earlier linear discriminant analysis (LDA) [38].

Fisher created LDA to separate classes of data. Each class is represented by a set of points (in dimensions 2 or greater) drawn from multivariate normal distributions. The distribution for each class must have a distinct mean (centroid), but all classes must share the same covariance matrix. Later work has shown that LDA is closely related to ANOVA and regression techniques [39]. LDA computes the coordinates of a line, called the decision boundary, that divides the points into regions that can be classified into the nearest class. LDA has previously been used with bone-compactness data to discriminate among (classify) groups without incorporating phylogenetic data in the analysis [40–44].

The general form of the probability density function for a bivariate normal distribution is given by Eq (1), where *x* is a two-dimensional position vector, *μ* is a two-dimensional position of the centroid of the distribution, Σ is a 2×2 covariance matrix that has |Σ| as its determinant, and the superscript ^{T} denotes matrix transpose. The probability function for class *k* is given by

$$p(x) = \frac{1}{2\pi\sqrt{|\Sigma|}}\,\exp\!\left[-\frac{1}{2}\,(x-\mu)^{T}\,\Sigma^{-1}\,(x-\mu)\right] \tag{1}$$

In the case of two-class or binary classification, LDA assumes that there is a different distribution for each class, with centroids *μ* = *μ*_{1}, *μ*_{2} for classes *k* = 1,2. The centroids must be distinct (*i*.*e*., *μ*_{1}≠*μ*_{2}), but both distributions have the same covariance matrix Σ. Mathematically, this assumption ensures that the decision boundary is a line [39].

A related classification method that allows each set to have a different covariance matrix is known as quadratic discriminant analysis (QDA) because the decision boundary between the datasets is a quadratic curve (*i*.*e*., a conic section). If LDA were applied to such a dataset, however, one would expect highly inaccurate classification because the straight-line assumption is violated [39].

LDA classifies points by computing the Mahalanobis distance from a test point to the centroid of two or more reference groups, using the pooled, within-group covariance matrix [39]. The squared Mahalanobis distance appears in an argument to the exponential function in Eq (1). In the case of a distribution with unit variance and a covariance matrix that is the identity matrix, *i*.*e*., Σ = *I*, it reduces to the Euclidean distance.

In LDA and FDA, a fundamental assumption is that a test point can be classified by assigning it to the group that has the smallest Mahalanobis distance between the point and the group centroids μ_{1}, μ_{2} (*i*.*e*., the multidimensional means of the classes). The locus of points equidistant between group centroids corresponds to the decision boundary; for LDA and pFDA, that is a line. Hastie *et al*. generalized LDA to FDA by using a general framework that allowed general nonlinear decision boundaries. They also added support for a Bayesian approach, using prior probabilities [37].
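The nearest-centroid rule can be made concrete with a small Python sketch for the two-dimensional case. It is an illustration under stated assumptions, not code from the pFDA repository: the function names, the centroids, and the covariance matrix are hypothetical, and the pooled covariance matrix is taken as given rather than estimated from data.

```python
def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance of 2-D point x from centroid mu.

    cov is a 2x2 covariance matrix given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))   # inverse of a 2x2 matrix
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

def classify(x, centroids, cov):
    """Return the 1-based label of the class with the nearest centroid."""
    distances = [mahalanobis_sq(x, mu, cov) for mu in centroids]
    return distances.index(min(distances)) + 1

# Hypothetical centroids and shared covariance matrix (illustrative only).
centroids = [(1.0, 1.0), (-1.0, -1.0)]
pooled_cov = ((2.0, 0.5), (0.5, 1.0))
print(classify((0.5, 0.2), centroids, pooled_cov))

# With the identity matrix, the squared distance is the squared Euclidean distance.
identity = ((1.0, 0.0), (0.0, 1.0))
print(mahalanobis_sq((3.0, 4.0), (0.0, 0.0), identity))
```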

All of these methods (LDA, FDA, pFDA) perform a geometric transformation to find the directions in which the variance between the sets is minimized and maximized. This acts as dimensional reduction; in a system with two classes and two-dimensional data points, the geometric transformation projects the data into one-dimensional points called discriminants that are used to perform classification and assign posterior probabilities of class membership [39].

Motani and Schmitz [21] introduced pFDA as a specific instance of FDA in which a phylogenetic-bias correction is performed in a similar fashion to PGLS, by using branch lengths from phylogenetic trees that cover the taxa in the analysis to determine phylogenetic correlations among taxa under an evolutionary model, such as Brownian motion. In principle, FDA could allow the use of nonlinear decision boundaries, but pFDA as implemented by Motani and Schmitz [20] (and used by Fabbri *et al*.) is restricted to using linear boundaries, thereby assuming that both groups have the same covariance matrix, as in Eq (1). pFDA is thus a phylogenetic version of LDA. As currently conceived, pFDA does not allow classes to have different covariance matrices as with QDA, nor does it allow other classes of curves or relaxation of distributional assumptions. Conceivably a pQDA or a quadratic variant of pFDA could be developed, but this has not been proposed, nor is it used by Fabbri *et al*.

The procedure presented in Motani and Schmitz [21] uses extant taxa that have well-constrained phylogenies and branch lengths for the training set. The use of data from extant taxa in the training set has several advantages. One is that the phylogenies are likely to be better known, reducing the possibility that error from the phylogeny could confound the results. However, the primary reason is that Motani and Schmitz were seeking to classify a behavioral pattern (*e*.*g*., whether daily activity was primarily nocturnal or diurnal), which can be observed in living organisms but is not directly accessible for extinct taxa.

Using extant taxa to make a classification inference on extinct taxa implicitly presupposes that the statistical distribution of the variables used in the analysis is the same for extinct and extant taxa. Otherwise, one could come up with a criterion boundary based on extant taxa that would have unknown relevance to the extinct test taxon.

In the case of the Motani and Schmitz study, the variables were eye-related dimensions, which have a strong theoretical basis in optical physics, so consistency in distribution across millions of years of evolution is highly plausible. As a result, it was not a major concern for their study. However, such temporal invariance in distribution is not automatically guaranteed when pFDA is applied to other variables.

In an extensive literature search, we were unable to find any other study that trained the pFDA classifier on mixed extinct and extant taxa, as Fabbri *et al*. did. To address the potentially greater phylogenetic uncertainty with extinct taxa, Fabbri *et al*. created 100 trees of random branch length, each having its own associated phylogenetic covariance matrix. The matrices are sequentially passed to code that performs FDA, resulting in 100 classification probabilities for each test taxon—one for each random tree. Any new method such as this should be accompanied by evidence that it performs as intended to address the problem of uncertain phylogeny and that the parameters chosen—*e*.*g*., the number of trees, the random assignment of branch lengths, ignoring different tree topologies—are sufficient for the classification task. Fabbri *et al*. presented no such evidence or justification.

Fabbri *et al*. used a low threshold of 50% on the median probability, which results in weak classifications. The use of the median is not justified because a median discards random trees that, by construction, are all equally likely to represent past evolution. A better approach, which is widely used in the statistical literature, is to estimate a confidence interval, such as the 95% confidence interval.

Although in principle the phylogenetic signal could have a strong effect, in practice, Fabbri *et al*. find very little evidence of phylogenetic signal in their dataset, with Pagel’s λ parameter taking values 0.02≤λ≤0.07 across the various datasets and trials. This is consistent with other studies of *Cg* that used comparable datasets to analyze convergent features across many clades [45,46]. As a result, one would expect little difference between these results and those obtained with ordinary LDA. In view of that and the uncertainty in the tree for extinct taxa, we question whether the use of a phylogenetic method is worth the added complexity for this dataset.

To illustrate the properties of LDA and pFDA, we consider a special case of Eq (1) for two distributions having the properties given in Eq (2), where the covariance matrix Σ is identical for both distributions and is a multiple *σ*^{2} of the 2×2 identity matrix.

$$\Sigma = \sigma^{2}\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix},\qquad \mu_{2} = -\mu_{1} \tag{2}$$

The centroids of the two distributions, *μ*_{1} and *μ*_{2}, are reflections of each other through the origin, across the line *y* = −*x*. It is easily shown that the decision boundary must be the perpendicular bisector of the line segment between the centroids *μ*_{1} and *μ*_{2}. In this case, *μ*_{1} and *μ*_{2} both lie on the line *y* = *x*, and thus *y* = −*x* is the bisector. Fig 2 plots 1000 points drawn from each of two distributions, denoted group 1 and group 2. For both groups, *σ* = 0.55, and *σ* plays a role in these bivariate distributions that is very similar to the standard deviation in a conventional univariate normal distribution. The distance between the centroid of either distribution and the decision boundary is *d* = 1.7 = 3.1*σ*. As a result, the concentration of points matches what one would expect of a univariate normal distribution: most of the points are concentrated near the centroid and thus appear on the same side of the decision boundary as the centroid. Such points would be correctly classified by the decision boundary.

Fig 2. (A) 1000 pseudorandom points drawn from each of two multivariate normal distributions given by Eqs (1) and (2) and *σ* = 0.55, with points from each distribution colored according to the legend. The decision boundary for LDA is given by the red line; points above the line are classified as group 1, points below the line are classified as group 2. Note that one point from group 1 lies on the other side of the decision boundary and is incorrectly classified as group 2. One point from group 2 is similarly misclassified. The centroid of each distribution is denoted by a black cross, the distance *d* from the centroid to the decision boundary is denoted by a dashed blue line. The confusion matrix *c* (Eq (5) in S1 Appendix) is shown. (B) 59 points from distributions with the same centroids as (A) but with *σ* = 1.414. The higher value of *σ* leads to a larger number of points being misclassified. (C) The underlying probability density functions for the same distributions as in (B). The distributions of blue and gold points are equal at the red decision boundary line *y* = −*x*. Abbreviations: G1, group 1; G2, group 2.

Points that fall on the opposite side of the decision boundary are considered *misclassified*. Because these points are part of the training dataset, they are termed training data errors [39]. Because the points are highly concentrated and the decision boundary is relatively far in terms of *σ*, there are only a few of these points in the random sample shown. Fig 2B shows an example with the same distribution centroids, but with 59 points in each group and *σ* = 1.414, such that *d* = 1.2*σ*. The shorter distance in terms of *σ* greatly increases the number of training data errors.

The fundamental idea behind LDA is shown in a plot of the probability density functions for the multivariate normal distributions given by Eqs (1) and (2) (Fig 2C) with the same *σ* = 1.414 used to generate Fig 2B. The two normal distributions intersect at a 3-D curve that falls along the line *y* = −*x* when projected onto the (*x*, *y*) plane. That decision boundary is where the probabilities of membership in both probability density functions are equal. Away from that boundary, one probability is greater than the other.

One can calculate the exact probability that a point will lie on the wrong side of the decision boundary by integrating the probability density function over the half plane defined by the wrong side of the decision boundary to yield Eq (3), where erfc() is the complementary error function and *d* is the distance from the distribution centroid to the decision boundary.

$$P_{\mathrm{wrong}} = \frac{1}{2}\,\operatorname{erfc}\!\left(\frac{d}{\sqrt{2}\,\sigma}\right) \tag{3}$$

This relation matches the familiar case of the marginal distribution of points in a normal distribution, expressed in terms of the standard-deviation-adjusted distance (*i*.*e*., the ratio *d*/*σ*). Thus, we expect from Eq (3) that 15.87% of the points would be misclassified if *d* = *σ*, 2.5% of the points to be misclassified if *d* = 1.96*σ*, and 1% if *d* = 2.33*σ*, following the usual one-sided rules of thumb.

In the example shown in Fig 2B, *P*_{wrong} = 0.115, so we expect that about 11.5% of points in each group will be misclassified if the number of points is very large. For the distributions in Fig 2A, *P*_{wrong} = 0.000955, or roughly 1 in 1000. The one classification error seen in group 1 and one error seen in group 2 thus match expectations. As the number of trials increases, the number of incorrect points converges toward *n*×*P*_{wrong}, with some statistical variation.
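The misclassification probabilities quoted above can be checked numerically. The Python sketch below assumes that Eq (3) takes the standard one-sided normal-tail form *P*_{wrong} = ½ erfc(*d*/(√2 *σ*)), which reproduces the values quoted in the text for *d* = 1.96*σ* and *d* = 2.33*σ*. It further assumes that the spread of points perpendicular to the decision boundary has standard deviation *σ*. The function names are hypothetical, and the simulation samples only the coordinate perpendicular to the boundary, which is sufficient for an isotropic distribution.

```python
import math
import random

def p_wrong(d, sigma):
    """Probability of lying on the wrong side of the boundary (assumed form of Eq (3))."""
    return 0.5 * math.erfc(d / (math.sqrt(2.0) * sigma))

def simulated_error_rate(d, sigma, n_points, rng):
    """Monte Carlo estimate of the misclassified fraction.

    The signed distance of each point from the boundary is drawn from
    N(d, sigma^2); a negative value means the point is misclassified.
    """
    wrong = sum(1 for _ in range(n_points) if rng.gauss(d, sigma) < 0.0)
    return wrong / n_points

rng = random.Random(42)
for sigma in (0.55, 1.414):          # the two sigma values used in Fig 2
    analytic = p_wrong(1.7, sigma)   # d = 1.7, as in the text
    simulated = simulated_error_rate(1.7, sigma, 200_000, rng)
    print(sigma, analytic, simulated)
```

The simulated fraction fluctuates around the analytic value, which illustrates why a finite training set yields a count of misclassified points that only approximates *n*×*P*_{wrong}.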

Eq (3) reveals an important principle: even when we use simulated data drawn from multivariate normal distributions, classification via LDA or FDA can *never be error-free*. That follows from the simple fact that the domain of the multivariate normal distribution ranges across the interval (−∞, ∞) in each independent variable, whereas the distance from the distribution centroids to the decision boundary is finite. Therefore, there can always be valid points from one distribution that lie on the other side of any decision boundary—not as an outlier (which implies an erroneous point) but rather as an entirely valid data point that LDA will misclassify. Note that this effect does *not* depend on the sample size. As the number of data points in the training set grows to infinity, the error converges to Eq (3).

#### Assigning confidence to classifications.

Sound statistical practice recognizes that random variations do occur and can lead to false inferences, even when statistical methods are applied correctly. Inferences are thus routinely qualified and evaluated. Results should be qualified by providing quantitative estimates of their statistical quality, such as the *P* value, confidence level, confidence interval, or other measures. Those quality estimates should then be evaluated against widely accepted thresholds for statistical significance, such as the current de facto standard of 95% significance, often expressed as 5% random error, *P*≤0.05, or a 95% confidence interval (CI). Although studies do sometimes employ other standards with justification (ref. [47] and S1 Appendix, section 3), Fabbri *et al*. selected the conventional 95% significance threshold for their PGLS and ANOVA analyses [15: Table 1, Supplementary Tables 3 and 4].

Because LDA and FDA were not designed for hypothesis testing or statistically rigorous inference, they do not natively produce a formal *P* value, confidence interval, or other metric of random effects. These methods are typically used for *ad hoc* applications of statistical learning or machine learning, often on ill-posed problems such as handwriting recognition [37].

The pFDA method inherits this weakness from its predecessor methods. As a consequence, classification by running the pFDA algorithm does not by itself offer a rigorous statistical test. Strictly applied, statistical standards would rule out the use of pFDA as the basis for scientific conclusions until a rigorous theoretical framework has been developed that can assess the quality of pFDA classifications.

In the absence of such a framework, we attempt here to estimate the statistical quality of pFDA with two available tools: posterior probabilities and empirical classification performance on known cases. An invocation of a pFDA classifier returns a list of the predicted probabilities of class membership for each of the test taxa to be classified. We denote as *P*_{2} the pFDA estimate of posterior probability that a point in the datasets of Fabbri *et al*. belongs to the class *D = 2*. Because each test point is classified for 100 random phylogenetic trees, the result for a single taxon is typically a list of *P*_{2} values of length 100. If the median value of the *P*_{2} list is greater than 0.5, Fabbri *et al*. classified the taxon as *D =* 2.

Fabbri *et al*. acknowledged that a 50% probability is an unusually weak criterion for assigning class membership [15: 859]:

“We summarised our results by providing the median value of those 100 posterior probabilities and the number of times a particular taxon is predicted as subaqueous forager (median probability of 50% or more). This gives us two proxies of the likelihood of each taxon to be an actual subaqueous forager. For instance, a taxon could be predicted 100 times as subaqueous forager with a median probability of 51% which means the evidence for this extinct species to be an actual subaqueous forager is very weak and this inference has to be considered very unlikely. Median probabilities need to be within the range of 80–100% to be considered as strong evidence of subaqueous forager.”

Because there are two classes, a classification probability of 0.5 is the accuracy we would expect from a random guess, such as flipping a coin. Normally, a result that is only infinitesimally better than random would be accorded little probative value. Nevertheless, this weak criterion was used for classification rather than the stronger values of *P*_{2}>0.8 or *P*_{2} = 1.0 that are suggested in the passage.

If *P*_{2} were an absolute probability, then *P*_{2} = 1.0 would indicate no possibility of misclassification. But *P*_{2} is *not* an absolute probability—instead it is a classification score that, at best, provides a possibly erroneous estimate of the *relative* probability of being in one class versus the alternative, conditioned on the prerequisite that the classes are multivariate normal distributions [37]. Furthermore, as used by Fabbri *et al*., *P*_{2} is not a single value but rather a list of 100 values from their Monte Carlo trials; it is fundamentally a statistical quantity. In this study, we build on this treatment of *P*_{2} as a statistical quantity by also including bootstrap trials, which explore the error due to finite sample size—*i*.*e*., statistical variation arising from the finite size of the training dataset.

Each *P*_{2} value is derived from the ratio of the probabilities given by the normal distribution describing each class, distributions that should have different centroids but the same variance. Test points are often distant from the centroids and thus often fall in the tail of the distribution for one or both of the classes. Tail probability estimates derived from a finite sample of data points can be uncertain, particularly in the case that the points are near or outside the edge of the points in the training dataset. Thus, the computation of *P*_{2} is extremely sensitive to the conformance of the datasets to the stated assumptions of being normal, having different means, and having the same covariance matrix.

Reporting a median value of a Monte Carlo experiment without a confidence interval, as Fabbri *et al*. did, is entirely out of keeping with conventional statistical practice. We report 95% confidence intervals, as is standard in many scientific disciplines.

Due to the statistical uncertainty in the value of *P*_{2}, the classification threshold should not be that the median *P*_{2}≥0.95 (a median-based criterion of the kind used by Fabbri *et al*.) but rather that the lower bound of the 95% CI on *P*_{2} must be greater than or equal to 0.95. This heuristic effectively requires 95% confidence that the classification is at least 95% correct. The threshold value for this heuristic lies “within the range of 80–100% to be considered as strong evidence” that Fabbri *et al*. propose, but it is implemented using the standard technique of the 95% confidence interval rather than the median.

In contrast, a correct interpretation of the criterion that the median *P*_{2}≥0.95 is that 50% of the time we should expect that there is more than 5% classification uncertainty. That weaker criterion is not possible to reconcile with conventional standards for statistical significance or confidence. Although one could argue for demanding 100% confidence that the classification is 95% correct, we did not use that approach because we felt that adherence to the commonly used 95% confidence interval is important.
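The difference between the median-based criterion and the lower-bound criterion can be shown with a short Python sketch. It uses a simple percentile interval as a stand-in, which is an assumption made for brevity; the confidence intervals in this study were computed with the BCa bootstrap algorithm. The function names and the *P*_{2} values are hypothetical.

```python
import math
from statistics import median

def percentile_interval(values, level=0.95):
    """Simple percentile interval (a stand-in for the BCa interval)."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    lo_index = math.floor(tail * (len(ordered) - 1))
    hi_index = math.ceil((1.0 - tail) * (len(ordered) - 1))
    return ordered[lo_index], ordered[hi_index]

def has_strong_support(p2_values, cutoff=0.95):
    """True if the lower bound of the 95% interval on P2 is at least cutoff."""
    lower, _upper = percentile_interval(p2_values)
    return lower >= cutoff

# Hypothetical P2 lists for two taxa (illustrative only).
p2_mixed = [0.97] * 70 + [0.80] * 30   # high median, but a wide spread
p2_tight = [0.97] * 100                # high median and a narrow spread

for name, values in (("mixed", p2_mixed), ("tight", p2_tight)):
    print(name, median(values) >= 0.95, has_strong_support(values))
```

Both lists pass the median-based criterion, but only the second passes the lower-bound criterion, because 30% of the first list falls well below 0.95.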

To be clear, this is the criterion for strong evidence, not the baseline classification criterion. It may seem that a higher *P*_{2} classification threshold for all classification (not just the strongest) would be a better choice, but the situation is more nuanced. Increasing the classification threshold does make for a more stringent criterion, but it also results in misclassification of a greater percentage of the training dataset (S1 Appendix, section 3).

*P*_{2} indicates the strength of the prediction for a particular taxon; the values and confidence intervals for *P*_{2} will vary from taxon to taxon. To assess pFDA classification performance overall, it is useful to evaluate how well the classification performs on known cases by assessing training data errors (misclassifications of the training set), a standard technique in the statistical and machine-learning literature. Because unknown data would be expected to result in more misclassification than known data points, training data error is generally considered to be an overly optimistic estimate of performance [39,48,49].

Fabbri *et al*. mentioned classification performance only in this passage [15: 856]:

“The correct classification rates of our phylogenetically flexible discriminant analyses ranges are 84–85% (femora) and 83–84% (ribs) (Figs 2 and 3, Supplementary Materials, Supplementary Tables 7–10). This increases to 90% in both datasets when excluding graviportal and deep diving taxa (Figs 2 and 3, Supplementary Tables 7–10).”

The Supplementary Tables 7–10 they cited in the passage do *not* contain correct classification rates, and the definition of “correct classification” is highly ambiguous because the work did not specify which of the multiple classification performance metrics were used (see S1 Appendix, section 4). The referenced tables contain median *P*_{2} values for the dinosaur test taxa, including the spinosaurids, but not for any taxa of known class in the training datasets. They therefore cannot be used as a basis for a correct classification rate. A defined metric of training data errors, known as accuracy and denoted here as *A* (Eq (6) in S1 Appendix), can be derived from output from Schmitz’s and Motani’s pFDA base-layer code [20]. Thus, it is plausible that Fabbri *et al*. used the accuracy metric *A* when they computed the 83–85% correct classification rates, but we cannot rule out the use of some other, undescribed metric.

Correct classification of 83–85% implies a misclassification rate of 15–17%. This reflects performance that seems, on its face, at least three times worse than the usual 5% threshold for random results in statistical methods. Such a result would normally be considered not statistically significant. However, the assessment of the error in classification is complicated by the fact that a classifier that makes a constant guess (*i*.*e*., *P*_{2} = 1.0 for all points, or *P*_{2} = 0 for all points) will be correct 50% of the time if the test taxa are equally distributed between the two classes. So will a classifier that makes random guesses. Yet neither a constant nor a random classifier would have any scientific value.

This effect suggests a useful thought experiment, in which we consider a mathematically equivalent case (with respect to overall classification performance) where the classification is completely random with probability *P*_{rand} and correct with probability 1−*P*_{rand}. In such a case, we can interpret accuracy *A* as an estimate of the probability that the classification is correct, *P*_{class}, so *A* = *P*_{class} = 0.5*P*_{rand}+1.0(1−*P*_{rand}), which reduces to

$$P_{\mathrm{rand}} = 2\,(1 - A) \tag{4}$$

Applied to the case above with *A* = 0.85—an accuracy of 85%, comparable to that claimed by Fabbri *et al*.—we find that *P*_{rand} = 0.3. Thus an 85% “correct classification rate” means that the classification is mathematically equivalent in performance to the classification being random 30% of the time and correct 70% of the time. This is *six times* the conventional threshold of 5% for the effect to be due to randomness. Such a result would not typically be considered strong enough to warrant a scientific conclusion.

Heuristically, the conventional threshold of 5% can be cast as *P*_{rand}≤0.05, which is equivalent to *A*≥97.5% by Eq (4). Because training set *A* is an overly optimistic measure of classification performance, this still is a very loose criterion. Eq (4) provides an important heuristic, which we use in this study to assess the degree of randomness in classification. However, in all cases we present the actual numerical values of the 95% CI on *A* and *P*_{rand}, as well as two other classification metrics, *B* and *MCC*, that are defined in S1 Appendix.
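The conversion between accuracy and *P*_{rand} follows directly from the relation *A* = 0.5*P*_{rand}+1.0(1−*P*_{rand}) stated above, and it can be tabulated with a short Python sketch. The function names are hypothetical, and the accuracies in the loop are chosen to span the rates discussed in the text.

```python
def p_rand_from_accuracy(accuracy):
    """Solve A = 0.5 * P_rand + (1 - P_rand) for P_rand."""
    return 2.0 * (1.0 - accuracy)

def accuracy_from_p_rand(p_rand):
    """Accuracy of a classifier that guesses at random with probability p_rand."""
    return 0.5 * p_rand + 1.0 * (1.0 - p_rand)

for accuracy in (0.83, 0.85, 0.90, 0.975):
    print(accuracy, round(p_rand_from_accuracy(accuracy), 3))
```

An accuracy of 0.975 is the smallest value that brings *P*_{rand} down to the conventional 0.05.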

The accuracy *A* simply tallies up incorrect classifications and divides by the size of the training set (Eq (6) in S1 Appendix). Many classifiers make a systematic distinction between false-positive errors—which classify class 1 datapoints in the training set as class 2—and false-negative errors, which make the inverse mistake. That difference, and many other factors, introduce complications in characterizing classifier performance. The development of classification metrics for statistical classifier algorithms such as pFDA is a very active area of statistical research, with practical applications in areas such as medical diagnostics. Although the topic is beyond the scope of the present work, S1 Appendix sections 2–4 introduce the basics.

Adding an unfortunate complication, we discovered a flaw in the pFDA code that systematically misstates the confusion matrix from which classification performance is measured (S1 Appendix, section 5) by reporting a matrix that is the transpose of the confusion matrix, as it is typically laid out in the literature. Our replication attempts produce classification rates slightly different from those reported by Fabbri *et al*. This issue may be why they do not match exactly.
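A small sketch may help show what a transposed confusion matrix does and does not change. The counts are hypothetical and are not taken from either study. The layout (rows are true classes, columns are predicted classes, class 2 is "positive") is an assumption, and `balanced_accuracy` is the standard balanced accuracy, which we only assume is comparable to the metric *B*; the definitions in S1 Appendix take precedence.

```python
import math


def classification_metrics(cm):
    """Metrics for a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]].

    Assumed convention: rows are true classes, columns are predicted classes,
    class 1 is negative and class 2 is positive.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    sensitivity = tp / (tp + fn)  # class 2 taxa recovered as class 2
    specificity = tn / (tn + fp)  # class 1 taxa recovered as class 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": 0.5 * (sensitivity + specificity),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }


def transpose(cm):
    """Swap rows and columns, as a mislabeled matrix would."""
    return [list(column) for column in zip(*cm)]


hypothetical_cm = [[40, 10], [2, 20]]  # illustrative counts only
as_intended = classification_metrics(hypothetical_cm)
as_transposed = classification_metrics(transpose(hypothetical_cm))
for name in as_intended:
    print(f"{name}: {as_intended[name]:.3f} vs {as_transposed[name]:.3f}")
```

Under these assumptions, overall accuracy and *MCC* are unaffected by the transposition, whereas the false-positive and false-negative counts swap and any class-specific rate derived from them shifts.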

To judge the performance of classification in this study, we employ two heuristics. One method is to inquire whether the lower bound of the 95% CI for *P*_{2} is above 0.95. That tells us whether the prediction for a single taxon has strong support. This heuristic is predicated on the assumptions of pFDA that the classes are normally distributed with different means and the same variance.

The second approach empirically measures how often the classifier correctly or incorrectly classifies its own training dataset, quantifying its success with metrics such as the accuracy *A* and others (*B* and *MCC*) described in S1 Appendix. We then convert those results to the heuristic metric *P*_{rand} (Eq (4)), the probability that the classifier acts randomly. *P*_{rand} is an overall metric of the classifier, specific not to a particular taxon but to the entire set of taxa in the training set. By characterizing the performance of the classifier on known cases, *P*_{rand} helps calibrate the confidence we should have when using the classifier to extrapolate unknown cases.

Having two different approaches raises the question of how they interact with each other. Unfortunately, the answer awaits further research in statistics. Uncertainties of this kind are the price one pays for attempting to use a statistical method that was never intended to provide the primary statistical evidence for a scientific conclusion.

## Results and discussion

Our examination of the analysis of bone compactness to infer spinosaurid behavior included a critical assessment of several aspects of this methodology. We identified a number of substantive issues that constrain the inferential utility of the method, ranging from logical and statistical problems with regressions based on the *Cg* metric to accounting for uncertainty in those measurements that arises from quantification techniques and biological variation within and among specimens. We describe these issues, along with results from our attempted replication of *Cg* measurements reported by Fabbri *et al*., in the following subsections.

We also identified more general issues with the application of pFDA to data of this kind and to inferences about the behavior of extinct taxa such as dinosaurs. Additional subsections below present our findings on the effects of training-set sample size and selection criteria and demonstrate how researchers can test whether training sets meet the distributional requirements of the pFDA method, again focusing on the recent study of Fabbri *et al*. as a noteworthy example.

### *Cg* and the bone ballast hypothesis

One of the two independent variables in the datasets analyzed by Fabbri *et al*., as well as in this study, is global compactness *Cg*, a longstanding numerical metric of bone microanatomy describing the amount of bone present in a given cross-sectional slice. Because of its effects on buoyancy, bone density is the primary biological parameter of interest, and as *Cg* correlates to bone density, it has been widely used as a proxy for density in the literature [18,50,51]. Fabbri *et al*. used “*Cg*” interchangeably with “bone density” in their study.

It is worth noting that “bone density” in this context refers to the density of whole bones, not to the density of the bone material itself. For example, *Cg* and whole bone density are both low in flying birds and bats, whereas the actual bone material when studied in isolation is quite dense [51].

The *Cg* metric is only one of many available to capture bone microanatomy. Over the last decade, other metrics generated by the Bone Profiler program have been shown to better correlate with lifestyle in studies of extant amniotes than *Cg* does [52,53]. The additional metrics were not used by Fabbri *et al*. and are beyond the scope of the present study.

Fabbri *et al*. signaled their interest in bone density as a marker of lifestyle when they stated that increased bone density “results in increased body density, facilitating buoyancy control during subaqueous immersion related to either submerged aquatic foraging (for example, in underwater pursuit divers), concealment or refuge” [15]. The idea that bone density can act as ballast helpful to certain secondarily semiaquatic taxa is well-studied in the literature, where it is sometimes termed the “bone ballast hypothesis” [54]. However, Fabbri *et al*. misstated and oversimplified the long literature on the bone ballast hypothesis in the quote and elsewhere in their study.

Increase in bone density occurs by pachyostosis, which involves an increase in dense peripheral bone deposits, and/or by osteosclerosis, which involves an increase in bone deposition toward the center of the medullary cavity of long bones [50,54]. The potential advantages for semiaquatic and fully aquatic tetrapods are known to depend greatly on lifestyle: denser bones lead to a denser body, which can facilitate diving and compensate for larger lung capacity, but the increased mass also makes animals less maneuverable [50,54]. As Taylor summarizes [54],

These features are useful for slow swimmers and shallow divers, such as feeders on benthic plants and invertebrates. Examples are sirenians, primitive sauropterygians (“nothosaurs”), placodonts, and the sea otter *Enhydra*.

Taylor and other researchers [50] have found that lifestyles other than those noted in the quote above are *not* compatible with increased bone density, as evidenced by the fact that increased bone density is typically not found in fast swimmers or pursuit predators.

The statement by Fabbri *et al*. conflates behaviors in which increased bone density does offer an advantage—*i*.*e*., “buoyancy control during subaqueous immersion”—and behaviors in which it may or may not apply (“concealment and refuge”) with “underwater pursuit divers,” which the literature makes clear are *not* helped by increased density and indeed are found to have lower density and *Cg*. This conflation is directly contradicted by Taylor, as well as by Houssaye’s review [50], which Fabbri *et al*. erroneously referenced in support of their position.

The relationship between bone density, or its proxy metric *Cg*, and semiaquatic or fully aquatic taxa via the bone ballast hypothesis is thus not simple [44,52,55,56]. The fully aquatic sirenians *Dugong dugon* (*Cg* = 0.994 in ds2) and *Trichechus manatus* (*Cg* = 0.977 in ds2) have very dense bones, which reduce energy expenditure while foraging on underwater vegetation. The sea otter *Enhydra lutris* (*Cg* = 0.908 in ds2) must dive for shellfish, which rarely require pursuit. The lower *Cg* of Bryde’s whale *Balaenoptera brydei* (*Cg* = 0.611 in ds2) is consistent with its fast pursuit of prey, and the semiaquatic seal *Phoca vitulina* has an even lower value (*Cg* = 0.436 in ds2) [44,50,52,54]. Fabbri *et al*. assigned all these example taxa to the *F0D2* class, despite their important lifestyle differences.

#### Regression analysis to explain *Cg*.

Fabbri *et al*. performed a PGLS regression analysis to estimate how well values of the categorical variables *F* and *D* explain *Cg* in their datasets [15: Table 1]. One might expect *Cg* to be the independent variable and *D* the dependent variable to interrogate whether *Cg* predicts lifestyle. Indeed, the subsequent pFDA uses both *MD* and *Cg* as the independent variables; lifestyle (*i*.*e*., class membership) and thus *F* and *D* are implicit dependent variables. For reasons that are not explained, the PGLS regression does the opposite: it treats *D* as an independent variable and *Cg* as the dependent variable. Correlations in a linear regression apply in either direction, so we find no substantive impact of this choice.

Fabbri *et al*. reported statistically significant but very weak correlations in both femoral and rib datasets between *Cg* and their subaqueous foraging category *D* = 2. In the femoral data, *Cg* values range from 0.279 to 0.989, a difference of 0.71. With a coefficient of determination *R*^{2} = 0.172, we would expect about 17.2% of the total variation in *Cg*—or an absolute difference of 0.122 in *Cg* across the full range of values—to be attributable to “subaqueous foraging.” In the rib data, *Cg* ranges from 0.242 to 0.998, a difference of 0.756, and *R*^{2} = 0.108, with subaqueous foraging explaining about 10.8% (0.082) of the total variation in *Cg*. One possible reason that the effect is small is that the datasets they used for the analysis were not well chosen to test the bone ballast hypothesis, as both datasets grouped together high-*Cg* and low-*Cg* diving taxa.
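The arithmetic behind these figures can be reproduced with a short sketch. It follows the back-of-the-envelope scaling used in the text, multiplying *R*^{2} by the observed range of *Cg*; strictly, *R*^{2} is a share of variance rather than of range, so the result is a rough guide only. The function name is ours.

```python
def cg_range_explained(cg_min: float, cg_max: float, r_squared: float):
    """Return the Cg range and the portion of it scaled by R^2 (a rough heuristic)."""
    cg_range = cg_max - cg_min
    return cg_range, r_squared * cg_range


# Values quoted in the text for the femoral and rib datasets.
datasets = {
    "femur": (0.279, 0.989, 0.172),
    "rib": (0.242, 0.998, 0.108),
}
for name, (cg_min, cg_max, r2) in datasets.items():
    cg_range, explained = cg_range_explained(cg_min, cg_max, r2)
    print(f"{name}: range = {cg_range:.3f}, R^2 = {r2:.3f}, explained = {explained:.3f}")
```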

Fabbri *et al*. interpreted their regression results as confirmation that “frequent subaqueous foraging is associated with increased femoral and rib density across amniotes.” Our results show that this greatly overstates the case and contradicts the literature. Prior studies have made it clear that the bone ballast hypothesis is not some sort of universal law of nature across amniotes but has many exceptions, and that a number of other ecological and lifestyle factors may play roles in increased bone density [44,46,50,52,54–56].

In addition to the examples outlined above and in the literature, detailed study within some lineages has shown that the bone ballast hypothesis has its limits [54]. For example, within talpid moles, which include fossorial (burrowing) as well as terrestrial and semiaquatic forms (including semiaquatic desmans in the datasets used by Fabbri *et al*.), no correlation has been found between lifestyle and *Cg* measurements from the humerus [57]. It is thus misleading for Fabbri *et al*. to characterize the regression result as empirical evidence for a general bone ballast rule across amniotes. The dataset is not sufficiently comprehensive, nor are there tests for semiaquatic or aquatic clades that might violate the rule, examples of which Fabbri *et al*. included in their datasets. Simply because an aggregate characteristic of a group (such as the regression result) holds does not imply that one can draw a conclusion about every member of the group—doing so is an example of the ecological fallacy (see S1 Appendix, section 1).

A lesson for future studies is that great care must be taken when drawing sweeping conclusions, particularly if they are contradicted by the available literature or miss large groups that are central to the analysis.

#### Unsupported inferences about subaqueous foraging.

The essence of the pFDA method is that each member of a class must share a property, as detailed in Materials and methods. If the members of the class do not actually share the property of the class, valid inference from the classification is limited. Fabbri *et al*. claimed that *F0D2* is the class of nonflying animals that practice subaqueous foraging. Our examination finds that this is not the case.

Their paper does not formally define “subaqueous foraging” but distinguishes it from other aquatic lifestyles [15]:

Secondary adaptations to aquatic lifestyles, such as wading behaviour (shoreline specialist and/or only partially submerged habit), subaqueous foraging (fully submerged behaviour) and deep diving, evolved multiple times in every major amniote group.

In context, the term appears to mean foraging while fully submerged, in contrast to a shoreline-oriented terrestrial or wading species that is only partly submerged while foraging. The submergence, or lack thereof, clearly applies to the forager, rather than to the prey or plants being eaten. A subaqueous forager is thus either a habitually diving predator in pursuit of underwater prey, such as an otter or seal, or a habitually diving herbivore that feeds on underwater plant resources, such as a manatee. Essentially any foraging that occurs fully underwater seems to be included.

Yet the datasets that Fabbri *et al*. presented [15: Table 2] include taxa in the subaqueous forager category that *do not* forage underwater, such as the common hippo (*Hippopotamus amphibius*) [58], pygmy hippo (*Choeropsis liberiensis*) [59], common tapir (*Tapirus terrestris*) [60], Malayan tapir (*Tapirus indicus*) [61], beaver (*Castor fiber*) [62], and European water vole (*Arvicola amphibius*) [63]. Although each of these taxa has secondary semiaquatic adaptations to aquatic habitats, they nevertheless forage substantially—in some cases exclusively—on land and above water. These species habitually enter aquatic habitats as refugia to avoid predators, for thermoregulation, or for other reasons not related to foraging.

In a previous preprint [64], we challenged the classification of hippos and tapirs as subaqueous foragers because they do not forage appreciably underwater. Fabbri *et al*. responded that the term “subaqueous foraging” meant habitual “subaqueous submersion” [16]:

Although habitual submersion, as epitomized by the frequent use of subaqueous foraging, is only one functionally important aspect of aquatic behaviour, it is the key aspect that we hypothesized as having a functional relationship to bone density.

This clarifies that the shared property used in their study to assign taxa to the *F0D2* class was frequent full submersion, regardless of diet, predatory behavior, or even foraging at all. One surprising result from our inquiry is therefore that the pFDA analysis of Fabbri *et al*. is unable to infer anything about spinosaurid foraging, and that the conclusions about spinosaurid predatory behavior in that study are unsupported.

In their preprint, Fabbri *et al*. acknowledged that their datasets include taxa that do not forage underwater, but they claimed that “these exceptions are strictly related to a specific diet: herbivory” [16]. However, we found that their training datasets include the American mink (*Neogale vison*) [65] and Pyrenean desman (*Galemys pyrenaicus*) [66], both of which have carnivorous diets and eat terrestrial prey—almost exclusively in the case of mink—as well as foraging underwater. Both were included in the *F0D2* femoral and rib datasets. The American alligator (*Alligator mississippiensis*) [67,68] and Nile crocodile (*Crocodylus niloticus*) [69] were also misclassified as subaqueous foragers by Fabbri *et al*., despite ample evidence that adult diets of both species consist of mostly terrestrial prey. Though these large alligators and crocodiles use submersion for concealment while stalking animals on the shore, their lunges above the water to capture prey are clearly not “fully submerged behavior” [67,70–75]. These species take prey both in the water and out of it, to differing degrees, but they are not clear exemplars of “subaqueous foraging.” Meanwhile, related crocodilian species that better represent obligate subaqueous foraging, such as the gharial (*Gavialis gangeticus*) [76], were not included by Fabbri *et al*.

Crocodiles also illustrate the complex role of ontogeny in functional assignment, as they grow by orders of magnitude and often exhibit dietary change as they mature [73,77,78]. Crocodilians are not fast pursuit predators and instead tend to be lunging ambush predators [79]. They may be insectivores while very small hatchlings, submerged piscivores at moderate size, and then as large adults transition to a diet that includes terrestrial prey. A scheme that does not specify ontogenetic stage cannot correctly classify such species.

To extend the bone ballast hypothesis broadly, one must understand where different ontogenetic stages fit. Currently it is unclear whether we should expect relevant species to show increased bone density (and thus *Cg*) at all stages of their life history, or only as adults. In the latter case, it will be necessary to verify that data was gathered from specimens at the same ontogenetic stage.

This issue is particularly salient for making inferences about spinosaurids because ontogenetic dietary niche partitioning has also been identified in theropod dinosaurs [80–82]. Like crocodilians, predatory dinosaurs spanned a similar or possibly even larger range of body size from hatchling to adult, and they almost certainly accessed a range of size-appropriate prey [80]. Spinosaurids, a group that includes one of the largest theropod dinosaurs yet discovered, are likely to have sought a sequence of preferred ecological niches during ontogeny [13]. It is also worth noting in this context that the neotype of *Spinosaurus* is thought to be an immature specimen that is substantially smaller than other specimens presumed to be fully grown [14].

Equally important to a discriminant analysis is appropriate selection of a sufficient number of representative taxa for the control group that does not show the behavior of interest—the *F0D0* and *F0D1* classes, in the study of Fabbri *et al*. The *F0D1* group contains just 2 taxa, too few for pFDA or any robust statistical analysis. The *F0D0* group omits many large terrestrial species that capture aquatic prey just under the water surface without being submerged themselves, such as brown bears, black bears, and wolves, all of which prey on swimming salmon [83–86]. Jaguars hunt caiman and capybara both above and below water [87–89]. Taxa such as these would seem a good fit for inclusion as nondiving predators that rely on aquatic prey as a substantial, or even critical, component of their diet [90].

Eagles, ospreys, and other raptors—as well as many other birds such as skimmers and egrets—similarly forage while flying by grabbing fish from under the water surface [91–94]. Herons, storks, egrets, and cranes also forage while standing in shallow water or shoreline perches, plunging their head underwater to capture fish and other aquatic prey [95–97]. This model has been proposed for *Baryonyx* [4] and, more recently, for *Spinosaurus* [13]. A token two examples of taxa that forage in this manner are included in the pFDA training datasets to represent “wading or only partially submerged” foraging behavior.

The *F0D2* dataset, in contrast, includes a wide variety of different foraging styles, including slow-swimming aquatic herbivores, predators of stationary aquatic prey (such as the mollusk-eating sea otter *Enhydra lutris*), fast-swimming pursuit predators, and semiaquatic herbivores and carnivores that do not forage underwater. Though all are classified as frequent divers, that interpretation seems odd in some cases, such as hippos, which do not swim but rather stand in shallow water or walk along the bottom [98].

Fabbri *et al*. offered the following justification for this mix of semiaquatic animals and foraging [15]:

Previous studies applied different categorizations for the characterization of aquatic lifestyles among extant and extinct taxa: ‘aquatic’ and ‘semiaquatic’ were used contra ‘subaqueous foraging’ applied in this study. Our ecomorphological attribution is focused on a specific behaviour linked to an ecology, rather than a categorization of its entirety. We find our categorization to be more accurate: for example, previous studies coded penguins and cetaceans as aquatic, while crocodilians were stated as semiaquatic. Whereas penguins and crocodilians are still ecologically dependent on terrestrial environments (for example, for laying eggs), cetaceans are completely independent from land. On the other hand, all these clades engage in subaqueous foraging. Therefore, our ecological attribution is in agreement with previously applied ecological categories, but do not exclude dependency to terrestrial environments to satisfy autecological requirements, such as reproductive behaviour.

Essentially no evidence was presented to support the utility or “accuracy” of substituting “subaqueous foraging” in place of the more traditional characterizations “semiaquatic” and “aquatic.” The examples we have cited above of animals that dive but do not forage underwater and those that forage underwater without diving show this proposition to be false. Their assertion that “all these clades engage in subaqueous foraging” also overstates the case—the datasets comprise exemplar taxa, not clades. Depending on how broadly one construes the clade for each taxon, most of the clades include members that are terrestrial.

We find that even if the analysis were otherwise correct—and below we present further evidence that it was not—the strongest inference one could draw from the *F0D2* classification with regard to *foraging* is that *Spinosaurus* and *Baryonyx* had a statistical affinity with a group of animals that have semiaquatic or aquatic adaptations and display a wide gamut of foraging styles in and out of the water. One could infer that these spinosaurids fully submerged themselves but may not have been able to swim—although further evidence presented in the next subsection contradicts the possibility of full submersion. Such a vague and tenuous inference seems of little import to the controversy over spinosaurid ecology because it hardly improves on the semiaquatic and piscivorous adaptation that has long been suggested for spinosaurs.

We also find from the application of the bone ballast hypothesis, as described in the literature, that the high *Cg* found in *Baryonyx* and *Spinosaurus* suggests that they probably were not fast-pursuit predators, because such taxa do not typically have high *Cg*.

A key lesson for future studies is that it is of paramount importance that exemplar groups in training datasets actually possess the features that they are claimed to have. Further care must be taken so that the interpretation of the statistical results respects both the dataset composition and the theoretical justification behind it.

#### Bone ballast versus axial pneumaticity.

The bone ballast hypothesis focuses on the role that ballast has in certain groups and lifestyles for secondarily semiaquatic and aquatic amniotes. However, the hypothesis is based on analysis of extant taxa that lack skeletal axial pneumaticity, except for birds [99]. In this subsection, we present results of our investigation into the important question of how the bone ballast hypothesis applies to animals that have significant pneumaticity, a question highly relevant to inferences about spinosaurids.

Pneumaticity in theropods, including *Spinosaurus*, has a strong effect on body density because pneumatic invasion replaces soft tissue or bone, which has density ranging from 1.0 to 1.2 g/ml, with air that is a thousand-fold less dense—about 0.0012 g/ml at sea level and 15°C. Cancellous or dense bone infilling, by comparison, replaces soft blood vessels/marrow of density near that of water (1.0 g/ml) with bone of only slightly greater density (~1.2 g/ml). Pneumaticity in the axial and appendicular bones in theropods (including birds) thus *increases* buoyancy by roughly 5 to 6 times more than a comparable volume of “dense” bone *decreases* it.
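The "5 to 6 times" figure follows directly from the densities quoted above, as this minimal sketch shows. It compares equal volumes only and uses the paragraph's approximate densities; the variable names are ours.

```python
AIR = 0.0012          # g/ml, air at sea level and 15 °C (value from the text)
WATER_LIKE = 1.0      # g/ml, marrow, blood vessels and the lighter soft tissues
DENSE_TISSUE = 1.2    # g/ml, bone and the upper bound for replaced tissue

# Density lost per unit volume when air replaces tissue or bone.
pneumatic_loss_low = WATER_LIKE - AIR
pneumatic_loss_high = DENSE_TISSUE - AIR

# Density gained per unit volume when bone infills marrow-filled space.
infill_gain = DENSE_TISSUE - WATER_LIKE

ratio_low = pneumatic_loss_low / infill_gain
ratio_high = pneumatic_loss_high / infill_gain
print(f"pneumaticity outweighs bone infilling by {ratio_low:.1f}x to {ratio_high:.1f}x")
```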

Studies of pneumaticity in birds as a correlate to lifestyle show that pneumaticity is positively correlated with body mass in flying birds; heavier birds have higher pneumaticity [100]. Pneumaticity has been lost in multiple lineages of diving birds [99]. Smith performed phylogenetic regressions to show that pneumaticity is strongly correlated with lifestyle among water birds [101]. Pelicans feed primarily at the surface, without complete submersion or with only shallow diving, and they are not found to be apneumatic in any of the analyses. They are, in fact, highly pneumatized [99]. However, there is a strong correlation between decrease or loss of pneumaticity and pursuit diving in birds, a correlation seen both in flying taxa (loons, grebes, darters) and in flightless species, such as penguins [101].

Loss of pneumaticity, or reduction in its degree, increases body density and acts as ballast, a fundamental biomechanical effect compatible with the bone ballast hypothesis. However, density acquired through loss of pneumaticity may reduce or obviate the need for bone density increase from denser long bones and ribs. The lesson from birds is that the bone ballast hypothesis is most strongly observed in reductions of pneumaticity rather than pachyostosis or osteosclerosis of long bones and ribs.

Pneumaticity in birds is relevant to spinosaurids because spinosaur bone structure provides ample evidence of vertebral pneumaticity, which would supersede any ballast effect from variable infilling of long bones [8,13,14]. Spinosaur fossils also exhibit large medullary cavities (presumably filled with fat during life) that hollow the centra at the base of the tail and would have further reduced bone density [14]. The internal volume of cervical pneumaticity (~25% by volume) is well documented in *Spinosaurus* [102], with evidence that the entire dorsosacral column is pneumatized (Fig 3). In *Suchomimus* and *Baryonyx*, most precaudal vertebrae have internal pneumatic chambers (camerae) within the centra and deep fossae likely for pneumatic diverticulae on the neural arch (Fig 3B–3D).

(A) *Suchomimus tenerensis* (MNBH GAD500) precaudal column and pelvic girdle showing pneumatic features in (B) D2 in lateral view with coronal (B1) and axial (B2) CT cross sections, (C) D13 in lateral view with axial (C1) and sagittal (C2, 3) CT cross sections, and (D) S2 in ventral view with axial (D1) and coronal (D2) CT cross sections. (E) *Spinosaurus aegyptiacus* precaudal column and pelvic girdle showing pneumatic features in (F) ~D2 in lateral, anterior, and dorsal views with coronal (F1, 2) and axial (F3) CT cross sections, (G) ~D6 in dorsal and lateral views showing coronal (G1) and axial (G2, 3) CT cross sections, (H) ~D8 in dorsal and lateral views with axial (H1) and coronal (H) CT cross sections, and (I) S3 centrum in ventral and lateral views with coronal (I1) and axial (I2) CT scan sections. Neotypes FSAC-KK-11888 (panels G, H, I) and BSPG-2006-I-54 (panel F). CT section lines are color-coded by orientation (*magenta*, coronal; *blue*, axial-horizontal; *black*, sagittal/parasagittal). Scale bars are 10 cm. Abbreviations: bs, bony septum; c, cervical vertebra; cmr, camera; d, dorsal vertebra; for, foramen; fos, fossa; nc, neural canal.

Precaudal vertebral pneumaticity is present in *Spinosaurus* to an even greater degree than in its baryonychine relatives *Baryonyx* and *Suchomimus*. The pneumatic foramina and camerae in the anterior dorsal vertebrae (Fig 3F) are larger than in *Suchomimus*, and mid-dorsal centra have marked, oval pneumatic fossae that reduce intervening bone to a thin sagittal septum (Fig 3G and 3H). Similarly, midsacrals have large pneumatic foramina and internal camerae (Fig 3I).

The bone ballast hypothesis relies on bone density influencing overall body density. In a mammal or reptile, it may be reasonable to infer a trend from a sample of rib and/or femur density (or a proxy such as *Cg*), if one assumes that the sampled bone’s density is representative of a trend followed by other skeletal elements. In a bird or dinosaur with vertebral pneumaticity, however, this is not the case. The contribution of the skeletal elements to overall body density depends on both their density and their volume. The impact of the air sacs involved in pneumaticity, for example, depends on the total volume of the air sacs. One cannot infer the degree of pneumaticity by sampling skeletal elements in which it is not present. Even sampling bones that do exhibit pneumaticity does not allow computation of the buoyancy effect unless the total volume of those bones is also measured or estimated.

The quantitative impact of vertebral pneumaticity in *Spinosaurus* is so strong that calculations of body density from 3-D flesh models have found specimens of this taxon to be unsinkable [8,14]. In water, the buoyancy of the air sacs and pneumatic diverticulae would exert an upward force so strong that not only would it exceed any plausible ballast effect of dense ribs and femurs, but it also could not plausibly have been overcome by thrust generated from the tail and/or limbs [14]. *Spinosaurus* could not have fully submerged.

If the evolutionary pattern found from the analysis of multiple clades of extant diving birds—that fast-swimming pursuit diving is correlated with reduced pneumaticity [101]—holds for theropod dinosaurs as well, then the extensive vertebral pneumaticity in *Spinosaurus* can be seen as evidence to reject the fully aquatic pursuit predator hypothesis.

By focusing solely on *Cg* in femora and ribs, the analysis of Fabbri *et al*. was effectively blind to vertebral pneumaticity, the most important factor for the bone ballast hypothesis in birds and dinosaurs and a key difference that distinguishes these groups from mammals and reptiles. We find that the omission of pneumaticity causes classification by femoral or rib data alone to be misleading, and any inferences drawn from such analysis to be invalid.

Integrating pneumaticity into a future study would be possible, in principle. Quantitative data on pneumaticity is available for many taxa of extant birds [100], and studies have started tracking pneumatic diverticulae via CT scans [103,104]. Integrating pneumaticity into a *Cg*-based study of the bone ballast hypothesis could be difficult in practice, however. If *Cg* has a direct correlation to body density, it can be used as a proxy—but only if other contributing factors to body density (flesh density, lung volume, etc.) do not confound the correlation, as is generally thought to be the case for reptiles and mammals. In any animal that has significant skeletal pneumaticity, the confounding effect of the pneumaticity is strong and not captured by *Cg*, even if *Cg* is measured for the skeletal elements in question. Instead of relying on *Cg* as a proxy for body density, one would need to quantify the relative impact on buoyancy of increased *Cg* in some skeletal elements—including long bones and ribs, but possibly others as well—while also accounting for the total volume of the bones as compared to the total volume of air sacs. Such a calculation is difficult to make because it depends on accurately estimating the volumes of bones, flesh, and air sacs. Rather than simply measuring *Cg* in femurs and ribs, a study that accounts for pneumaticity would need to make detailed, accurate 3-D models of the entire skeletal, flesh, and air volume structures for every taxon in the dataset.
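The kind of volume-weighted calculation described above can be sketched as follows. Every volume here is a made-up illustrative value for a toy animal, not a measurement or estimate for any taxon, and the component densities are round numbers assumed for the example. The point is only that whole-body density depends on both the density and the volume of each component.

```python
def body_density(components):
    """Volume-weighted mean density from (volume_litres, density_g_per_ml) pairs."""
    total_volume = sum(volume for volume, _ in components)
    total_mass = sum(volume * density for volume, density in components)
    return total_mass / total_volume


SOFT, BONE, AIR = 1.05, 1.2, 0.0012  # g/ml, assumed round values

# Hypothetical toy animal with a total volume of 1000 litres.
baseline = [(850.0, SOFT), (100.0, BONE), (50.0, AIR)]
# Same animal with 10 litres of soft tissue replaced by bone.
denser_bones = [(840.0, SOFT), (110.0, BONE), (50.0, AIR)]
# Same animal with 10 litres of soft tissue replaced by air spaces.
more_air = [(840.0, SOFT), (100.0, BONE), (60.0, AIR)]

for label, model in (("baseline", baseline),
                     ("denser bones", denser_bones),
                     ("more air", more_air)):
    print(f"{label}: {body_density(model):.4f} g/ml")
```

In this toy model an equal volume of added air shifts whole-body density several times further than the added bone does, which is why a *Cg* measurement alone cannot stand in for body density when pneumaticity is present.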

#### Body mass confounds classification based on *Cg*.

We examined the possibility that lifestyles and body characteristics other than buoyancy may act as confounding factors in a classification based on bone compactness, biasing the results if they are not properly controlled in the statistical analysis. Our review of relevant literature found several plausible confounders. Burrowing animals may have increased *Cg*, particularly in limbs used for digging [105]. Some arboreal groups, such as sloths, also have increased *Cg* [106]. More relevant to spinosaurids, body mass has also been associated with bone density. Studies of large-bodied terrestrial taxa often have reported increased *Cg* [52,53,55,56,107–111].

The effect of large body size was well considered in the common hippo by Houssaye *et al*. [52]:

However, it is difficult to determine whether the pattern observed in *Hippopotamus* reflects its graviportal limbs or the benefit of a slight increase in bone mass in its legs enabling their use as ballast and offering stability in water. As a result, both adaptations might be mistaken, or even synergistic, and it seems almost pointless to try to unravel their evolutionary integration. Adaptation to a graviportal limb morphology should thus be taken into consideration when analyzing possibly amphibious taxa displaying a terrestrial-like morphology, and thus notably in the study of the early stages of adaptation to an aquatic life in amniotes.

*Spinosaurus* and other spinosaurids are in the top tier of body mass among theropods [14], so the potential for increased *Cg* as a consequence of large body size must be considered as a viable alternative to the bone ballast hypothesis as an explanation of the observational data. Future investigations could add a separate categorical variable for large body mass to see whether that improves the classification of test taxa, for example. But the admonition above to take adaptation to large body size into consideration was explicitly *not* heeded by Fabbri *et al*., as their approach forced large-bodied taxa into either the *F0D0* or *F0D2* classes—*Hippopotamus* was assigned to the latter.

Large-bodied taxa, such as the African elephant *Loxodonta*, which has *Cg* comparable to *Baryonyx*, and other extant (Asian elephant, rhinoceros) and extinct (mammoth, extinct hippos) mammals, were included in ds1 and ds2 but were not placed in a class of their own to facilitate comparison or control of the confounding factor. Most large-bodied taxa were purged from the ds3 and ds4 training datasets under a flawed rationale, which we examine in the next subsection. Large-bodied dinosaurs in their *Cg* datasets were classified as “*D* = unknown,” thereby excluding them from the training set; all non-avian dinosaurs were used only as test taxa.

The analysis may also have been confounded by inclusion in the datasets of many taxa that are small, even minuscule: the smallest have femoral diameters <1 mm and body masses ≤7 g. *Spinosaurus* achieved masses approximately 10^{6} times larger [14]. In ds1, the median femoral diameter is 12.08 mm; half of the dataset has a *smaller* diameter. The ds1 exemplar taxon with the femoral diameter closest to the median is *Taxidea taxus*, the American badger. Typical body mass of this taxon is 6–9 kg; *Spinosaurus* weighed roughly 1000 times more. No argument was provided to justify the use of such small-bodied taxa as biomechanical exemplars for spinosaurs.

Disparities in *MD* between classes, which can influence classification, are notably large in their study. The median *MD* across the femoral *F0D0* class is 11.5 mm (*Meles meles*, the European badger); that is only about 60% of the median *MD* of the *F0D2* class, which is 19.1 mm (*Neusticosaurus*, an extinct pachypleurosaur). In an LDA analysis, such a disparity would strongly bias the decision boundary toward lower *Cg* values for taxa that have *MD* greater than the centroid values. We examined whether the adjustment for phylogenetic bias by pFDA mitigates this bias and found that it does not, as detailed in the next subsection.

Another possible confounding factor for the bone ballast hypothesis is ballast by other means—such as ingesting gastroliths, a behavior known to occur in crocodilians, plesiosaurs, and possibly others [54,112–114]. If swallowed stones provide ballast, skeletal modifications may not be adaptive for diving predators. Evidence exists that some clades of nonavian dinosaurs, including theropods, used gastroliths [115,116]. But the more salient confounding influence would be the presence of gastrolith-dependent taxa in the training sets [116]. Further consideration of gastroliths is beyond the scope of the present study.

We find that the combined impact of the limitations described above leaves increased *Cg* in *Spinosaurus* unexplained: it may be a secondary semiaquatic adaptation (under the bone ballast hypothesis), a consequence of its large body size and/or body mass, or conceivably a combination of the two, as with hippos. In limiting their analysis to the classes “subaqueous foraging” versus not and ignoring plausible confounders such as pneumaticity, body mass, and bone strength or stiffness—considered a biomechanical correlate to bone density in the literature for other taxa [117–119]—Fabbri *et al*. failed to account for other possibilities.

Any future study seeking to use the bone ballast hypothesis for clades of dinosaurs which have pneumaticity must address the confounding effect of body size on *Cg*. Other possible confounding factors should also be investigated and tested via statistical methods to confirm that they are not the cause. The taxa chosen should not be widely disparate between classes in key biomechanical attributes, including body size—especially if a proxy for body size, such as *MD*, is one of the variables.

### Issues with dataset composition

Separate from the data selection issues that involve the bone ballast hypothesis, our examination identified several problems related to dataset composition. We show below that one of the variables used by Fabbri *et al*. should have been omitted, as it reduces the statistical power of their analysis. We also found that removal of deep-diving and graviportal taxa from the training datasets was performed inconsistently and using subjective judgments that appear unjustified. This subsection addresses these issues, and their consequences, in turn.

#### Unnecessary inclusion of *MD* in the analysis.

The justification for using *Cg* as an independent variable is based on the bone ballast hypothesis explored above. We turn now to the use of *MD* (in the form of log_{10}(*MD*)) and why it is included. One possible reason would be to explicate the role of body mass, using femoral diameter as a proxy. This reasoning is undercut by several factors: the lack of suitable large body mass exemplars in the ds1 and ds2 training sets; the removal of most of the exemplars that could serve that purpose in the ds3 and ds4 training sets; and the fact that the training sets are biased to include taxa that have much lower body mass than the test taxa do.

In their initial round of analysis, Fabbri *et al*. used PGLS to test whether the categorical variable for “subaqueous foraging” predicts *Cg*. As a follow up, they performed a phylogenetic ANOVA with all possible pairs of variables, including “subaqueous foraging” and *MD*, as well as “subaqueous foraging” and flight, to see which combinations predict *Cg* best (details in Materials and methods).

The results of including *MD* and *Cg* in the regression were described, and values of the Akaike Information Criterion (AIC) were compared with a regression that included *Cg* alone [15]:

Models that include flight or shaft diameter as additional covariates receive less support from AIC (Table 1, Supplementary Tables 3 and 4). This indicates that evidence for an amniote-wide common allometry in bone density, or for association of flight with decreased skeletal density aquatic adaptation (see Table 1, S8 and S9 Figs, Supplementary Tables 3 and 4).

A model that has less support from AIC is one that offers lower explanatory power. Their calculations show that, for the femur dataset, the model that includes subaqueous foraging and *Cg* and *MD* has (under the assumptions behind AIC weights) about *49 times less* explanatory power than a model that includes subaqueous foraging and *Cg* without *MD*. For the rib dataset, the model including *Cg* and *MD* similarly has AIC weights 34 times smaller than those of the model that includes *Cg* without *MD*. (Their position on evidence for “common allometry” remains unclear because the sentence seems to have been truncated in publication.)
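The link between a difference in AIC and a ratio of AIC weights can be made concrete with a short calculation. The sketch below uses the standard Akaike-weight formula, in which each model's relative likelihood is exp(−ΔAIC/2). The AIC scores are hypothetical placeholders chosen only to reproduce a 49-fold ratio; they are not values taken from Fabbri *et al*., and the function names are ours for illustration.

```python
import math


def akaike_weights(aic_values):
    """Return Akaike weights for a list of AIC scores (lower AIC = more support)."""
    best = min(aic_values)
    # Relative likelihood of each model compared with the best-supported one.
    rel_likelihoods = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel_likelihoods)
    return [r / total for r in rel_likelihoods]


def delta_aic_for_ratio(ratio):
    """AIC difference that corresponds to a given ratio of Akaike weights."""
    return 2.0 * math.log(ratio)


# Hypothetical AIC scores (assumption: NOT the values reported by Fabbri et al.).
aic_without_md = 100.0
aic_with_md = aic_without_md + delta_aic_for_ratio(49.0)

w_without, w_with = akaike_weights([aic_without_md, aic_with_md])
evidence_ratio = w_without / w_with

print(f"Delta AIC for a 49-fold ratio: {delta_aic_for_ratio(49.0):.2f}")
print(f"Weight ratio (without MD / with MD): {evidence_ratio:.1f}")
```

Under these assumptions, a 49-fold ratio of weights corresponds to an AIC difference of roughly 7.8 units, which by common rules of thumb indicates considerably less support for the model that includes *MD*.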

Conventional statistical analysis would not further consider *MD* after results such as these that show that adding *MD* dramatically *reduces the model’s explanatory power*. Fabbri *et al*. proceeded to use the inferior model nonetheless, possibly because pFDA requires at least two independent variables. After *Cg*, *MD* is the best-performing of the remaining variates. We find that the analysis should have switched to one of the many statistical methods that could be used to analyze *Cg* alone, recognizing that pFDA is not an appropriate choice, for this reason as well as others we have noted above.

The AIC results also show that the variable *F* decreases the explanatory power of the model, which should have constrained the class comparison more tightly to *F0D0* versus *F0D2*. Instead, Fabbri *et al*. used the statistically inferior comparison between “subaqueous foraging or not,” leaving the flying taxa (*i*.*e*., *F* = 1 and *F* = 2) in the analysis, as is clear from the data files listed in Table 2 and provided in our Supporting information.

Statistical arguments aside, it is puzzling that a pFDA analysis classifying spinosaurs and other nonavian dinosaurs included flying taxa in its training datasets. None of the dinosaurs in the test taxa has been proposed as being able to fly, so flying taxa are not reasonable biological analogs for *Spinosaurus*. The inclusion of irrelevant taxa risks adding both random and possibly non-random variation that confounds the correlations; indeed, the weak AIC results show that it did have that effect. If this variation is unequally distributed across the two classes, it could bias the classification.

Like many of the other issues raised in this study, the use of models proven to have less statistical evidentiary power is sufficient on its own to call the whole analysis into question. The lesson for future studies is to follow the evidence. If including a variable in the analysis reduces explanatory power, do not use it. The point of using AIC or related criteria to compare models is to identify counterproductive variables so that they can be avoided.

#### Removal of “deep diving” taxa.

We found that Fabbri *et al*. incorrectly stated that the bone ballast hypothesis holds for all amniotes. A later section of their paper concedes this point [15]:

Deep diving animals, such as ichthyosaurs, mosasaurs, living cetaceans and seals, are characterized by lower bone density when compared to shallow-water subaqueous foragers: the compact bone cortex of deep divers is replaced by cancellous bone characterized by extensive trabeculae and vascularization.

They attempted to address the inconsistency by manually removing such taxa from datasets ds1 and ds2 to create the smaller datasets ds3 and ds4, which they analyzed separately [15]:

High bone density is therefore an excellent indicator for the initial stages of aquatic adaptation, but poorly distinguishes between wading, deep diving, and terrestrial habits. These limitations can be overcome using anatomical observations because deep diving shows other transformations of the body plan, such as presence of fins and flippers.

Fabbri *et al*. did not code their datasets to include stages of aquatic adaptation. We therefore find that the data cannot be used to infer a correlation with *Cg* that is limited to some stages of aquatic adaptation but not others. Although in principle such a study could be done, they neither performed it themselves nor referenced such a study by others, leaving their statement without any support. Their statement ignores complexities that could confound the approach they suggest, such as the fact that sirenians have high *Cg* but would be difficult to characterize as in the “initial adaptation” to aquatic life, considering that they have already lost their legs. If high bone density (and thus *Cg*) cannot reliably discern subaqueous foragers from wading, deep-diving, and terrestrial taxa, then it is unclear how it could serve as an “excellent indicator” of the early stages of adaptation because wading, diving, and terrestrial taxa are the primary alternatives to compare against.

We find that *any* finned or flippered taxa are poor choices as exemplars for comparison to spinosaurids, which manifestly *do* have terrestrial limbs, and should not have been included in the dataset. Even if one supposes that spinosaurids were on an evolutionary trajectory to become fully aquatic (a highly speculative idea, as no fully aquatic descendants have been discovered), the best points of comparison would still be other taxa that have terrestrially useful limbs and are at an early stage of secondary aquatic adaptation. The datasets that Fabbri *et al*. assembled do not feature such taxa.

The literature on the bone ballast hypothesis does not support deep diving as the sole or primary correlate of low *Cg* in secondarily aquatic taxa. Instead the focus is on low *Cg* among taxa that are fast-swimming and pursuit predators and on high *Cg* among slow swimmers, herbivores, and bottom walkers [50,54].

The vague criteria for removal described by Fabbri *et al*. are not mutually exclusive; air-breathing deep divers are often fast swimmers in order to reach depth while holding their breath, and they are often (but not always) pursuit predators as well. Examples include most cetaceans and extinct ichthyosaurs, plesiosaurs, and their kin. Moreover, not all taxa that show the anatomical features identified were removed by Fabbri *et al*. from their analysis. We compared their unpublished data files (S3 and S4 Files) to the tables that list taxa that were eliminated as deep divers [15: Supplementary Tables 5 and 6]. We found that one of the published tables is incomplete, omitting two taxa that were removed from the rib dataset (see Materials and methods). Our replication study confirmed that Fabbri *et al*. used the data files, not the published tables, to produce their results.

We also found that some taxa listed as deep divers in the data files do not meet the anatomical criteria that Fabbri *et al*. specified: transformations of the body plan or the presence of fins and flippers. The extant cetaceans Bryde’s whale *Balaenoptera brydei* and orca *Orcinus orca* were classified as deep divers and thus removed from ds3 and ds4, whereas the flippered extinct whale *Basilosaurus* was not listed as a deep diver and was thus retained in the analysis. The extinct seal *Callophoca obscura* was removed, yet the extinct seal *Nanophoca vitulinoides* was retained. The plesiosaur *Cryptoclidus* was retained, despite its flippers and recent work suggesting an open-ocean lifestyle for many plesiosaurs [120]. More generally, we found that data for sirenians, plesiosaurs, ichthyosaurs, and nothosaurs were retained, despite their flippers, fins, and flukes. So were penguins, even though they have flippers and are deep divers—*Aptenodytes* being known to routinely dive deeper than 400 meters if required for foraging [121,122]. On the other hand, the extinct *Desmostylus hesperus* was removed, despite being a quadruped, with no fins or flippers. We found no suggestions in the literature that this species engaged in deep diving.

We found that the taxa flagged as deep divers by Fabbri *et al*. do all share a feature that, were it used as a deselection criterion, would have biased the analysis: they are the members of the *F0D2* group that have the lowest *Cg* values. We also found that taxa that should have been removed by their criteria but were retained without explanation have high *Cg* values. *Cryptoclidus*, for example, has *Cg* = 0.97 and *MD* = 84.08, almost the same values as *Spinosaurus* (*Cg* = 0.968, *MD* = 81.52).

We find that the removal of low *Cg* taxa from the *F0D2* group, and the retention of high *Cg* taxa that meet the anatomical criteria for removal, increased the contrast between *F0D0* and *F0D2* and biased the classification of spinosaurs toward *F0D2*. Removing data points simply to improve the appearance of correlations is a form of data manipulation that violates standards of statistical practice.

Fully aquatic taxa with fins and flippers should not be used as points of comparison for spinosaurs, which had functional legs and feet. The bone ballast hypothesis does not suspend normal critical judgement about anatomy. It is not a universal rule across amniotes; instead its correct interpretation depends on knowing much more than just *Cg* and *MD*. In particular, taxa that do not use bone as ballast are not simply deep divers—they also include fast swimmers and pursuit predators.

#### Removal of “graviportal” taxa.

In addition to removing deep-diving taxa from the ds3 and ds4 datasets, Fabbri *et al*. removed graviportal taxa, which they claimed to select by applying the following criterion [15]:

Graviportal animals can be distinguished from aquatic species by the presence of columnar limbs, an anatomical trait which is generally missing among subaqueous foragers.

They acknowledged in their paper that high body mass may also lead to elevated *Cg*, as we discussed in a previous subsection, but the paper did not explore that likely confounder with statistical tests or other methods.

The term “graviportal” has no universal definition and has been used variously to indicate particular bone length ratios, posture, locomotive mode, and limb articulation [107,111,123,124]. Originally it referred to the relative length of upper and lower limb bones [125], but it is also referred to as a posture or mode of terrestrial locomotion. More recent studies, however, have shown that both posture and locomotor mode lie on a continuum, with position better captured by osteological indices, such as ratios of length to width in long bones [107,108,123,126,127]. Some reserve the term “graviportal” exclusively for a mode of quadrupedal locomotion, whereas others outline criteria for “graviportal bipeds” [110,128–130].

*Spinosaurus* meets those bipedal criteria for being graviportal, consistent with findings that it was bipedal [14]—findings that overturned earlier suggestions that it was quadrupedal [5]. But even if *Spinosaurus* were quadrupedal, limb-ratio tests would classify it as graviportal. *Baryonyx* and *Suchomimus* are both considered bipeds and qualify under the bipedal criteria. Other than their novel and unsupported criterion, Fabbri *et al*. provided no justification for removing graviportal exemplar taxa from a study of graviportal spinosaurids.

Fabbri *et al*. asserted that “graviportality does not affect rib compactness” [15], but our literature review identified two recent studies suggesting that bone density in the limbs and in the ribs is often correlated [46,131].

We find the precise definition of graviportal to be irrelevant to the question of *Spinosaurus* ecology because all large-bodied animals tend to have higher bone density and higher *Cg*, irrespective of their posture or mode of locomotion [53,108,109,132].

As with their removal of deep-diving taxa, Fabbri *et al*. presented a succinct anatomical criterion for removal of graviportal taxa but failed to apply it consistently. Three rhinoceroses (*Ceratotherium simum*, *Rhinoceros sondaicus*, *Rhinoceros unicornis*) were removed from datasets ds3 and ds4, despite having flexed rather than columnar limbs, the ability to gallop, and other nongraviportal characteristics. They do have high *Cg* for a terrestrial animal [52], however. The extinct hippopotamus *Hexaprotodon garyam* was culled, despite distinctly flexed limb postures. The common hippo *Hippopotamus amphibius* was retained (and assigned to the habitual diving group *F0D2*), yet the pygmy hippo *Choeropsis liberiensis* was eliminated as graviportal. The ichthyosaur *Mollesaurus* was also eliminated as graviportal, somehow meeting the criterion for columnar limbs despite having no legs.

If the goal in removing a swathe of taxa, under the dubious rationales of graviportal body type and deep-diving behavior, was to improve the apparent accuracy of the pFDA analysis, the approach had the intended effect. Fabbri *et al*. reported that the correct classification rate improved from around 84% with the complete datasets to 90% [15] for the selectively culled datasets ds3 and ds4. Confidence intervals for the latter analysis widened, however, as a result of the smaller sample size, as we show below in a subsection reporting results from our analysis of the effects of training set size.

We found that the removal of graviportal taxa from the terrestrial *F0D0* class had the effect of removing large-body-mass exemplars from the comparison. That choice of method effectively precluded a proper consideration of a highly plausible alternative hypothesis: that spinosaurs had high *Cg* because they were heavy. The fact that the spinosaurs themselves would be classified as graviportal by current metrics for the condition—and thus eliminated from the analysis—clearly renders the results of that analysis unusable.

#### *Cg* disparity between extinct and extant taxa.

Fabbri *et al*. used *Cg* data from a combination of extinct and extant taxa, and they chose an analytical method that implicitly assumes that the statistical distributions of *Cg* are the same for extinct taxa and extant taxa. If the data violate that assumption, that could bias the classification results directly, so one could not draw valid conclusions from the analytical results (see Materials and methods).

In the original pFDA study [21], Motani and Schmitz included only extant taxa in the training dataset that they used to classify extinct test taxa. In that study, however, the test variates were eye-socket dimensions, which have a strong basis in optics, so there was little concern on the matter. In a review of the literature through 2022, we found that all other pFDA studies also made exclusive use of extant taxa for training. Classification of extinct species with an algorithm trained on measurements of extant species is vulnerable to a systematic difference in distribution between the two groups. There is less of an impact on classification if *all* members of each class in the training set have the same status as extinct or extant, however.

In the mixture of extant and extinct exemplar taxa assembled by Fabbri *et al*., the ratio of extinct to extant taxa varies considerably among the training subsets *F0D0*, *F0D2*, etc. We compared the statistical distributions of *Cg* between the two groups and found striking differences in bone compactness between extinct and extant taxa of similar lifestyle in some of the datasets (Table 3). The femoral *F0D2* dataset, for example, shows a strong bias toward higher values of *Cg* among extinct taxa. We ranked the *F0D2* group by *Cg* and found that 20 of the top 21 taxa are extinct (Table 3). The two extant taxa having the highest *Cg* rank 16 and 22 in this dataset (Table 3). The disparity is particularly worrisome because spinosaurids were clearly nonfliers and therefore must either have been nonflying divers (*F0D2*) or terrestrial (*F0D0*).

*Cg* might potentially be higher among extinct taxa as a result of several factors: secondary mineral deposition/precipitation in porous bone during fossilization; difficulty in measuring *Cg* in cases where the rock matrix is hard to distinguish from bone; repair to damage in fossil specimens; the specific choices of extinct taxa included; or other reasons. Whatever the causes, the effect is very strong and clearly presents a potentially confounding factor for both the bone ballast hypothesis and classification via pFDA.

Among the 59 specimens in *F0D2*, many more represent extinct (43 or 72.9%) than extant species (16 or 27.1%). We evaluated whether the observed imbalance could be due to random chance by first applying a permutation test of *Cg* rank, with the null hypothesis being no difference between the *Cg* values of extinct versus extant taxa (see Materials and methods). We calculated *P* values for the null hypothesis that the rank distributions of extinct and extant are the same (Table 4). We then performed a second, “coin-flip” test, using a binomial distribution to determine coin-flip *P* values for an alternative null hypothesis that the counts of extinct and extant specimens in each group resulted from random chance. These tests were performed on all four dataset variations (Table 1) for both *F0D0* and *F0D2* subsets of femur and rib data.
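A rank-based permutation test of the kind described above can be sketched in a few lines. The version below is a minimal illustration, not the implementation used in this study: it assumes a two-sided test whose statistic is the deviation of one group's rank sum from its expected value, and the *Cg* values shown are invented placeholders rather than entries from Table 3.

```python
import random


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation P value for equal rank distributions in two groups."""
    rng = random.Random(seed)
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a = len(group_a)
    # Expected rank sum of group A if group labels carry no information.
    expected = n_a * (len(ranks) + 1) / 2.0
    observed = abs(sum(ranks[:n_a]) - expected)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ranks)  # randomly reassign ranks to the two groups
        if abs(sum(ranks[:n_a]) - expected) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible P = 0.
    return (hits + 1) / (n_perm + 1)


# Invented Cg values for illustration only (assumption: not real data).
extinct_cg = [0.95, 0.97, 0.92, 0.99, 0.90, 0.96, 0.93, 0.98]
extant_cg = [0.70, 0.82, 0.65, 0.91, 0.75, 0.60]
p_value = rank_permutation_test(extinct_cg, extant_cg)
print(f"Permutation P value: {p_value:.4f}")
```

Working on ranks rather than raw *Cg* values makes the test insensitive to the shape of the underlying distributions, which is useful when those shapes are themselves in question.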

The permutation tests on *F0D2* femoral data from ds1 and ds3 have *P* ≤ 0.0011; we therefore reject the null hypothesis that the distribution of *Cg* values is the same for extinct and extant taxa (Table 4, shaded). The test result for the *F0D0* rib data from ds2 similarly rejects the null hypothesis at a high level of significance (Table 4, shaded). Overall, these statistical tests and *P* values demonstrate that the distributions from which *Cg* is drawn differ for extinct and extant taxa, at least for these datasets.

We find that the foundational assumption that *Cg* can be used as a marker for both groups is violated and that the resulting classifications are biased by differences in the distributions of *Cg*. In the ds1 (femoral) dataset, pFDA assesses the statistical affinity of test taxa with the “subaqueous foraging” class *F0D2* (shown in Table 3) and with the terrestrial *F0D0* class. *F0D2* is 72.9% extinct and 27.1% extant, but for *F0D0* the imbalance is reversed: 8.5% extinct and 91.5% extant. The large imbalance in extinct versus extant, coupled with the fact that the *Cg* distributions are not the same (*i*.*e*., the null hypothesis of the permutation test is rejected), makes classification using these datasets suspect.

The results of the coin-flip tests on all four *F0D0* datasets further indicate that it is extraordinarily unlikely that the pronounced imbalances between extinct and extant taxa in these datasets are the result of random chance (Table 4, shaded results). In the rib datasets ds2 and ds4, it appears plausible that the *F0D2* data were randomly selected from both extinct and extant taxa. That is also the only subset for which the permutation test does not reject the null hypothesis of equal *Cg* rank distributions. However, pFDA compares this seemingly balanced set to the rib *F0D0* subset, which has an extreme extinct/extant imbalance (3.2% to 96.8%) and highly significant rejection of the null hypothesis that the *Cg* rank distribution is equal. Since both class datasets must be valid for the pFDA test to be valid, we find the results of this classification also to be suspect.
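The coin-flip test amounts to an exact binomial calculation, sketched below for the femoral *F0D2* counts quoted earlier (43 extinct among 59 specimens). The sketch assumes a fair-coin null (probability 0.5 for each status) and a two-sided tail; the function name is ours, and the exact options used to produce Table 4 may differ.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided P value for k successes in n fair coin flips (p = 0.5)."""
    extreme = max(k, n - k)
    # With p = 0.5 the distribution is symmetric, so double the upper tail.
    upper_tail = sum(comb(n, i) for i in range(extreme, n + 1))
    return min(1.0, 2 * upper_tail / 2**n)


# Counts quoted in the text for the femoral F0D2 subset of ds1.
p_f0d2 = binomial_two_sided_p(43, 59)
print(f"Coin-flip P value for 43 extinct of 59: {p_f0d2:.2e}")
```

Under these assumptions the split of 43 to 16 falls well below the conventional 0.05 threshold, so an imbalance of this size would be unlikely if extinct and extant taxa had been sampled with equal probability.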

The results of our statistical analysis cast grave doubt on the validity of classifications made with the datasets employed by Fabbri *et al*. Whether the marked discordance between the distributions of *Cg* for extinct and extant taxa arises due to biological differences or measurement or selection bias is immaterial to its impact in undermining confidence in the classification of the spinosaurids. Our additional finding that the imbalance between extinct and extant taxa is strongly biased in opposite ways for the two classes raises further concerns about that classification result.

Any future pFDA study that mixes extinct and extant taxa, or any other distinct groups, should explicitly list assumptions about the distribution of variates across the groups and then test those assumptions to ensure that there is no possible bias in the result.

#### Ignored and redundant taxa.

When constructing datasets for comparative analysis, investigators must inevitably make choices and handle pragmatic issues, such as the availability of specimens from the literature or from collections. Fabbri *et al*. incorporated 78 taxa in the ds2 rib dataset from Canoville *et al*. [46] but ignored an additional 43 extant species from that study that are potentially relevant from a comparative basis and would merit inclusion, including varanids that range from semiaquatic (*Varanus salvator*, the water monitor) to large-bodied terrestrial (*Varanus komodoensis*, the Komodo dragon). We found that 15 taxa that Fabbri *et al*. did use from that prior study have *MD*<2 mm, which makes them much less relevant for comparison to enormous spinosaurs.

In contrast, the Triassic aquatic reptile *Nothosaurus* and its close relatives (three genera) are overrepresented, accounting for ~15% of the femoral (*F0D2*) dataset and 21% of the extinct taxa (Table 3). *Nothosaurus* is represented by six specimens. We found that the value of *Cg* in *Nothosaurus* is significantly negatively correlated with *MD* (S1 Fig; *R*^{2} = 0.84). The bone density data for this taxon would thus not scale to the body size of a spinosaurid (because *Cg* would drop to near zero). Some studies have suggested that larger nothosaurs may have adapted to ecosystems and active swimming lifestyles; if so, that might be related to this phenomenon [135,136].

Whatever the cause, the strong negative allometry of *Cg* with *MD* suggests that this is yet another confounding factor complicating the use of *Cg*. We find that the clade should have been dropped from the training dataset; instead, it is overrepresented. Negative allometry of *Cg* with *MD* tends to bias the decision boundary downward. As the spinosaurids are near the top of the distribution of *MD*, this effect could bias their classification toward the *F0D2* class.

Fabbri *et al*. offered no rationale in their paper to justify the choices they made to include and exclude taxa. Future studies of this kind should set reasonable inclusion criteria based on sound biological and statistical reasoning and evidence, and then apply those standards objectively.

#### Variation in *Cg*.

Fabbri *et al*. used measurements made on just two skeletal elements, and in most cases they represented an entire taxon by a single value of *Cg*. Their analysis failed to account for uncertainty in the *Cg* measurements but, perhaps even more problematically, tacitly assumed that it is valid to draw quantitative conclusions about a taxon from one measurement made on a single specimen. This subsection discusses results from our examination of the roles of uncertainty in *Cg* measurement and variation in bone compactness, both within and between specimens.

Prior research has now established that significant variation exists in *Cg* values unrelated to ecology or behavior. Biological factors that might affect *Cg* in dinosaurs as well as relevant extant taxa include: developmental variations among individuals of a species; the sex of the individual; changes in bone compactness that occur during normal ontogeny; variations in *Cg* among skeletal elements; and even variations among different locations along the shaft of a single bone [132,134,137–139]. Diagenetic and taphonomic factors—including fracturing, deformation, infilling, and external erosion—can also introduce variations in *Cg* measurements.

Measurement error is present in any biological parameter and may similarly accrue from several sources, such as the calculation of *Cg* from thin sections or CT scans, decisions taken in thresholding images, and the degree of repair of cracks and missing bone in damaged or incomplete specimens.

We searched the literature for systematic studies that have examined how *Cg* varies across the various possible sources of biological variation or measurement error. Finding none, we asked workers in the field of bone microanatomy if they were aware of studies that have quantified such variations. We were told that there have been no such systematic quantitative studies published to date.

Our search identified just one report that included more than a handful of *Cg* values for a single taxon. That paper focused on the manatee *Trichechus manatus latirostris* [140]. Domning and de Buffrénil measured *Cg* values in ribs from thin sections in 12 individuals that included males and females, as well as growth stages from 50 to 1057 kg. *Cg* values ranged from 0.8389 to 0.9962, with a mean of 0.9109 and a standard deviation of 0.0417, a relative range of 17.3% (relative range: high minus low values, divided by the mean and then multiplied by 100). Excluding the youngest (and smallest) three individuals reduces the size range (161 to 1057 kg) but has a trivial impact on the range of *Cg* variation (13.2%).
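The relative-range statistic defined above is simple to reproduce. The following sketch applies the stated definition (high minus low, divided by the mean, times 100) to the summary values quoted for the manatee ribs; the function name is ours, introduced only for illustration.

```python
def relative_range(low, high, mean):
    """Relative range in percent: (high - low) / mean * 100."""
    return (high - low) / mean * 100.0


# Summary values quoted in the text for 12 manatee rib sections.
manatee_rr = relative_range(low=0.8389, high=0.9962, mean=0.9109)
print(f"Manatee rib relative range: {manatee_rr:.1f}%")
```

Because the statistic depends only on the two extreme values and the mean, it is sensitive to outliers and tends to grow with sample size, which should be kept in mind when comparing taxa represented by different numbers of specimens.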

As a step toward a multitaxon study, we compiled multiple measurements of *Cg* from multiple individuals within or across studies for all taxa present in the datasets used by Fabbri *et al*. (Table 5 and references therein). We did the same for taxa in a recent study of flightless birds that included multiple individuals of the same species (Table 6) [110]. We find that multiple measurements of *Cg* in the same bone of the same species often exhibit relative ranges exceeding 10%. The median relative range among the entries in Tables 5 and 6 is 18.6%.

Median variation of 18.6% is a very large percentage, given the limited range of *Cg*. For example, the mean value of *Cg* in the ds1 *F0D0* subgroup is 0.610, whereas for ds1 *F0D2* it is 0.840—a relative range of 31.8%. The mean for ds2 *F0D0* is 0.653, and for ds2 *F0D2* it is 0.827—a relative range of 23.6%. To place this in context, the variation among individuals is more than half (58.6%) of the variation between the *F0D0* and *F0D2* groups for femora, and it is more than three-quarters (79.0%) of the intergroup variation for ribs.
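The comparison in the preceding paragraph can be sketched in a few lines of Python. Two assumptions are made here and are not stated in the text: the denominator of the between-group relative range is taken to be the mean of the two group means, and ds1 and ds2 are taken to be the femoral and rib datasets, respectively. Because the inputs are the rounded values quoted above, the results may differ from the reported percentages by a few tenths of a percentage point.

```python
def relative_difference_percent(a, b):
    """Difference between two values divided by their mean, times 100."""
    return abs(a - b) / ((a + b) / 2.0) * 100.0

# Group means quoted in the text (rounded to three decimals)
femora_between = relative_difference_percent(0.610, 0.840)  # ds1 F0D0 vs F0D2
ribs_between = relative_difference_percent(0.653, 0.827)    # ds2 F0D0 vs F0D2

# Median relative range among conspecifics (Tables 5 and 6), in percent
median_within = 18.6

# Share of the reported between-group difference that the
# within-species median amounts to
femora_share = median_within / 31.8 * 100.0
ribs_share = median_within / 23.6 * 100.0

print(f"between groups: femora {femora_between:.1f}%, ribs {ribs_between:.1f}%")
print(f"within/between: femora {femora_share:.1f}%, ribs {ribs_share:.1f}%")
```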

As noted above, Fabbri *et al*. reported *R*^{2} coefficients from their PGLS regression on the taxa with *D* = 2 (“subaqueous foraging”) that indicate that *D* explains about 17.2% of the total variation in *Cg* for the femoral dataset and 10.8% for ribs. The 18.6% median value for *Cg* variation between individuals exceeds both of those *R*^{2} values. We thus find that the variation among individuals could be larger than the differences in *Cg* that Fabbri *et al*. found between subaqueous foragers versus not.

Most of the taxa in Tables 5 and 6 are represented by only two or three specimens. Better assessment of variation in *Cg* will require larger sample sizes for *Cg* among conspecifics and across a greater range of taxa.

It is possible that larger samples of individual variation and studies on more species would show the median variation found here to be atypical. But even a few taxa that exhibit large relative ranges—such as the maximum range in *Cg* observed in flightless birds (47.5% in femora of *Rhea americana*, Table 6)—could bias a discriminant analysis, whether LDA or pFDA. The available dataset of cases in Tables 5 and 6 is too small to accurately characterize variation, and at present we cannot determine the sources of variation. We do have enough information, however, to caution researchers about the degree to which individual variation could bias or invalidate statistical analyses.

Until the variation in *Cg* is better characterized, extra care must be used in any analysis that attempts to use *Cg* to classify taxa—whether by pFDA, LDA, or other statistical methods. Qualitative descriptions or broad observations of increased *Cg* in some taxa versus others may not be impacted, but statistical methods that hinge on the precise value of *Cg* are very much at risk of being affected.

### Variable *Cg* and infilling

Medullary cavities in long bones of the fore and hind limbs of *Spinosaurus* are variably infilled (Fig 4B and 4C). Fabbri *et al*. based their estimated *Cg* for *Spinosaurus* on one thin section taken from one fully infilled subadult femur (Fig 4D). A second femur of *Spinosaurus* [145] (Fig 4A and 4B) is slightly larger than the infilled neotypic femur (Fig 4C) but has a significant medullary cavity lined with cancellous bone that would register as significantly less dense in thin section at midshaft. Only two subadult femora are available for *Spinosaurus*—too few to generalize whether such variation is common or rare. In extant birds, intraspecific variation has also been recorded in the volume and location of medullary cavities [146]. These examples underscore the need to sample species more broadly rather than to accept a single measurement of bone compactness as representative of a given species.

(A, B) proximal one-half of an isolated right femur in anterior view and distal midshaft cross-sectional views (CMN 41869); (C) CT scan of the left femur of the neotype with eight cross sections (FSAC-KK 11888); (D) CT scan of an isolated right phalanx I-1 in sagittal cross section (UCRC PV8). Abbreviations: at, anterior trochanter; h, head; mc, medullary cavity. CMN 41869 images provided by Jordan Mallon.

Some evidence shows that *Cg* can vary significantly with position along the shaft of long bones or ribs as bone diameter changes and cross sections encounter external trochanters or condylar ends, or for other reasons such as a variable medullary cavity. Klein *et al*. made a sequence of thin sections along the shaft of a dorsal rib of the marine reptile *Nothosaurus* [147] and found that *Cg* varied by ~35% within the rib (Fig 5). Although Fabbri *et al*. did not include this particular specimen in their study, nothosaurs make up a significant part of their dataset. *Cg* variation along the shaft of femora or ribs of other taxa has not been documented. While femoral sections are often taken at mid-shaft, no such standard exists for ribs.

Images from Klein *et al*. **[147: Fig 3, panels B1–B8]** were used to measure variation in *Cg* (~35%) along the rib shaft. Images supplied by Nicole Klein.

This intraspecimen variation of 35% in a *Nothosaurus* rib is nearly double the median variation of 18.6% measured above between individuals. Though it is based on only a single rib specimen, which may not be representative, we found in our analysis above that *Nothosaurus* taxa have variation in femoral *Cg* that is much smaller than the median (Table 5).

Several recent studies have examined bone microanatomy variation within a long bone by making 3-D scans of the entire bone. Previous studies have usually assumed that amniote long bones have a relatively uniform, usually tubular, structure along the diaphysis, in which case a sample taken mid-diaphysis would fairly represent the majority of the bone [41,148–150]. However, when Nakajima *et al*. studied the humerus of 52 species of turtle, they found considerable variation in 3-D bone microanatomy along the diaphysis, suggesting that there could be corresponding variations in *Cg*, but unfortunately reported only mid-diaphyseal *Cg* values [137]. Similarly, Houssaye and Botton-Divet imaged the humerus and femur from eight species of otter and found considerable internal 3-D differences in bone microanatomy along the diaphysis, but reported only mid-diaphyseal *Cg* [134]. Amson scanned the humerus of specimens of 164 taxa of extant and extinct therian mammals and helpfully reported *Cg* values at multiple points along the proximodistal axis [138]. To facilitate comparison of bones of different lengths, the study rescaled positions along the axis to fall within the range 0 (proximal end) to 1 (distal end). *Cg* values were reported for the range from 0.3 (*i*.*e*., a distance 30% of the length of the bone from the proximal end) to 0.7. Amson found that in many taxa, but not all, *Cg* is not quasi-constant along a tubular structure but instead tends to increase from the proximal to distal portions of long bones, suggesting a linear gradient of bone infilling. The slope of the gradient differs for aerial, aquatic, subterranean, and terrestrial mammals, suggesting that bone microanatomy details across a bone may have greater potential for lifestyle inference than a single point measure of *Cg* does. Amson reported the mean *Cg* values at the 0.3 and 0.7 scaled distances from the proximal end of the humerus for each lifestyle group [138: Table 1].

From these average values, we calculated the variation in *Cg* (*i*.*e*., (max−min)/mean) between those two points in the same bone: aerial is 17.9%, aquatic is 15.6%, subterranean is 32.1%, and terrestrial 32.8%. With few exceptions, *Cg* varied considerably across different points of the same humeral specimen [138: Fig 2]. These results support Amson’s conclusion that “there is a rather consistent increase in global compactness along the diaphysis of therian mammals.” This effect would explain variation in different positions along the same specimen. It also suggests that using the *Cg* measured at a single point may not capture the bone ballast effect of the bone very well. If there is a linear gradient in *Cg* with different slopes for each lifestyle, the integrated effect of *Cg* on the whole bone mass would be systematically biased.

However, this study was limited both to mammals and to the humerus. Although the results are suggestive, we cannot confidently extrapolate them to other groups or other skeletal elements. The variable infilling of long bones, as well as variation along different locations in the same skeletal element, still present sources of uncertainty. Until these effects are quantified, caution is required for any analysis that relies on precise values of *Cg*.

#### Attempted replication of *Cg* values reported for spinosaurids.

Fabbri *et al*. reported new *Cg* values for spinosaurid taxa from measurements they made for their study [15]. We attempted to replicate those measurements by applying methodology from the literature (see Materials and methods). We also made a new measurement of *Cg* in a *Spinosaurus* specimen that was not included in ref. [15], but that they subsequently analyzed [16]. The results of our replication attempts provide useful examples of the extent to which variation in measurement contributes to the uncertainty of reported *Cg* values.

In addition to the biological, diagenetic, and taphonomic sources of variation described in the previous subsection, methodological differences in bone density determination can introduce variations in *Cg*. Relevant factors include the source type of bone section analyzed (CT digital scan, mounted thin section); the threshold value used to binarize a section image; and contour, masking, or repair steps taken prior to measurement of *Cg*. The many sources of variation increase the likelihood that independent researchers will obtain different quantitative values for *Cg* when deriving measurements from the same specimen or even the same cross section.

Fabbri *et al*. reported a very high *Cg* of 0.968 for *Spinosaurus*, a value that they calculated from a binarized image of a two-part thin section from the femoral shaft of the neotype skeleton [5]. That thin section, which was made by one of us (PCS), was taken on the narrow portion of the femoral shaft below the fourth trochanter (Fig 4C, section 5) and shows complete infilling of the medullary cavity. The binarized image from which Fabbri *et al*. calculated *Cg* showed a small oval core of low density and an open (white) crack separating a portion of the cortex [15: Fig 1B; 16: Fig 1A].

Inspection of the thin section under magnification reveals several details that are otherwise impossible to discern from whole or half thin-section images. First, there is no medullary cavity, despite a dark-stained region in the center of the bone shaft (Fig 6). The central core is entirely filled in with bone that is slightly more cancellous. Second, a vertically oriented dark-red zone to the right of the core, which shows up as a less-dense zone in the binarized image published by Fabbri *et al*., is an artifact of hematitic stain. We found no difference in the bone texture or density of this zone when we viewed it under magnification. Third, a crack separating a portion of the lower-left thin section occurred during production and mounting of the thin section. The gap created by the crack should be closed digitally prior to *Cg* measurement. The section was also cut into two pieces, creating the horizontal gap, which should also be filled before analysis.

(A) Transmitted light image of a two-part thin slice from the mid shaft. (B) Thin slice image modified to close gaps created by natural breaks. (C) Binary image and associated *Cg* value without filling the cracks and the gap between section halves. (D) Binary image and associated *Cg* after filling the cracks and the gap between section halves.

We found that accounting for these factors slightly elevates the *Cg* of the neotype femoral section to 0.998, our best estimate. This section shows an essentially fully infilled condition, whereas the binarized image reported by Fabbri *et al*. shows what appears to be an ovoid, less-dense core that results in their *Cg* estimate being approximately 3% lower than ours.

In response to our critique of Fabbri *et al*. [64], they incorrectly cited our response and introduced misinformation [16]:

Additionally, based on CT scan imaging, Myhrvold *et al*.^{1} accuse us of ignoring a medullary cavity in the femur of the neotypic specimen of *Spinosaurus* and that we are incorrectly oversampling bone tissue based on a thin section of the femur. As shown in Fig 1, cross sections obtained from the CT scan presented by Myhrvold *et al*.^{1} lack adequate contrast and resolution, obscuring any details of its internal structure, contrary to the thin section used in our study.

We have not suggested at any point that there is a medullary cavity in the neotypic femur (FSAC-KK 11888), neither when the thin section was first published [5] nor as later discussed in our critique [64]. The femoral CT scan of FSAC-KK 11888 figured here (Fig 4C, section 5) shows a slightly lower density toward the core but no medullary cavity.

The more salient finding is that infilling of the medullary cavity of the femur in *Spinosaurus* is variable, as shown by a second specimen (CMN 41869) of similar body size from the same beds in Morocco [64]. A persistent reduced medullary cavity is exposed by fracturing of the shaft (Fig 4A and 4B) and has been visualized with a CT scan proximal to the break (Fig 7). The absence of matrix infilling of cracks or external erosion obviates the need for digital repair prior to measurement of bone compactness. To calculate *Cg*, we used Mimics to threshold and transform the 8-bit grayscale pixels (256 gray levels, values from 0 to 255) in the original CT image (Fig 7A) to binary (value 0 or 1). We explored the choice of threshold value as a factor by generating *Cg* values from three images made using thresholding with lower-end gray values of 26, 31, and 36, which yielded *Cg* values of 0.888, 0.849, and 0.804, respectively, a relative range of 9.9% (Fig 7B–7D). As anticipated, the *Cg* values from CMN 41869 are significantly lower than the values of 0.998 and 0.968 that we and Fabbri *et al*. measured for the nearly solid neotypic femur (FSAC-KK 11888).

(A) CT section from the proximal end of the shaft. (B–D) Section images and corresponding *Cg* values after processing with gray-value (GV) lower thresholds ranging from 26 to 36 GV on a 256 GV gradient. Threshold values determine which pixels are regarded as bone versus nonbone; higher thresholds yield lower *Cg* values.

We selected the middle image (Fig 7C) as the best binary visualization because it registers the less-dense cancellous bone near the medullary cavity without also obliterating what appear to be vascular canals in adjacent cortex on the left and lower sides of the medullary cavity. Our *Cg* result of 0.849 is very close to the mean (0.847) of the three values obtained from our thresholding range. In this case, there is no physical thin section to examine under magnification in polarized light to verify what is bone or mineralized infill.
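The dependence of *Cg* on the lower gray-value threshold can be illustrated with a short Python sketch. This is not the Mimics workflow used for Fig 7. It assumes that *Cg* is the fraction of pixels inside the section outline that are classified as bone, and it uses an entirely synthetic, hypothetical cross section (a disk whose gray value rises from the core toward a dense cortex). The function names and all gray values in the synthetic image are illustrative assumptions.

```python
import math

def global_compactness(image, section_mask, lower_threshold):
    """Fraction of in-section pixels with gray value >= lower_threshold."""
    bone = total = 0
    for gray_row, mask_row in zip(image, section_mask):
        for gray, inside in zip(gray_row, mask_row):
            if inside:
                total += 1
                if gray >= lower_threshold:
                    bone += 1
    return bone / total

def synthetic_section(size=101, radius=50):
    """Hypothetical circular section: gray value rises linearly from 20 at
    the center to a plateau of 60 in the cortex; pixels outside are 0."""
    center = size // 2
    image, mask = [], []
    for y in range(size):
        gray_row, mask_row = [], []
        for x in range(size):
            r = math.hypot(x - center, y - center)
            inside = r <= radius
            gray_row.append(min(60.0, 20.0 + 2.0 * r) if inside else 0.0)
            mask_row.append(inside)
        image.append(gray_row)
        mask.append(mask_row)
    return image, mask

image, mask = synthetic_section()
cg_by_threshold = {t: global_compactness(image, mask, t) for t in (26, 31, 36)}
for t, cg in cg_by_threshold.items():
    print(f"lower threshold {t} GV: Cg = {cg:.3f}")
```

In this toy image, raising the lower threshold reclassifies the less-dense core pixels as nonbone, so *Cg* falls monotonically, which is the same direction of change seen in Fig 7B–7D. The magnitudes are not comparable to those of a real specimen.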

Although this CT-based femoral section (Fig 7A) was not available to Fabbri *et al*. for their initial analysis [15], they later reported its *Cg* as 0.941 [16] but did not present the binarized image that was used for *Cg* measurement. We are unable to replicate that value, even approximately. Apparently they employed more extreme thresholding than the maximum we considered reasonable (Fig 7B). Extreme thresholding would raise *Cg* by obliterating some of the smaller spaces in the binarized section. In this case, our *Cg* measurements on the same section differed within a relative range of 9.9%, due to subjective procedures used in preparation of binarized images, whereas the relative range between our measured *Cg* values and the value reported in [16] is 15.7%.
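The two relative ranges quoted above can be reproduced from the individual *Cg* values. The sketch below assumes that the denominator is the mean of all values being compared, which is our reading of the definition of relative range given earlier; the function name is illustrative.

```python
from statistics import mean

def relative_range_percent(values):
    """(max - min) / mean of a set of Cg values, times 100."""
    return (max(values) - min(values)) / mean(values) * 100.0

# Our three Cg values for CMN 41869 (lower thresholds 26, 31, 36 GV)
ours = [0.888, 0.849, 0.804]
# The same values together with the Cg reported in [16]
with_reported = ours + [0.941]

range_ours = relative_range_percent(ours)
range_with_reported = relative_range_percent(with_reported)
print(f"our thresholds only: {range_ours:.1f}%")
print(f"including reported value: {range_with_reported:.1f}%")
```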

Nonetheless, it is clear from available specimens of *Spinosaurus aegyptiacus* that some individuals nearing maturity maintained a reduced medullary cavity with a femoral-shaft *Cg* under 0.900. We reported accurately on this variable condition of medullary cavities in the long bones and their presence in certain vertebral centra in *Spinosaurus* [64]:

A second femur of *Spinosaurus*^{2} (Fig 1A and 1B), which is nearly identical in size to the infilled neotypic femur^{3} in their study (Fig 1C), has a significant medullary cavity lined with cancellous bone that would register as significantly less dense as a thin section at mid shaft. Medullary cavities are also variably present in forelimb bones of *Spinosaurus* (Fig 1D) resembling those in the long bones of *Suchomimus*, a fully “terrestrial” spinosaurid by their account. Fabbri *et al*.^{1: ED Fig 10} state that *Spinosaurus* and *Baryonyx* “possess dense, compact bone throughout the postcranial skeleton,” yet all three have pneumatic spaces in their cervical column^{4} that exceed in volume the variable long bone infilling, as well as large medullary cavities hollowing the centra at the base of the tail. Neither of these features are present in any secondarily aquatic vertebrate divers that employ bone density as ballast.

Commenting on this new information on variability, Fabbri *et al*. introduced several errors [16]:

Myhrvold *et al*.^{1} state that a single phalanx of the neotype of *Spinosaurus* possess a medullary cavity, invalidating our inference of widespread osteosclerosis across the postcranium of this animal; we show here that a cross section of the phalanx lacks any medullary cavity, as previously described in Ibrahim *et al*.^{13–14}

and later:

Caudal vertebrae 1 and 4 of the neotype of *Spinosaurus*: contrary to what suggested by Myhrvold *et al*.^{1}, no pneumatization is present in the caudal region of this taxon.

We clearly described the *variable* presence of the medullary cavity in both fore and hind limb long bones in *Spinosaurus*, figuring the medullary cavity along the length of a manual phalanx from an adult individual as opposed to the subadult neotype (Fig 4D). We were aware of the infilled manual phalanges of the neotype. The image they republished of this infilled shaft condition was taken by one of us (PCS in [5]) from a break in the proximal shaft of a proximal manual phalanx, not at midshaft as they indicated [16: Fig 1C]. Medullary cavities are variably present in CT scans of a broader sampling of manual phalanges referable to *Spinosaurus aegyptiacus* from the Kem Kem Group. The centra of anterior caudal vertebrae in *Spinosaurus* and other spinosaurids likewise have a capacious medullary space that hollows the interior of the centrum, as we reported [14]. Contrary to Fabbri *et al*. [16], no one has claimed that the hollowed anterior caudal centra in various spinosaurids are pneumatic.

We examined CT cross sections from a third femur of *Spinosaurus aegyptiacus* from a very young individual, CMN 50382. The bone was collected in the same beds in Morocco as the first two specimens (Fig 8E). This femur, which measures only 11.8 cm in length [145], has a large medullary cavity extending along the length of its shaft; its size indicates that the individual had a body length of approximately 2.0 m. Ontogenetic infilling of the medullary cavity does not appear to have been initiated, with a midshaft *Cg* of approximately 0.695 (Fig 9).

(A) *Suchomimus tenerensis*, adult (holotype), length 107.5 cm (MNBH GAD500). (B) *Suchomimus tenerensis*, juvenile, length 54.6 cm (MNBH GAD72, reversed). (C) *Spinosaurus aegyptiacus*, subadult (neotype), length 61.0 cm (FSAC-KK 11888). (D) *Spinosaurus aegyptiacus*, subadult, estimated length 61.0 cm (CMN 41869, reversed). (E) *Spinosaurus aegyptiacus*, juvenile, length 11.8 cm (CMN 50382, reversed).

(A) 3-D rendering and midshaft cross section generated from a CT scan, with binarized images (green) differing in their lower threshold gray-value setting. (B) Plot showing linear change of about 10% in *Cg* over the threshold range.

The cross sections in Figs 8 and 9 show that there is considerable variation in the morphology of the femora. While Fabbri *et al*. focused their analysis on a single *Cg* measurement from *Spinosaurus*, the biomechanical parameter relevant to the bone ballast hypothesis is whole bone density. In some taxa, it may be necessary to use a bone microanatomy metric for the whole bone, or from multiple sections, but to our knowledge these are not available in the literature and would need to be developed.

In the case of *Baryonyx walkeri*, only the distal one-third of the right femur of the holotype is preserved [4]. There is crushing inward of anterior and posterior intercondylar areas, leaving only a small section of the shaft available for estimating bone compactness. This portion of the shaft was CT-scanned. Fabbri *et al*. used three closely spaced cross sections across ~6 cm of the shaft to generate three estimates of *Cg* ranging from 0.826 to 0.876 (relative range of 5.9%). The two most complete sections generated the minimum and maximum *Cg* values [15: S3E and S3F Fig]. The section generating the maximum value, in which the cracks had been infilled with solid bone, was the one used for *Baryonyx* in their femoral datasets [15: Fig 1B].

We attempted to replicate the *Cg* estimate of 0.876 that Fabbri *et al*. reported for *Baryonyx*, using the scan of NHMUK 9951 available on Morphosource to create new sections near one of the CT scan sections they published [15: S3E Fig]. We prepared three CT sections across 1 cm of shaft (Fig 10) in the region of their preferred section. We also infilled the cracks with solid bone density. We prepared two options for removal of matrix from the medullary cavity, each binarized with three different threshold values. The first option attempted to replicate the exact shape of the medullary cavity they defined and removed (Fig 10A–10C, top row of each panel). As a second option, we examined the CT section and made an independent evaluation of the limits of fossilized bone, adjusting the medullary cavity boundary outward to include more material that did not show bone texture (Fig 10A–10C, bottom row of each panel). We see no positive evidence in the scan for cancellous or cortical bone in those areas; they look more like mineral infill. Both times we filled in the cracks, in an attempt to repair taphonomic damage to the specimen.

CT sections (A–C, posterior aspect of femur oriented toward bottom) were taken at successively more distal positions across 1 cm of the distal shaft of the femur in the portion of the shaft used by Fabbri *et al*. [15: S3E Fig] for their best estimate of *Cg*. The small inset view shows the distal end of the femur in medial view with mm distance from the bottom of the radiograph provided by Fabbri *et al*. To the right are three gray-value (GV) thresholds (left to right: 40–243, 36–243, 36–235) capturing a reasonable range of values that might be selected by researchers to binarize the radiograph. For each threshold, masking of the matrix infilling of the medullary space is shown in transparent (left) and binarized (right) views. Option 1 (top row) attempts to replicate medullary masking as published by Fabbri *et al*. Option 2 (bottom row) eliminates additional medullary material that we confirmed from the CT scan as matrix infill rather than cancellous medullary bone. Fabbri *et al*. reported a *Cg* of 0.876. The *Cg* range for our three slices in the vicinity of their preferred CT section using their masking is 0.873–0.887 (mean 0.880); their *Cg* measure is near the low end of that range. The *Cg* range with our masking is 0.767–0.778 (mean 0.773), considerably lower than their *Cg* measure.

We replicated their medullary cavity masking and found that their reported measure of 0.876 is near the low end of the range of *Cg* we obtained for our three sections (0.873–0.887). However, when we chose our own (slightly larger) masking for the medullary cavity, the range of values obtained (0.767–0.778) excludes their higher estimate for *Baryonyx*. Our mean *Cg* value for the distal shaft of *Baryonyx* (0.773) remains higher than the value reported by Fabbri *et al*. for *Suchomimus* (0.682), but that value seems artificially low. What seems clear at this point is that *Baryonyx*, like *Suchomimus*, retained an average-sized medullary cavity for a large theropod, the distal shaft of which generates a *Cg* less than 0.800.

Fabbri *et al*. figured two magnified thin sections for *Suchomimus tenerensis* identified as “G51” and “G94,” which are field numbers for the holotype (MNBH GAD500) and a referred subadult individual (MNBH GAD70), respectively [15: S2D and S2E Fig]. Neither of these specimens was sectioned, however, and MNBH GAD70 does not preserve more than the proximal end of one femur. We do not believe these thin sections pertain to *Suchomimus*.

One of us (PCS) made a four-part thin section from the distal end of an adult femur of *Suchomimus tenerensis* (MNBH GAD99, Fig 11), which has a length (107.5 cm) and distal condylar width (23 cm) identical to that of the holotype. The position of the section on the distal shaft is similar to that taken in *Baryonyx*. An image of this thin section was refigured as a binarized image by Fabbri *et al*. [15: S3D Fig], who reported a *Cg* of 0.682. We commented, after reexamining the bone, original thin section, and high-resolution images of the section, that there was additional cancellous bone not shown in their binarized image that likely lowered their reported *Cg* value [64].

(A) Composite image of the four-part thin section with an enlargement showing the complex relation between cancellous bone and dark-stained mineral infilling. (B) Cancellous bone (red) adjacent to dark-stained matrix in the core of the femoral shaft. (C) Digital removal of matrix adjacent to cancellous bone. (D) *Cg* from binarized image shown in (C). (E) Comparison of our binarized image (orange) to the *Cg* and binarized image published by Fabbri *et al*. (black) [15: S3D Fig]. (F) Final *Cg* after filling in matrix-filled cracks. (G) Distal femur showing position of thin section.

In their response to our preprint, they introduced misinformation without examining either the thin section or host bone [16]:

Myhrvold *et al*.^{1} suggest that we underestimated bone density in *Suchomimus* during the conversion of the femoral thin section into a black & white figure (the curating step prior to estimation of bone compactness), causing us to mis-identify bone as rock matrix. However, we did not apply our techniques blindly, but instead used careful observation to quantify bone compactness. As shown in Fig 1, the bone tissue in this specimen has a distinct white hue: Myhrvold *et al*.^{1} conflate the mineral infilling surrounding the trabecular bone and bone tissue.

Color variations prevent proper evaluation of many thin sections, including those examined here, without stereoscopic or at least magnified examination. Our examination of the *Suchomimus* section in the original high-resolution images found clear evidence of differentially distributed cancellous bone invading the medullary cavity, especially in the lower two thin-section quadrants (Fig 11A and inset), contrary to Fabbri *et al*. [16]. We differentiated cancellous bone from adjacent dark-stained mineral deposits under stereoscopic magnification of the thin section (Fig 11B). After digital removal of mineral deposits (Fig 11C) and binarization of the image (Fig 11D), a *Cg* of 0.726 was obtained, which is 6% greater than that reported by Fabbri *et al*. (Fig 11E). We made that measurement to be fully comparable with the estimate reported by Fabbri *et al*. without repair of matrix-filled cracks, which also effectively lower *Cg*. We then repaired the cracks to measure our final best estimate of the *Cg* of this specimen of *Suchomimus*, which is 0.740 (Fig 11F), approximately 8% higher than reported by Fabbri *et al*. and only 4% less than our best estimate of *Cg* in *Baryonyx*. Distal femoral shaft sections in *Suchomimus* (Fig 11G) appear to have *Cg* greater than 0.700.

We also took a thin section from the midshaft of a femur from a juvenile *Suchomimus tenerensis* (MNBH GAD72) with femur length approximately half that of the adult. Our examination of that section finds a relatively large medullary cavity (Fig 8B, S2 and S3 Figs). *Cg* in the juvenile (MNBH GAD72) was found to be 0.689 to 0.699 (high and low thresholds).

Our replication experiments produced several notable results. We found that, depending on the specimen, subjective effects arising from the threshold selected to binarize the image, from the removal of matrix, and from repair of cracks and erosion can contribute sizable variation to estimates of *Cg*, up to a relative range of nearly 10% in one specimen. Variations were lower for other specimens, and the nearly solid *Spinosaurus* neotype showed very little variation. Future studies should be mindful of subjective factors such as these because they introduce variation in *Cg* that can confound quantitative analyses that rely on precise *Cg* values.

### Issues with the application of pFDA

The sections above have analyzed the theoretical and statistical justification for using *Cg*, as well as some problematic issues that arise with its use by Fabbri *et al*. In this subsection we present results from our examination of several aspects of the statistical calculations involved in pFDA that directly affect the quality and validity of results.

#### Effects of training-set sample size on classification.

A tacit assumption made by Fabbri *et al*. in their analysis, common to most statistical analyses in biology, is that underlying biological factors produce a true statistical distribution of the variates, and that the dataset classes reflect that distribution. The pFDA method, in particular, assumes that the true distribution for each class conforms to a multivariate normal distribution, or a close approximation thereof. But the parameters of those true distributions are *unknown*. The pFDA method must estimate the parameters from the finite sample in the training dataset. Although this situation is common to virtually all statistical analyses, it strangely seems to have been overlooked in the literature on pFDA, as well as in most biological applications of LDA, aside from a few exceptional examples [151].

Sample size, the number of distinct data points in the training dataset, is a key element determining the statistical power and precision of a pFDA analysis because it controls how well the finite sample approximates the underlying biological distribution. Neither Fabbri *et al*. nor any other pFDA study of which we are aware has offered any analysis of how the size of the training dataset affects classification accuracy.

The accuracy of binary classification has been long studied, and many mathematical metrics have been developed to measure it, including the metrics known as accuracy *A*, balanced accuracy *B*, the Matthews correlation coefficient *MCC*, and the true-positive and false-positive rates, all of which are defined and discussed in S1 Appendix, section 4.
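As a quick reference, the metrics named above can be computed from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions, which we assume match those given in S1 Appendix, section 4. The function name and the example counts are hypothetical and are not drawn from any dataset discussed here.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)                  # true-positive rate (sensitivity)
    tnr = tn / (tn + fp)                  # true-negative rate (specificity)
    fpr = fp / (fp + tn)                  # false-positive rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    balanced = (tpr + tnr) / 2.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"A": accuracy, "B": balanced, "MCC": mcc, "TPR": tpr, "FPR": fpr}

# Hypothetical counts for illustration only
example = binary_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in example.items():
    print(f"{name} = {value:.3f}")
```

Accuracy and balanced accuracy coincide in this example because the two classes have equal size; they diverge when class sizes are unequal, which is why both are usually reported.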

We performed Monte Carlo simulations of an LDA classifier to explore sample-size effects for the case of symmetric multivariate distributions of the form given by Eqs (1) and (2). The results, shown in Fig 12, display noticeable scatter among the empirical centroids derived from groups of 59 pseudorandom data points (chosen to approximately match the count in ds1) (Fig 12A). The empirical centroids only roughly approximate the true centroids of the distributions from which they were drawn. The decision boundaries separating these groups also show considerable scatter in both midpoint and slope. Assessing the classification accuracy of 10,000 trials of two groups of 59 points yields a histogram, which peaks at the theoretical classification accuracy of 0.885 and shows considerable scatter (Fig 12B), with a 95% confidence interval spanning 0.831 to 0.941 (12% of the accuracy).

(A) Decision boundaries and centroids of point groups for 500 trials of 59 points drawn from a multivariate normal distribution of Eqs (1) and (2) with specified values of *σ* and *d*. The centroids of the distributions are shown by the magenta crosses; empirical centroids of each group of 59 points are black dots. The decision boundaries are red lines. Groups of 59 points are insufficient to accurately estimate the distribution centroid. The estimation error leads to variations in both the empirical-group centroids and the decision boundaries across Monte Carlo runs. (B) A histogram of the training-dataset classification accuracy is shown for 10,000 trials with the same parameters as (A). See Materials and methods for definitions of standard error *σ* and distance *d*. The theoretical accuracy for the values of *σ* and *d* shown is 0.885, but the 95% confidence interval extends from 0.831 to 0.941, a width of 12%. (C) Monte Carlo simulations of classification accuracy for point groups with *d* = 1.7 and varying values of *σ* and *n* points per group. Lines show an empirically derived relationship for the width of the 95% confidence interval in classification accuracy: CI width = *a*/*n*^{1/2}, where *a* is a fitting constant determined for each value of *σ*. (D) Monte Carlo simulations of 10,000 trials of LDA plot the lower bound *A*_{LB} of the 95% CI for accuracy *A*. Each curve has a different value of the ratio *d/σ*, from 0.5 to 2.5, and illustrates how the lower bound changes as a function of the number of points in each group, which ranges from 10 to 2500. Abbreviations: CI, confidence interval.

Repeating this 10,000-run Monte Carlo experiment for multiple points per group 10≤*n*≤500 and for values of the standard-error parameter 0.707≤*σ*≤2.83, we find that the width of the 95% confidence interval on classification accuracy closely follows an empirically derived relation (Fig 12C). The general behavior is that the width of the confidence interval scales proportionately to 1/*n*^{1/2} for sample size *n*, as is typical for the normal distribution (and consistent with Eq (3)).
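The sample-size behavior described above can be reproduced in miniature. The following sketch is not the simulation code used for Fig 12. It assumes two isotropic bivariate normal classes centered at (−*d*, 0) and (+*d*, 0), uses *d* = 1.7 with *σ* = √2 (a combination within the ranges quoted above), takes simple percentile intervals, and runs far fewer trials than the 10,000 used in our analysis. All function names are hypothetical.

```python
import random

def lda_training_accuracy(n, d, sigma, rng):
    """Fit a two-class LDA to 2n simulated points; return training accuracy."""
    # Assumed geometry: class 0 centered at (-d, 0), class 1 at (+d, 0),
    # both isotropic normal with standard deviation sigma.
    pts = [((rng.gauss(-d, sigma), rng.gauss(0, sigma)), 0) for _ in range(n)]
    pts += [((rng.gauss(d, sigma), rng.gauss(0, sigma)), 1) for _ in range(n)]
    # Empirical centroid of each class.
    mean = []
    for k in (0, 1):
        grp = [p for p, c in pts if c == k]
        mean.append((sum(x for x, _ in grp) / n, sum(y for _, y in grp) / n))
    # Pooled within-class covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for (x, y), c in pts:
        dx, dy = x - mean[c][0], y - mean[c][1]
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
    dof = 2 * n - 2
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    # Discriminant direction w = S^-1 (m1 - m0); boundary through the midpoint.
    det = sxx * syy - sxy * sxy
    mx, my = mean[1][0] - mean[0][0], mean[1][1] - mean[0][1]
    wx, wy = (syy * mx - sxy * my) / det, (sxx * my - sxy * mx) / det
    cut = wx * (mean[0][0] + mean[1][0]) / 2 + wy * (mean[0][1] + mean[1][1]) / 2
    hits = sum((wx * x + wy * y > cut) == (c == 1) for (x, y), c in pts)
    return hits / (2 * n)

def accuracy_interval(n, d=1.7, sigma=2 ** 0.5, trials=600, seed=1):
    """Median and 95% percentile interval of training accuracy over many trials."""
    rng = random.Random(seed)
    acc = sorted(lda_training_accuracy(n, d, sigma, rng) for _ in range(trials))
    return acc[trials // 2], acc[int(0.025 * trials)], acc[int(0.975 * trials) - 1]

med59, lo59, hi59 = accuracy_interval(59)
med236, lo236, hi236 = accuracy_interval(236)
print("n = 59 :", med59, lo59, hi59)
print("n = 236:", med236, lo236, hi236)
# Quadrupling n should roughly halve the interval width if width ~ 1/sqrt(n).
print("width ratio:", (hi59 - lo59) / (hi236 - lo236))
```

Under these assumptions, the interval for 59 points per group should be broad and centered near the theoretical accuracy, and quadrupling the group size should roughly halve its width.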

The lower bound of the confidence interval thus depends on both the ratio *d/σ* and the number of points in each group, as shown in Fig 12D. The dashed horizontal black line indicates the lower bound of the 95% CI *A*_{LB} = 0.975, which is the value of accuracy associated with *P*_{rand} = 0.05, a heuristic criterion for no more than 5% random error in classification (see Eq (7) of S1 Appendix, section 4). Eq (3) and the relationship between classification error and *A* (Materials and methods) can be inverted to show that *A* = 0.975 when *d/σ* = 1.96, which is the theoretical accuracy value in the infinite limit of group size, at which point the width of the 95% CI would be zero, so the estimate and lower bound are the same. If *d/σ*<1.96, then even an infinite number of points will not achieve *A*_{LB} = 0.975.

The curves in Fig 12D show that the lower bound *A*_{LB} depends strongly on both the number of points in each group used for LDA and *d/σ*. The curve representing *d/σ* = 2.0 (green curve in Fig 12D) asymptotically approaches *A*_{LB} = 0.975. The rightmost point plotted in this curve shows that for *n* = 2560 points per group, it has reached *A*_{LB} = 0.973. A greater ratio of *d/σ* = 2.5 dramatically changes the sample size necessary; *A*_{LB} = 0.975 for as few as 20 points per group (purple curve in Fig 12D).

Although these examples illustrate the threshold *A*_{LB} = 0.975, qualitatively similar behavior occurs for any *A*_{LB} threshold. Eqs (3) and (7) allow us to solve for the value of *d/σ* that reaches a selected threshold value of *A* in the limit of an infinite point count. For values of *d/σ* slightly above the value predicted by Eqs (3) and (7), we may need a large (but finite) number of datapoints. However, for values of *d/σ* that are substantially larger than that value, a much smaller number of datapoints is needed.
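The limiting relationship between *d/σ* and accuracy can be sketched numerically. The code below assumes that the infinite-sample accuracy is *A* = Φ(*d/σ*), where Φ is the standard normal cumulative distribution function. This form is inferred from the values quoted above (*A* = 0.975 at *d/σ* = 1.96) and is not copied from Eq (3); the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def limiting_accuracy(d_over_sigma):
    """Infinite-sample accuracy, ASSUMING A = Phi(d/sigma)."""
    return STD_NORMAL.cdf(d_over_sigma)

def required_ratio(target_accuracy):
    """Smallest d/sigma whose infinite-sample accuracy reaches the target."""
    return STD_NORMAL.inv_cdf(target_accuracy)

for target in (0.90, 0.95, 0.975):
    print(target, round(required_ratio(target), 3))

# Ratios above the threshold leave headroom, so finite samples can reach it.
for ratio in (1.2, 2.0, 2.5):
    print(ratio, round(limiting_accuracy(ratio), 4))
```

Any ratio below `required_ratio(target)` cannot reach the target accuracy at any sample size, whereas a ratio well above it leaves enough margin for a small sample to do so.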

This pattern is an example of the “ecological fallacy,” a common pitfall in statistical inference. Briefly stated, one generally cannot accurately classify a point by comparing it to its statistical distribution or to the average and variance derived from the distribution. In specific cases, the classification may succeed, but only if the variance of the distributions is sufficiently small. The ecological fallacy is discussed further in S1 Appendix, section 1. LDA, FDA, and pFDA are all generally subject to the ecological fallacy; the methods do not guarantee that the classifications they produce will be meaningful, no matter how large the datasets used. Only if *d*/*σ* is sufficiently large *and* the sample is sufficiently large will classification performance be adequate. These two factors interact to produce the behavior seen in Fig 12C.

These results pertain to classification accuracy of the training dataset, but a similar phenomenon occurs for any metric of classification performance. Classification accuracy is linear in the confusion-matrix components, whereas some metrics, such as *MCC*, are nonlinear in the components (S1 Appendix, section 4). The exact form of the relation between 95% CI width and the distribution parameters will thus change, but we expect the overall behavior to be qualitatively similar. In actual practice, we do not know the exact distribution and instead have only the finite sample to work with. Another complication is that pFDA, as used in Fabbri *et al*., has an additional source of random variation due to the creation of randomly generated phylogenetic trees.

We applied a bootstrap approach [27] to the datasets used by Fabbri *et al*. as a way to estimate the finite sample-size effects on pFDA. Implementation details are presented in Materials and methods.

Fig 13A presents the results of this process for 10,000 trials (100 trials, each replicated with 100 random trees). Each decision boundary has a corresponding classification of the points, yielding a confusion matrix, from which we then calculated the classification accuracy for the training dataset. Fig 13B plots the distribution of accuracy values as a histogram. The bootstrap samples also affect the posterior classification probabilities *P*_{2} (Fig 13C) for the *F0D2* group. *P*_{2} is the predicted probability that the taxon should be classified as a “subaqueous forager.” Our results from the bootstrap analysis are qualitatively unsurprising, in view of our results on the simplified synthetic dataset (Fig 12). The effect of small sample size leads to scatter in both the group centroids and the decision boundaries.

(A) Decision boundaries (red lines) and point-group centroids (black dots) for 100 trials created using a bootstrap method described in the text, operating on the *F0D0* and *F0D2* subsets of the Fabbri *et al*. femoral dataset ds1. Each bootstrap trial draws 100 trees at random, each with its own decision boundary. Even more than in Fig 12, considerable scatter is evident in both the centroid positions and decision boundaries. Data points for *Spinosaurus*, *Baryonyx*, and *Suchomimus* are plotted in blue. The downward slopes of most of the decision boundaries, as well as the leftward offset of the centroids of the *F0D0* subset versus that for *F0D2*, show the effect of lower *MD* for *F0D0*. (B) A histogram of classification accuracy of the training dataset is shown for 2000 trials of a parametric bootstrap. The median training-set classification accuracy is 0.795, and the 95% confidence interval is 0.718 to 0.863, a width of 0.145. (C) Histograms of *P*_{2}, the posterior probability of belonging to group *D = 2*, for spinosaurid taxa across 2000 trials of the same dataset as (A) and (B). (D) Histogram of training-set classification accuracy similar to (B), but for rib data. The median training-set classification accuracy is 0.821, and the 95% confidence interval is 0.696 to 0.884, a width of 0.188. Abbreviations: *Sp*, *Spinosaurus*; *Ba*, *Baryonyx*; *Su*, *Suchomimus*; CI, confidence interval; BCa, bias-corrected-and-accelerated method.

To estimate 95% confidence intervals, we performed 2000 bootstrap trials for each dataset and tabulated the training dataset errors. The bias-corrected-and-accelerated (BCa) method was used to assure good results on the confidence interval [27]. Because each case also has 100 random trees, 200,000 results were used for the creation of the confidence intervals (Table 7). The training-set error rate is widely considered to be overoptimistic, and our use of it is thus very conservative.
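For readers unfamiliar with the BCa interval, the sketch below shows the standard construction for a generic statistic. It is a simplification and not our analysis code: it treats each taxon's classification outcome as a fixed correct/incorrect flag, whereas the procedure described above refits pFDA on every resample and every random tree. The flags and function names are hypothetical.

```python
import random
from statistics import NormalDist

def bca_interval(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Bias-corrected-and-accelerated bootstrap interval for stat(data)."""
    rng = random.Random(seed)
    norm = NormalDist()
    n = len(data)
    theta = stat(data)
    boot = sorted(stat([rng.choice(data) for _ in range(n)])
                  for _ in range(n_boot))
    # Bias correction: where the point estimate falls in the bootstrap distribution.
    below = sum(b < theta for b in boot)
    frac = min(max(below / n_boot, 1 / n_boot), 1 - 1 / n_boot)
    z0 = norm.inv_cdf(frac)
    # Acceleration, estimated from leave-one-out (jackknife) values.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = sum(jack) / n
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den else 0.0

    def adjusted(q):
        """Map a nominal quantile to its bias- and skew-adjusted quantile."""
        z = norm.inv_cdf(q)
        return norm.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))

    lo = boot[min(int(adjusted(alpha / 2) * n_boot), n_boot - 1)]
    hi = boot[min(int(adjusted(1 - alpha / 2) * n_boot), n_boot - 1)]
    return theta, lo, hi

def accuracy(flags):
    """Fraction of correct classifications (1 = correct, 0 = incorrect)."""
    return sum(flags) / len(flags)

# Hypothetical outcome flags for 59 training taxa (47 correct, 12 incorrect).
flags = [1] * 47 + [0] * 12
theta, lo, hi = bca_interval(flags, accuracy)
print(round(theta, 3), round(lo, 3), round(hi, 3))
```

Even in this simplified setting, a training set of this size yields an interval many percentage points wide around the point estimate.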

The primary effect of small sample size is that the 95% confidence interval around each point estimate is broad. The importance for the interpretation of pFDA classification results is that they are even more uncertain than one would expect from simply evaluating the training-set error from a single run. Consider dataset ds1: using the classification scheme of Fabbri *et al*. (Table 7 columns “Many vs. *F0D2*”), the median value of the accuracy metric *A* (Eq (6) in S1 Appendix, section 4) is 0.836 (83.6%), which is roughly consistent with their claim that the correct classification rate is “84–85% (femora)” [15]. However, the 95% CI for this value ranges from 77.1% to 89.4%. When we used the better-supported method of comparing the *F0D2* subset to just the *F0D0* group, we found that the median accuracy drops slightly to 79.5%, with the lower bound of the 95% CI falling to 71.8%.

Because half of the time the classifier performs worse than the median accuracy, a better way to characterize the minimum performance is to use the lower bound of the 95% CI, *i*.*e*., the minimum accuracy that allows us to be certain (to 95% confidence) that random effects will be no greater than 5%. Using the method of calculating the equivalent percentage of random trials *P*_{rand} (Materials and methods, Eq (4)), we find that an accuracy of 71.8% is equivalent to saying that the classification is correct 43.6% of the time, and completely random 56.4% of the time (*i*.*e*., *P*_{rand} = 0.564). That is certainly better than a random guess all of the time. However, a method that produced the correct result only about half the time and random results the other half would not normally be considered sufficient scientific evidence to draw a valid conclusion, because it would be too contaminated by random effects. Typically used statistical thresholds for random effects are 5%, an order of magnitude lower.

None of the 95% intervals in Table 7 would result in a value of *P*_{rand}<0.05, which is the heuristic value that would correspond to about 5% random errors (see Assigning confidence to classifications in the Materials and methods section). Indeed, the highest median value found for *A* in the *F0D0* versus *F0D2* case is ds4; the lower bound on the 95% CI in that case is 0.877, corresponding to *P*_{rand} = 0.246, *i*.*e*., equivalent to a situation in which the classification is random 24.6% of the time. We must caution, however, that, as discussed in Materials and methods, *P*_{rand} is not equivalent to a formal *P* value. Instead, it is a heuristic that tells us that the training-set accuracy is equivalent to a case where the result is random with probability *P*_{rand} and correct the rest of the time.
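The conversion between accuracy and *P*_{rand} can be illustrated with a few lines of Python. The formula below is inferred from the value pairs quoted in the text (for example, *A* = 0.718 with *P*_{rand} = 0.564) under the assumption of a two-class problem in which a random call is right half the time; the defining equation is given in Materials and methods and is not reproduced here. The function name is hypothetical.

```python
def p_rand(accuracy):
    """Equivalent fraction of purely random calls in a two-class problem."""
    # Assumed mixture: a fraction p of calls are coin flips (right half the
    # time) and the rest are always right, so accuracy = (1 - p) + p / 2.
    return 2.0 * (1.0 - accuracy)

# Accuracy values quoted in the text.
for a in (0.975, 0.877, 0.718):
    print(a, round(p_rand(a), 3))
```

Because the relation is linear, every percentage point lost in accuracy adds two percentage points to the equivalent random fraction.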

The *P*_{2} value shown in Fig 13 is a metric of classification strength. Whereas *P*_{rand} measures overall classification performance, *P*_{2} is specific to an individual taxon. The bootstrap approach we used generates a distribution of *P*_{2} values (Fig 13C), along with its associated 95% confidence interval.

To assess the impact of variations in the data for the spinosaurid test taxa, we performed a sensitivity analysis, using hypothetical variants of the data points used by Fabbri *et al*. (Table 8 and S4 Fig). To be clear, we do not propose that these modified values are necessarily more correct or believable; the aim of the sensitivity analysis is to see how *P*_{2} for each spinosaurid taxon is affected by changes of various magnitudes to its test data point. Sensitivity analysis of this kind is a routine way to evaluate statistical classifiers.

The variations cover three principal approaches. The first covers the maximum diameter *MD*. The *Spinosaurus* neotype has been estimated at 72% of full size. On allometric grounds, one would expect that *MD* would therefore scale by a factor of 1.64 = (1/0.72)^{1.5}. This factor follows from the assumption that body mass scales as the cube of linear size (*i*.*e*., isometrically), while *MD* scales as the square root of load. This scaling is not relevant for the other spinosaurids in the analysis, many of which are subadults short of maximum size but not juveniles. However, our scanning of an adult *Suchomimus* femur MNBH GAD500 (S3 Fig) did reveal a quite different maximum diameter (146.4 mm) than that reported by Fabbri *et al*. (120.6 mm), so we use our value as a variation.
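The allometric factor quoted above follows from a short calculation, sketched here with the exponents stated in the text (mass proportional to the cube of linear size, *MD* proportional to the square root of load). The function and parameter names are hypothetical.

```python
import math

def md_scale_factor(fraction_of_full_size, mass_exponent=3.0, md_exponent=0.5):
    """Factor by which MD grows when a specimen is scaled up to full size."""
    length_ratio = 1.0 / fraction_of_full_size    # linear size ratio
    mass_ratio = length_ratio ** mass_exponent    # isometric body mass
    return mass_ratio ** md_exponent              # MD ~ load^(1/2)

factor = md_scale_factor(0.72)
print(round(factor, 2))               # 1.64
print(round(math.log10(factor), 3))   # shift along the log10(MD) axis
```

Because the analysis works with log_{10}(*MD*), the adjustment amounts to a constant shift of the data point along that axis rather than a change in *Cg*.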

The values of *Cg* are also varied based on our attempts to replicate the measurements of Fabbri *et al*. using CT scans, as discussed above and shown in Figs 7, 10 and 11. These hypothetical points are of clear relevance—they are the data points for the taxa that would occur if Fabbri *et al*. measured the *Cg* values the same way we did, or if the specimen had a different but plausible value of *MD*.

In the case of the rib data, we did not have alternative measurements and instead considered a hypothetical scaling of *Cg* by 0.9 (equivalent to a 10% reduction). Seeing as the median percent difference in *Cg* found for multiple specimens of the same taxon (Tables 5 and 6) is 18.6%, we consider this 10% variation to be quite conservative. The 25th percentile of the differences in Tables 5 and 6 together is 12.1%, so the hypothetical value of 10% is smaller than at least three-quarters of the individual variations. The results of our analysis on the effects of finite sample size on *P*_{2} are presented in Table 9. Example plots of the bootstrap distribution and confidence intervals for the *Spinosaurus* cases are shown in S5–S7 Figs.

The *P*_{2} results for the original Fabbri *et al*. specimen data for *Baryonyx* (*Baryonyx 0* in ds1, *F0D0* vs. *F0D2* and *Many* vs. *F0D2*, *Baryonyx 0* in ds2, *Many* vs. *F0D2*), and for *Spinosaurus* (*Spinosaurus 0* in ds1, *F0D0* vs. *F0D2* and *Spinosaurus 0* in ds2, *Many* vs. *F0D2*) each fail to meet our heuristic criterion that the lower bound of the 95% CI for *P*_{2} is greater than 0.95. Finite-size effects thus undermine the conclusion that there is strong support predicting “subaqueous foraging” for these taxa. As found in the LDA cases of Fig 12, this could be due to intrinsic variation (*i*.*e*., *d/σ* too small), too few data points, or a combination of both.

The sensitivity analysis of hypothetical variations for the spinosaurids also shows that *P*_{2} is highly sensitive to small changes in the values of data points. In the case of *Baryonyx* and *F0D0* vs. *F0D2*, reducing *Cg* to the median value found in Fig 10 (*i*.*e*., *Baryonyx 2* ds1) causes the lower bound on *P*_{2} to drop from 0.83 to 0.76. Using the low *Cg* value from Fig 10 (*i*.*e*., *Baryonyx 3* ds1) results in that lower bound falling further to 0.59. In the *Many* vs. *F0D2* comparison, the latter case results in a still lower *P*_{2} bound of 0.57.

Similar effects are seen in the ds2 (rib) datasets: *Baryonyx 0* has a lower bound of 0.97 for *F0D0* vs. *F0D2*, but this drops to 0.86 in *Baryonyx 1*, in which *Cg* differs by only about 10%. These results show that relatively small variations in *Cg* can shift the expected value of *P*_{2} from significant to dubious.

Qualitatively similar results hold for the variations in *Spinosaurus* and *Suchomimus*. None of the *Spinosaurus* or *Suchomimus* data points, original or variants, ds1 or ds2, has a lower bound on *P*_{2} ≥ 0.95 in both the *F0D0* vs. *F0D2* and *Many* vs. *F0D2* comparisons. The results for *Spinosaurus* show that the seemingly stringent test of the lower bound of the 95% CI for *P*_{2} ≥ 0.95 can be met, but only for ds2 data and the *F0D0* vs. *F0D2* comparison in the cases of *Spinosaurus 0* and *Spinosaurus 2*, or for ds1 data and the *Many* vs. *F0D2* comparison in the cases of *Spinosaurus 0* and the variants numbered 4, 5 and 9.

These sensitivity results show the risk inherent in classifying an entire taxon on a single datapoint (here, either *MD* or *Cg*). The sensitivity analysis shows that the outcome of pFDA can hinge on the precision to which one or both of the data values are known.

The overall result of our analysis of finite-size effects, which occur due to the relatively small number of training datapoints (relative to the variance in those data), greatly reduces our confidence in the key parameters of training-set classification performance, such as *A* and *P*_{2}. Our sensitivity analysis shows that even small variations in *Cg* (10%, or less in the case of some of the values from our attempted replication) can have a decisive effect on *P*_{2}, and thus on classification.

#### Verification of distribution assumptions for pFDA.

Statistical methods have validity only if they are applied to datasets that match the assumptions used in developing the method. Normal statistical practice is to test those assumptions, but Fabbri *et al*. did not report such tests. Here we perform several simple tests of their data distributions.

The pFDA method is based on FDA and LDA, which were originally derived for multivariate normal distributions (Materials and methods). However, we find that the distributions of (log_{10}(*MD*), *Cg*) points cannot closely follow a normal distribution in the *Cg* axis because normal distributions are defined on the open interval (−∞, ∞), whereas *Cg* is restricted to the half-open interval (0, 1]. The ds1 and ds2 datasets include many values near the top of the range 0.9≤*Cg*<1. As a result, a normal distribution fit to those data will inevitably have a fictitious tail in which the probability density is nonzero for impossible points that have a *Cg*>1. Allocating probability density to illegal values inevitably harms the fit to the distribution elsewhere, even after adjustment for phylogenetic bias.
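The size of this fictitious tail is easy to estimate for any given sample. The sketch below fits a normal distribution to a small set of hypothetical *Cg* values concentrated near the top of the range and reports the probability mass that the fit assigns to impossible values above 1. The values are invented for illustration and are not drawn from ds1 or ds2.

```python
from statistics import NormalDist

# Hypothetical Cg values skewed toward the upper bound (illustration only).
cg_values = [0.62, 0.71, 0.78, 0.84, 0.88, 0.91, 0.93, 0.95, 0.97, 0.98]

# Normal distribution with the sample mean and sample standard deviation.
fit = NormalDist.from_samples(cg_values)

# Probability mass the fitted normal places on impossible values.
mass_above_one = 1.0 - fit.cdf(1.0)
mass_below_zero = fit.cdf(0.0)

print("mean, sd:", round(fit.mean, 3), round(fit.stdev, 3))
print("mass above Cg = 1:", round(mass_above_one, 3))
print("mass below Cg = 0:", mass_below_zero)
```

For a sample like this one, roughly a tenth of the fitted probability lies above the physical limit, which necessarily distorts the density assigned to legitimate values.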

We tested the assumptions about distributions directly by examining the discriminant values generated by the pFDA algorithm, as described in Materials and methods. The discriminant values, which have already been corrected for phylogenetic bias and reduced in dimensionality, are directly used to calculate posterior probabilities. It is therefore an important requirement of the method that their distribution is normal. Fig 14 plots the smoothed kernel distributions that we derived from the discriminant values.

Smoothed kernel distributions for the pFDA discriminants vs. pFDA normal distributions from the (A) femur ds1 and (B) rib ds2 datasets of Fabbri *et al*. [15]. For both datasets, distributions of the discriminants from the *F0D0* and *F0D2* subsets (filled areas) differ from the normal distributions imposed by pFDA (dashed curves), which have different means but the same variance. The distributions from the *F0D0* and *F0D2* groups also overlap considerably in both datasets.

Our results show that the discriminants do *not* closely match a normal distribution, particularly when compared to the normal distributions used by pFDA (dashed lines in Fig 14), which are fit to the variance of both classes simultaneously, rather than the best fit to each class. Fig 14 also shows that the discriminants for two groups show considerable overlap, indicating high classification error in the training sets. The overlap in distributions we find here is consistent with the high degree of overlap we demonstrated in the original datasets (Fig 1C and 1D), and with additional analysis we performed using simple effect-size statistics (S8–S10 Figs). We find that correction for phylogenetic bias does not eliminate the overlap between groups, which is to be expected given the very low values of Pagel’s λ found by Fabbri *et al*. The overlap between the *F0D0* and *F0D2* classes is a clear example of the ecological fallacy (S1 Appendix, section 1).

To quantify the deviation of the discriminants from normality, we made maximum-likelihood estimates of the best-fitting distributions, including standard continuous statistical distributions as well as mixtures of them. Table 10 presents the parameters of the three best-fitting distributions, as well as the normal distributions assumed by pFDA. For a given dataset, pFDA assumes a normal distribution with a different mean for the *F0D0* and *F0D2* subsets, but a standard deviation parameter that is pooled between them. These distributions are plotted in Fig 15.

(A, B) Histograms (light blue) of the discriminants for the *F0D0* and *F0D2* subsets of ds1. The pFDA normal distributions (black dashed curves) were computed with a standard deviation parameter pooled across *F0D0* and *F0D2*. The three best-fitting distributions (Table 10) are shown in red, dark blue, and green. (C, D) Comparable plots for ds2.

Table 11 compares the fits for the distributions of Table 10 to the data by displaying the corrected Akaike information criterion (AICc) weights of each fit. The Akaike weights *W* can be interpreted as the relative likelihood of each model being the best fit [34,152]. We also normalized the weights to generate the relative probability *P*_{dist} of each distribution fitting the data.
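The calculation behind such a comparison can be sketched as follows, using the standard formulas for AICc and Akaike weights. The log-likelihoods, parameter counts, and sample size below are hypothetical placeholders, not the fitted values behind Tables 10 and 11, and the function names are our own.

```python
import math

def aicc(log_likelihood, k, n):
    """Corrected AIC for a model with k parameters fit to n data points."""
    return 2 * k - 2 * log_likelihood + (2 * k * (k + 1)) / (n - k - 1)

def akaike_weights(scores):
    """Relative likelihoods exp(-delta/2) and their normalized weights."""
    best = min(scores.values())
    rel = {name: math.exp(-0.5 * (s - best)) for name, s in scores.items()}
    total = sum(rel.values())
    return rel, {name: r / total for name, r in rel.items()}

# Hypothetical maximized log-likelihoods and parameter counts (illustration only).
fits = {
    "uniform": (-46.0, 2),
    "best-fit normal": (-47.5, 2),
    "logistic": (-48.2, 2),
    "pFDA normal": (-57.0, 2),
}
n_points = 59
scores = {name: aicc(ll, k, n_points) for name, (ll, k) in fits.items()}
rel, p_dist = akaike_weights(scores)
for name in fits:
    delta = scores[name] - min(scores.values())
    print(f"{name:16s} dAICc = {delta:5.2f}  P_dist = {p_dist[name]:.2e}")
```

In this invented example only the first model falls under the ΔAICc<2 threshold, and the worst-fitting model receives a normalized weight several orders of magnitude below the others.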

Using the standard criterion that ΔAICc<2 indicates strong support, we find that a fit to the uniform distribution is the only choice among the four tested that is strongly supported for ds1 *F0D0*. The normal distribution used by pFDA is not supported and has a low Akaike weight and a *P*_{dist} = 2.26×10^{−5}. For ds1 *F0D2*, we found that the pFDA normal, best-fit normal, and uniform distributions all receive strong support under AICc, with the best-fit normal having the strongest support and the pFDA normal the lowest, with *P*_{dist} = 0.29.

The ds2 dataset results show that for the *F0D0* subset, the best-fit normal distribution is the only choice that exhibits strong support under AICc. There is no support for the pFDA normal distribution, which has the lowest *P*_{dist} = 0.01. The ds2 *F0D2* subset is best fit by the best-fit normal, with *P*_{dist} = 0.44, and next the logistic distribution, with *P*_{dist} = 0.39; no other distributions receive strong support, and for the pFDA normal distribution *P*_{dist} = 0.04.

Because pFDA requires both classes in a dataset to fit a normal distribution, the probability that a dataset meets the criteria for comparing classes composed of *F0D0* versus *F0D2* is the product of the *P*_{dist} values for the pFDA normal distributions of each class. For the ds1 dataset, that overall probability is (2.26×10^{−5})×0.29 = 6.50×10^{−6}, which is very low. For the ds2 dataset, the probability is 2.84×10^{−4}. These results show that both datasets have a low probability of being best fit by the pFDA normal distributions.

Our conservative interpretation is that the distributions do not clearly match the distributional assumptions of pFDA. The results also support a less conservative interpretation: that the datasets are insufficient to clearly and convincingly meet the normal distribution requirement of pFDA, but normality equally cannot be ruled out for some cases. Although a normal distribution fit to each class does have some support in ds1, there is no strong support for ds2 (Table 11).

If the *F0D0* and *F0D2* subsets do not have the same variance—a fundamental prerequisite of both LDA and the subset of FDA used by pFDA—then a final determination of the outcome of normality becomes a moot point. We performed three conventional variance equivalence tests, taking care to choose tests that are robust to deviations from a normal distribution. The test results show that we cannot reject the null hypothesis of equal variances for ds1, but we can reject it for ds2 (Table 12).
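One widely used variance test that is robust to nonnormality is the Brown–Forsythe variant of Levene's test, which compares absolute deviations from the group medians. The sketch below is offered only as an illustration of this class of test; it should not be read as a statement of which three tests are reported in Table 12. Because the Python standard library has no *F* distribution, the *p* value here is obtained by a permutation approach, which is a substitution on our part. The data and function names are hypothetical.

```python
import random
from statistics import median

def brown_forsythe(groups):
    """Levene-type W statistic using absolute deviations from group medians."""
    z = []
    for g in groups:
        med = median(g)
        z.append([abs(x - med) for x in g])
    means = [sum(g) / len(g) for g in z]
    n_total = sum(len(g) for g in z)
    grand = sum(sum(g) for g in z) / n_total
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(z, means))
    within = sum((v - m) ** 2 for g, m in zip(z, means) for v in g)
    k = len(groups)
    return (n_total - k) / (k - 1) * between / within

def permutation_p(a, b, n_perm=2000, seed=0):
    """Permutation p value for equal spread (substitute for the F reference)."""
    rng = random.Random(seed)
    observed = brown_forsythe([a, b])
    med_a, med_b = median(a), median(b)
    # Remove location differences first so that only spread is permuted.
    pooled = [x - med_a for x in a] + [x - med_b for x in b]
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if brown_forsythe([pooled[:len(a)], pooled[len(a):]]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Hypothetical discriminant values with clearly unequal spread.
rng = random.Random(3)
group_a = [rng.gauss(0.0, 0.5) for _ in range(50)]
group_b = [rng.gauss(1.0, 1.5) for _ in range(50)]
w_stat, p_value = permutation_p(group_a, group_b)
print(round(w_stat, 2), round(p_value, 4))
```

A small *p* value indicates that the equal-variance assumption shared by LDA and pFDA is untenable for the two groups being compared.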

In reviewing the results of the distribution fit tests in Tables 10 and 11, we were surprised to find that the uniform distribution is the only distribution for the ds1 *F0D0* subset that has support under AICc. A uniform distribution represents a fundamental challenge to the pFDA paradigm because, like FDA and LDA before it, the pFDA method is based on the assumption that each class has a centroid that is the most probable location for the datapoints of that class. A point is classified by its relative distance from the centroid of each class, as weighted by the normal distribution.

If instead the datapoints have a uniform distribution, then there can be no classification at all because the probability of class membership for a uniform distribution is independent of distance from the centroid. Thus, if a uniform distribution accurately describes any one or more of the classes in a pFDA analysis, classification is impossible. The performance of the uniform distribution in the *F0D0* class of the ds1 femoral dataset indicates that this may be true of that dataset (Tables 10 and 11).

To investigate this question further, we followed standard statistical practice in clustering and classification problems and used the Hopkins statistic to assess whether the points exhibit genuine clustering [36,48,49,153]. Unlike the distribution tests for the discriminants, the Hopkins statistic can be directly computed on the original 2-D data points.

The null hypothesis under this test is that the data points are distributed with a uniform random distribution. Failure to reject the null hypothesis implies that any apparent clustering is likely illusory and attributable to random chance. Table 13 shows the results of our application of the standard Hopkins statistic, as well as two variations by Lawson and Jurs [36] and Fernández Pierna and Massart [35], to the *F0D0* and *F0D2* subsets of both ds1 and ds2 (Materials and methods).
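A minimal sketch of the basic Hopkins statistic for two-dimensional points is given below. It assumes the sampling window is the bounding box of the data and uses the convention in which values near 0.5 indicate spatial randomness and values near 1 indicate clustering; published variants, including those cited above, differ in the sampling window and in other details. The point sets and function names are hypothetical.

```python
import math
import random

def nearest_distance(p, points, skip=None):
    """Distance from p to its nearest neighbor in points, ignoring index skip."""
    return min(math.dist(p, q) for i, q in enumerate(points) if i != skip)

def hopkins(points, m, rng):
    """Hopkins statistic for 2-D points; about 0.5 under spatial randomness."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Assumption: sampling window is the bounding box of the data.
    probes = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
              for _ in range(m)]
    # u: squared distances from random probe locations to the nearest data point.
    u = sum(nearest_distance(p, points) ** 2 for p in probes)
    # w: squared nearest-neighbor distances for m randomly chosen data points.
    chosen = rng.sample(range(len(points)), m)
    w = sum(nearest_distance(points[i], points, skip=i) ** 2 for i in chosen)
    return u / (u + w)

rng = random.Random(7)
# Hypothetical point sets: one spatially random, one strongly clustered.
uniform_pts = [(rng.random(), rng.random()) for _ in range(200)]
clustered_pts = [(rng.gauss(c, 0.03), rng.gauss(c, 0.03))
                 for c in (0.2, 0.8) for _ in range(100)]

h_uniform = hopkins(uniform_pts, 30, rng)
h_clustered = hopkins(clustered_pts, 30, rng)
print("uniform:", round(h_uniform, 2), " clustered:", round(h_clustered, 2))
```

Distances are squared because the exponent equals the dimension of the space, so the same function would need a different exponent for data with more than two variables.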

In each case, and for each variation of the Hopkins statistic test, we find that we cannot reject the null hypothesis. The datasets are thus *statistically indistinguishable from a uniform random distribution* in the (log_{10}(*MD*), *Cg*) space under the various Hopkins statistic tests. This is true both for the original, untransformed datasets as well as for those that have been phylogenetically corrected using the same optimal values of Pagel’s λ found by Fabbri *et al*. This result is visualized in Fig 16, which shows as one example a plot of *F0D0* from ds1 compared to a uniformly random distribution that has been clipped to the same convex hull.

(A) Data points for the terrestrial (*F0D0*) group (black dots) in the femoral dataset of Fabbri *et al*. are plotted along with (B) uniform random points (red dots), both clipped to the convex hull enclosing the *F0D0* data. The apparent absence of any nonrandom concentration or clustering of the *F0D0* data is confirmed by statistical tests (Table 13).

The relatively strong performance of mixture distributions in Fig 15 and Table 10 suggests that, for some datasets, the distributions might be bimodal. However, the Hopkins-statistic results suggest (but do not prove) that apparent bimodal behavior in the discriminants may be an artifact of the low data count; strongly bimodal distributions would show clustering under the Hopkins statistic.

A uniformly random distribution of data points may seem strange, but biologically this corresponds to the points in (log_{10}(*MD*), *Cg*) space being equally likely, at least within some range of values in each parameter. This finding does not falsify the bone ballast hypothesis, which holds that some secondarily semiaquatic adapted taxa will have increased *Cg*. That hypothesis does not specify that the absolute increase must be the same for all aquatically adapted taxa. On the contrary, we would expect the increase in *Cg* to depend on multiple ecological constraints, so the increase should be judged relative to terrestrial sister taxa. Different clades of terrestrial taxa may display a diversity of “typical” *Cg* values [54].

Although increased bone density affects buoyancy, *a priori* we would expect that the optimal body buoyancy depends on taxon-specific factors, such as the typical depth at which an animal operates when underwater [50,54]. The range of depths for which a given taxon is optimized depends on its local environment; it need not be the same for all secondarily aquatic taxa.

Thus, the assumption that the distribution of *Cg* must have a peak value and drop off like a normal distribution from that peak value is arbitrary and not part of the bone ballast hypothesis. Our review of the literature found no prior work suggesting any specific features or properties of the statistical distribution of *Cg* across a broad set of clades. It therefore seems entirely possible that the bone ballast hypothesis holds, even though *Cg* is not normally distributed within each class.

Uniformity in the distribution could have arisen accidentally or been enhanced by choices during dataset construction. Various confounding factors might have led to the subsets mixing taxa that do and do not have increased *Cg*, for example. An attempt to sample a diversity of values of *Cg* and *MD*, covering a range of body sizes, could unintentionally bias selection of taxa for the dataset toward greater spread and less clustering, thereby making a uniform distribution more likely.

The method used to correct phylogenetic bias might also have come into play. If clustering of values in (log_{10}(*MD*), *Cg*) space naturally occurs among closely related taxa, then the phylogenetic correction could deemphasize those clusters. While that is the desired effect of removal of phylogenetic bias, it could have the unintended consequence of pushing the dataset toward a uniform random distribution.

Arguably the simplest explanation for the results shown in Table 13 and Fig 16 is low sample size. Datasets that use 49 to 62 points across many clades may simply be too small to show evidence of clustering. Though we offer these general observations, it is beyond the scope of the present study to quantify in detail how the factors above might apply to the datasets under examination.

### Interpretation of lifestyles of extinct “subaqueous foragers” and “nondivers”

When the aim is to discern lifestyle in extinct species, researchers have in the past restricted pFDA training datasets to extant taxa whose lifestyles have been observed. The recent study by Fabbri *et al*. is, to our knowledge, the sole exception to that approach. Their training datasets specify the lifestyle of many extinct species, scoring them as nondiving or as rarely or frequently diving “subaqueous foragers.” For taxa with flippers, such as *Plesiosaurus*, we find this interpretation a reasonable extrapolation based on morphology and paleoenvironment of fossilization. For other taxa, however, such as the extinct hippopotamus *Hexaprotodon garyam*, considerable uncertainty remains regarding its habits in water, as it has fewer secondary aquatic adaptations than the common hippo [132], which forages in terrestrial environments.

Our examination of the datasets found that they do not similarly extrapolate the lifestyle of extinct species that have been long interpreted as fully terrestrial. A large subset of such taxa, 37 nonavian dinosaurs, were scored as nonflying reptiles with “unknown” diving capacity (*F =* 0, *D =* unknown). All nonavian dinosaurs in the analysis were thus treated as “unknown” as to diving status, including *Stegosaurus*, despite its elephantine feet [154]; *Oviraptor*, which is known to have lived and nested in xeric habitats far from any shoreline [155]; and *Alamosaurus*, which had columnar limbs and was discovered in inland terrestrial deposits [156].

We find the scoring method to be arbitrary, as it scored in advance nearly all subaqueous foragers yet remained blind to well-supported habits of nonspinosaurid nonavian dinosaurs, all of which have long been regarded as fully terrestrial [157]. We find the categorization of these taxa as “unknown” for diving to be a major reason that the terrestrial subset *F0D0* consists almost entirely of extant species.

## Conclusions

The purpose of our study was twofold. First, we sought to contribute to the ongoing debate about the lifestyle of spinosaurids by carefully reexamining the data and methods employed by Fabbri *et al*. in their recent study of the question [15]; we did so at multiple levels and also attempted to replicate some of the measurements and results they published. Second, and perhaps more important, we aimed to identify general issues with the use of pFDA and bone microanatomy metrics such as *Cg* in paleobiology in order to guide future applications of this method and research into ways to improve its utility.

### Conclusions about the results reported by Fabbri *et al.*

The results of our reexamination show that the data and methods of Fabbri *et al*. do not support their conclusion that *Spinosaurus aegyptiacus* and *Baryonyx walkeri* were fully submerged “subaqueous foragers,” whereas *Suchomimus tenerensis* was not. We find that the datasets, groups, and classes they used to compare habitual fully submerged predation to all other lifestyles were constructed in such a way that they cannot be used for accurate classification. The classes show extensive overlap with no division boundary (Fig 1), mix different kinds of foraging behavior, reflect imbalanced choices of extant and extinct taxa (Table 4), include redundant specimens for a few selected taxa, and show a bias toward inclusion of small-bodied exemplars and omission of large-bodied terrestrial taxa more comparable to spinosaurids. We show that in their secondary analysis, Fabbri *et al*. used anatomical criteria to cull “graviportal” and “deep-diving” taxa and then applied those criteria inconsistently, thereby introducing a selection bias in *Cg*.

We identified numerous problems with their choice and use of *Cg* and maximum bone diameter *MD* as the sole variables in a pFDA analysis. We show that *MD* should not have been included as a variable because the ANOVA results reported by Fabbri *et al*. show that *MD* substantially reduces the explanatory power of the model. We find a worrisome disparity in *Cg* between extinct and extant taxa in their datasets (Table 3), which, coupled with extreme differences in the number of extinct and extant taxa in each class, could bias classification and undermine an assumption of the study. We document many examples of individual variation in *Cg* measured from both extinct (Table 5 and Fig 5) and extant (Table 6) taxa and find that the degree of such variation could account for a majority of the differences Fabbri *et al*. reported between their classes. We describe several biological and taphonomic factors that could lead to such variations (Figs 8 and 9), not only among individual animals but even within single bones (Fig 10). Our attempt to replicate specific measurements of *Cg* from spinosaurid fossils reported by Fabbri *et al*. demonstrates that measurement variation (Figs 7 and 10) and error (Fig 11) also contribute uncertainty that they did not account for in their analysis.

Our examination of how Fabbri *et al*. applied pFDA also reveals several serious statistical problems. Sample sizes are of crucial importance to any statistical analysis, and we find that the datasets and subsets in their study were too small, given the observed differences in *Cg*, to demonstrate results that meet broadly accepted standards of statistical significance (Fig 14 and Table 7), especially when considering issues of biological and measurement variation noted above (Tables 8 and 9). Finally, but perhaps most conclusively, we analyzed the statistical distributions of the data used by Fabbri *et al*., which must conform to normal distributions of equal variance to meet the prerequisites of the pFDA method. We demonstrate that the best-fit distributions to the datasets are not all normal (Figs 14 and 15, Tables 10 and 11), and that their variances are not equal (Table 12). Remarkably, our tests reveal that the datasets are not statistically distinguishable from a uniform random distribution (Table 13 and Fig 16), and we consider a number of plausible factors that could have caused the data to exhibit such scatter.

Many of the results above would be sufficient grounds on their own to question the validity of the conclusions Fabbri *et al*. made about spinosaurid behavior. The unusual constellation of so many different problems allows us to confidently dismiss those findings.

### Conclusions about spinosaurid ecology and lifestyle

Our study did not aim to determine the ecology and lifestyle of *Spinosaurus* and its relatives *Suchomimus* and *Baryonyx*, and our results by themselves do not settle the debate or add new independent lines of evidence about this question. Fabbri *et al*. have highlighted the high *Cg* values in *Spinosaurus*, consistent with the 2014 observation that the Kem Kem specimen has dense, “nearly solid” femora [148]. They have shown that *Baryonyx* also has moderately high *Cg*. We show that in both cases there is some uncertainty; an isolated femur fragment attributed to *Spinosaurus* has a medullary cavity and a much lower *Cg*. With so few specimens of these taxa discovered, conclusions about what is typical are speculative at best.

We find it very unlikely that the high *Cg* values observed in these taxa result from the bone ballast hypothesis. Multiple independent lines of evidence have shown that *Spinosaurus* was unsinkable and too unstable to swim or float, due to its extensive axial pneumaticity [8,14]. The very large body mass of this species offers one obvious alternative explanation, as the correlation of *Cg* to body mass has been well documented in the literature. But the variable infilling observed in the few available specimens greatly limits our ability to draw broad conclusions. Additional lines of evidence [8,13,14] independent from those covered in this study contradict the aquatic pursuit predator hypothesis for *Spinosaurus* [11].

In the present work, we document several ways in which the hypothesis is inconsistent with literature on the bone ballast hypothesis, which tells us that high skeletal *Cg* is more common in slow swimmers or bottom walkers [50,54], including sirenians, sea otters, and hippos. The aquatic pursuit predator hypothesis is also inconsistent with the finding that diving birds, which include fast pursuit predators both with and without flight, tend to exhibit reduced or nonexistent postcranial pneumaticity [101], neither of which has been observed in *Spinosaurus* [14]. But those points are only suggestive; it is the biomechanical evidence of buoyancy, drag, and stability [8,13,14] that together makes the strongest case against the aquatic pursuit predator hypothesis.

In the absence of new ideas or new specimens, we conclude that the best current evidence for *Spinosaurus* ecology and lifestyle is marshalled by the most recent papers [8,13,14] that promote the idea of *Spinosaurus* as a semiaquatic piscivore but not an aquatic pursuit predator, as reviewed above in the overview of prior studies on *Spinosaurus* lifestyle. Similarly, the lifestyle of *Baryonyx* [2] and *Suchomimus* [3] is best covered by the earlier work on *Baryonyx* [4].

### Conclusions about the use of *Cg* and pFDA in paleobiology

The bone ballast hypothesis has been considered for many decades, and it remains an important anatomical observation about skeletal adaptation to lifestyle. We found nothing in our study to contradict the idea and evidence that increased bone density (and its proxy, increased *Cg*) are found in taxa with certain specific semiaquatic or aquatic lifestyles, especially slow-swimming taxa such as aquatic herbivores, predators of shellfish, or similarly sessile prey [50,54]. We find the evidence that fast swimmers and active pursuit predators have generally lower bone density and *Cg* [50,54] to be consistent with the bone ballast hypothesis as well.

However, the relationship between *Cg* and semiaquatic or fully aquatic taxa via the bone ballast hypothesis is complex, and the hypothesis has its limits [52]. Bone density and *Cg* are potentially increased by other attributes and lifestyles of a taxon. Of the many potential confounding factors, the most relevant to spinosaurids is large body size, which has been associated with high *Cg* independent of semiaquatic adaptations [50,52,54,107,132]. The example of extant hippos shows that we may not be able to distinguish between these two effects [132].

Bone ballast measured via *Cg* sampled only from long bones or ribs may be sufficiently diagnostic for reptiles or mammals, but such data are not sufficient for classifying birds or nonavian dinosaurs, which have extensive pneumaticity [99]. In these taxa, the negative buoyancy effect of dense ribs or femora is at least partly offset by the positive buoyancy of air sacs found in birds’ paraxial pneumaticity. New techniques, including 3-D models that include flesh and air sacs, may be needed to supplement data on bone microanatomy metrics.

Our review of the literature investigating the bone ballast hypothesis found that this has *not* previously been proposed as a universal rule across amniotes. Prior studies instead qualify it as a phenomenon found only in specific niches [50,54]. Despite this, Fabbri *et al*. made a tantalizing assertion in multiple places in their study that their findings do hold across all amniotes, or alternatively among all amniotes except graviportal and deep-diving taxa. Their attempt to marshal statistical support for such a near-universal rule may have resulted in some of the problematic choices of data and technique that we found in this work, including the selection of taxa that are clearly inappropriate for direct comparison to spinosaurids, such as taxa without legs or of tiny body size.

Outside of a few well-known scaling relationships in macroecology [158–160], relationships this broad are rare. Moreover, the extensive literature on the bone ballast hypothesis, including studies [50] referenced by Fabbri *et al*., clearly describes many exceptions and alternatives that confound the formulation of a single universal rule. A lesson for future studies is that great care must be taken when drawing sweeping conclusions, particularly if they are contradicted by the available literature or miss large groups that are central to the analysis. Another lesson is that comparative statistical analysis aimed at a specific group (“carnivorous dinosaurs” in the case of Fabbri *et al*.) cannot be done properly by using broad amniote-wide databases. Instead, datasets must be tailored to the details of the questions being asked in the analysis. This includes having direct biomechanical relevance to the question at hand; tiny shrews and voles, animals that lack legs, and herbivores arguably have little in common with a giant predatory dinosaur.

In our study, we considered whether the analytical approach adopted by Fabbri *et al*. could be improved by certain changes that would render its results valid. We conclude that it cannot, for multiple reasons. Perhaps the most important of these is that bone microanatomy data on ribs and femora are fundamentally unable to capture the buoyancy effect of the pneumaticity found in the vertebral column in the spinosaurids. Put another way, the bone ballast hypothesis was not formulated for dinosaurs in which the ballast effect is dominated by pneumaticity, which cannot be assessed using rib and femoral data alone.

Including pneumatized taxa in an analysis of the bone ballast hypothesis is not simple and should be the topic of future research. To understand the net influence of different buoyancy effects, it is not enough to analyze bone density (*Cg*) via a simple regression; one also needs to perform a detailed study of the flesh, bone, and air-sac mass in each taxon—ideally employing a full digital model—as opposing factors that must be quantitatively compared. Such a study has now been published for *Spinosaurus*, and the results are overwhelmingly incompatible with an ability to submerge or swim underwater [14]. To repeat a study of this kind for *Baryonyx* or *Suchomimus*—let alone broad datasets of other taxa for comparison—is beyond the scope of the present study but would be valuable, though in the case of *Baryonyx* it might require more complete skeletal material than currently exists.

While the search for new specimens continues, future research is needed to determine whether, or how much, *Cg* varies among individual specimens, through ontogeny, across different skeletal elements, or even between different cross sections of the same bone. We were unable to find any foundational studies that have collected sufficient data to accurately characterize the expected variation in all of these dimensions. However, the variations we did find in the literature and in our own replication study show that variation poses a significant risk to studies that depend on quantitative differences in *Cg*. Future studies should take variation into careful account, and prior studies that performed regression, ANOVA, or LDA on datasets that represent each taxon by a single *Cg* value may need to be revisited. The subjective factors that complicate replication of *Cg* measurements—and might have led to the anomalous values and selection bias that we found in the Fabbri *et al*. datasets—also deserve more research.

The question of whether fossil specimens have a systematically different distribution of *Cg* than extant taxa do must also be resolved, with attention to factors such as matrix infilling and damage repair. The Fabbri *et al*. femoral dataset and the prior studies from which it was compiled are possibly the largest collection of samples ever assembled for which someone has compared the extant and extinct taxa *Cg* values statistically, as done here. Although the dataset was not created to explore differences in *Cg* measured from extinct and extant taxa, a strong bias toward extinct specimens is evident at the high end of the *Cg* range. Whether that is a real effect or an artifact of selection bias should be investigated.

We have identified other pitfalls that complicate the use of the pFDA method in paleobiology. The method does not test whether input datasets are suitable, for example. The present study seems to be the first to systematically test the normality assumption on the discriminants and to assess the effects of the training dataset size on the results. Our results show that sample size, a largely neglected factor in previous applications of LDA and pFDA in biology, affects classification in ways that are important to quantify. Our Monte Carlo analyses illustrate important principles and metrics that can be applied to evaluate whether datasets are large and normal enough to support the use of pFDA to make inferences.
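The dependence of classification results on training-set size can be illustrated with a toy simulation. The sketch below is not the Monte Carlo design used in this study and does not implement pFDA; it substitutes a one-dimensional, equal-prior linear discriminant (a threshold midway between the two class means) applied to two simulated normal classes. The function names, the class separation of one standard deviation, and the sample sizes are all illustrative assumptions.

```python
import random
import statistics


def training_accuracy(n_per_class, separation, rng):
    """Fit a midpoint threshold to two simulated unit-variance normal
    classes and return the fraction of training points classified correctly.

    Stand-in for a discriminant analysis: in one dimension, with equal
    priors and equal variances, LDA reduces to this midpoint rule.
    """
    low = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    high = [rng.gauss(separation, 1.0) for _ in range(n_per_class)]
    mean_low, mean_high = statistics.fmean(low), statistics.fmean(high)
    threshold = (mean_low + mean_high) / 2.0
    # Orientation follows the sample means, as a fitted classifier's would.
    sign = 1.0 if mean_high >= mean_low else -1.0
    correct = sum(sign * (x - threshold) < 0 for x in low)
    correct += sum(sign * (x - threshold) >= 0 for x in high)
    return correct / (2 * n_per_class)


def accuracy_spread(n_per_class, separation=1.0, trials=500, seed=0):
    """Mean and standard deviation of training-set accuracy over many
    simulated training sets of the same size (hypothetical settings)."""
    rng = random.Random(seed)
    accs = [training_accuracy(n_per_class, separation, rng) for _ in range(trials)]
    return statistics.fmean(accs), statistics.stdev(accs)


if __name__ == "__main__":
    for n in (10, 25, 50, 200):
        mean_acc, sd_acc = accuracy_spread(n)
        print(f"n per class = {n:4d}: accuracy = {mean_acc:.3f} +/- {sd_acc:.3f}")
```

In a simulation of this kind the scatter in accuracy from one training set to the next shrinks as the classes grow, which is one way to gauge how much of a reported accuracy could be due to sample size alone.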

Because pFDA does not produce *P* values or other quantitative estimates of statistical error, the results it produces must be interpreted with caution. In this study, we examined several metrics, including training-set accuracy *A*, the *P*_{2} probability of class membership, and a heuristic concept we introduced as *P*_{rand}, the equivalent probability of random classification versus correct classification. We show that 75% training-set accuracy *A* is equivalent to *P*_{rand} = 0.5, meaning a classifier that is random half the time and correct the other half. *P*_{rand} offers one way to apply conventional thresholds of statistical confidence and significance; the 95% confidence level, for example, is heuristically equivalent to *P*_{rand} ≤ 0.05. But more work must be done to develop formal mathematical metrics of classification performance.
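The correspondence between *A* and *P*_{rand} quoted above can be reproduced with a short calculation. The sketch below assumes a two-class problem in which a random guess is right half the time, so that *A* = (1 − *P*_{rand}) + *P*_{rand}/2. This is one reading that is consistent with the 75% and 0.5 figures given in the text; the function names and the `n_classes` parameter are illustrative and are not taken from the code released with the study.

```python
def accuracy_from_p_rand(p_rand, n_classes=2):
    """Expected accuracy of a classifier that guesses at random with
    probability p_rand and is otherwise always correct.

    Assumption (not from the source): a random guess among n_classes
    equally likely classes is correct with probability 1 / n_classes.
    """
    if not 0.0 <= p_rand <= 1.0:
        raise ValueError("p_rand must lie in [0, 1]")
    return (1.0 - p_rand) + p_rand / n_classes


def p_rand_from_accuracy(accuracy, n_classes=2):
    """Invert accuracy_from_p_rand: the share of random guesses implied
    by an observed accuracy under the same assumption."""
    chance = 1.0 / n_classes
    if not chance <= accuracy <= 1.0:
        raise ValueError("accuracy must lie between chance level and 1")
    return (1.0 - accuracy) / (1.0 - chance)


if __name__ == "__main__":
    for a in (0.75, 0.90, 0.975):
        print(f"A = {a:.3f}  ->  P_rand = {p_rand_from_accuracy(a):.3f}")
```

Under this reading, a *P*_{rand} of 0.05 or less would require a two-class training-set accuracy of at least 97.5%.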

Future pFDA studies that use datasets for which the method is better suited may obtain *P*_{rand} values closer to 0.05 and may be able to draw clear decision boundary lines like those of Fig 1C and 1D. Statistical research is needed to determine what criteria for statistical significance are appropriate for pFDA studies. Until these questions are answered, it will remain difficult for paleontologists to interpret whether pFDA results have the same degree of statistical power and rigor as other statistical methods.

## Supporting information

### S1 Fig. Correlation between global bone compactness (*Cg*) and femoral maximum diameter (*MD*) in Sauropterygia.

Fabbri *et al*. [15] include data from six specimens of *Nothosaurus*, two of the related nothosaur *Simosaurus*, and one related pachypleurosaur, *Serpicosaurus*. Each point is labeled with the identifier used in the Fabbri *et al*. datasets. A strong inverse correlation is shown between global bone compactness (*Cg*) and femoral *MD*, which is commonly used as a proxy for body size. The blue regression line includes only the data points for *Nothosaurus*; the black regression line includes all taxa in the plot. Regression parameters are shown in the inset table. The coefficient of determination is extremely high (*R*^{2} = 0.96) for *Nothosaurus* alone but still very high (*R*^{2} = 0.84) for these sauropterygian taxa pooled together. The source of this strong trend is unknown to us; it could be a real biological effect, or a data artifact, or some combination thereof. If extrapolated, these trends would have *Cg* = 0 at *MD* = 103 mm for *Nothosaurus* and *MD* = 108 mm for all taxa, which is biologically impossible.

https://doi.org/10.1371/journal.pone.0298957.s001

(TIF)

### S2 Fig. *Spinosaurus aegyptiacus* subadult femur retaining medullary cavity (CMN 41869).

(A) Proximal half of the right femur in medial view. (B) Medullary cavity in ventrolateral view. (C) Bone lining the medullary cavity. Abbreviations: at, anterior trochanter; cb, cancellous bone; ft, fourth trochanter; hd, head; mc, medullary cavity.

https://doi.org/10.1371/journal.pone.0298957.s002

(TIF)

### S3 Fig. *Suchomimus tenerensis* juvenile femoral midshaft thin section.

Thin section of the midshaft of a right femur of a juvenile individual (femur length 55.3 cm; MNBH GAD72).

https://doi.org/10.1371/journal.pone.0298957.s003

(TIF)

### S4 Fig. Variant datapoints for the femoral ds1 dataset from Table 8.

The hypothetical spinosaurid datapoints for femoral data (ds1) from Table 8 are plotted by *MD* and *Cg* to illustrate the effect of the variations. Numbers correspond to the variation suffixes in Table 8; the original datapoints used by Fabbri *et al*. are labelled with *Sp* for *Spinosaurus*, *Su* for *Suchomimus*, and *Ba* for *Baryonyx*. Points are colored according to the legend. Arrows indicate how far each of the variations is displaced in *MD* and *Cg* from the original data point.

https://doi.org/10.1371/journal.pone.0298957.s004

(TIF)

### S5 Fig. Bootstrap distributions of *P*_{2} for *Spinosaurus* sensitivity cases 0–3.

Bootstrap analysis was used with 2000 trials to predict *P*_{2}, the posterior probability of *Spinosaurus* belonging to the class of “subaqueous foragers.” Each bootstrap trial contains the results of 100 random trees, so there are a total of 200,000 predictions. Histograms show the distribution of *P*_{2} for the *Spinosaurus* sensitivity analysis variations 0–3 of Table 8. Vertical gray lines and numbers along the top of each chart show the medians and their 95% CI, as determined by the BCa bootstrap confidence interval algorithm. Vertical red lines show the 2.5%, 50%, and 97.5% quantiles of the bootstrap distributions of *P*_{2}. In a case where the bootstrap distribution has the same median as the original dataset prior to bootstrapping, there would be no bias. In general, however, bootstrapping can introduce bias. The BCa bootstrap algorithm adjusts the bias and also corrects for nonconstant variance. As a result, the BCa 95% CI does not always line up with the quantiles of the bootstrap distribution (*i*.*e*., gray lines and red lines may not overlap).

https://doi.org/10.1371/journal.pone.0298957.s005

(TIF)
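The BCa interval mentioned in the S5 Fig caption can be sketched in a few lines. The code below is a from-scratch illustration of the textbook bias-corrected and accelerated recipe, written with the Python standard library only; it is not the implementation used to produce the figures, and the function name `bca_interval`, the tie-handling rule, and the synthetic input values are assumptions made for the example.

```python
import random
import statistics
from statistics import NormalDist


def bca_interval(data, stat=statistics.median, n_boot=2000, level=0.95, seed=0):
    """Return a (low, high) BCa bootstrap interval for stat(data)."""
    data = list(data)
    n = len(data)
    rng = random.Random(seed)
    norm = NormalDist()
    theta_hat = stat(data)

    # 1. Bootstrap replicates: recompute the statistic on resamples drawn
    #    with replacement, sorted so that quantiles are simple lookups.
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))

    # 2. Bias correction z0: where the original estimate sits within the
    #    replicate distribution (ties counted as one half, an assumption).
    below = sum(r < theta_hat for r in reps)
    ties = sum(r == theta_hat for r in reps)
    share = (below + 0.5 * ties) / n_boot
    share = min(max(share, 1.0 / (n_boot + 1)), n_boot / (n_boot + 1.0))
    z0 = norm.inv_cdf(share)

    # 3. Acceleration from leave-one-out (jackknife) estimates; this is
    #    the term that corrects for nonconstant variance.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    centre = statistics.fmean(jack)
    num = sum((centre - j) ** 3 for j in jack)
    den = 6.0 * sum((centre - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    # 4. Shift the nominal tail probabilities, then read the endpoints
    #    off the sorted replicates.
    def endpoint(alpha):
        z = norm.inv_cdf(alpha)
        q = norm.cdf(z0 + (z0 + z) / (1.0 - accel * (z0 + z)))
        return reps[min(n_boot - 1, max(0, int(q * n_boot)))]

    tail = (1.0 - level) / 2.0
    return endpoint(tail), endpoint(1.0 - tail)


if __name__ == "__main__":
    demo_rng = random.Random(1)
    # Synthetic stand-in values in [0, 1]; these are NOT P2 values from the study.
    sample = [demo_rng.betavariate(5, 3) for _ in range(61)]
    low, high = bca_interval(sample)
    print(f"median = {statistics.median(sample):.3f}, 95% BCa CI = ({low:.3f}, {high:.3f})")
```

Because steps 2 and 3 move the tail probabilities away from 2.5% and 97.5%, the endpoints returned here need not coincide with the plain percentile quantiles of the replicates, which is the behavior the caption describes for the gray and red lines.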

### S6 Fig. Bootstrap distributions of *P*_{2} for *Spinosaurus* sensitivity cases 4–7.

Bootstrap analysis was used with 2000 trials to predict *P*_{2}, the posterior probability of *Spinosaurus* belonging to the class of “subaqueous foragers.” Each bootstrap trial contains the results of 100 random trees, so there are a total of 200,000 predictions. Histograms show the distribution of *P*_{2} for the *Spinosaurus* sensitivity analysis variations 4–7 of Table 8. Vertical gray lines and numbers along the top of each chart show the medians and their 95% CI, as determined by the BCa bootstrap confidence interval algorithm. Vertical red lines show the 2.5%, 50%, and 97.5% quantiles of the bootstrap distributions of *P*_{2}. In a case where the bootstrap distribution has the same median as the original dataset prior to bootstrapping, there would be no bias. In general, however, bootstrapping can introduce bias. The BCa bootstrap algorithm adjusts the bias and also corrects for nonconstant variance. As a result, the BCa 95% CI does not always line up with the quantiles of the bootstrap distribution (*i*.*e*., gray lines and red lines may not overlap).

https://doi.org/10.1371/journal.pone.0298957.s006

(TIF)

### S7 Fig. Bootstrap distributions of *P*_{2} for *Spinosaurus* sensitivity cases 8 and 9.

Bootstrap analysis was used with 2000 trials to predict *P*_{2}, the posterior probability of *Spinosaurus* belonging to the class of “subaqueous foragers.” Each bootstrap trial contains the results of 100 random trees, so there are a total of 200,000 predictions. Histograms show the distribution of *P*_{2} for the *Spinosaurus* sensitivity analysis variations 8 and 9 of Table 8. Vertical gray lines and numbers along the top of each chart show the medians and their 95% CI, as determined by the BCa bootstrap confidence interval algorithm. Vertical red lines show the 2.5%, 50%, and 97.5% quantiles of the bootstrap distributions of *P*_{2}. In a case where the bootstrap distribution has the same median as the original dataset prior to bootstrapping, there would be no bias. In general, however, bootstrapping can introduce bias. The BCa bootstrap algorithm adjusts the bias and also corrects for nonconstant variance. As a result, the BCa 95% CI does not always line up with the quantiles of the bootstrap distribution (*i*.*e*., gray lines and red lines may not overlap).

https://doi.org/10.1371/journal.pone.0298957.s007

(TIF)

### S8 Fig. One-dimensional and two-dimensional effect size statistics on the Fabbri *et al*. and corrected training datasets, compared to original and remeasured spinosaurid values.

(A) The plots use bars to show the 95% confidence intervals and 95% single-prediction intervals for the mean value of *Cg* in the femoral and rib training sets. Within each training set, the intervals for the *F0D0* group are shown as red (95% CI) and pink (95% prediction) bars; the intervals for *F0D2* are shown in blue and cyan, respectively. The values for spinosaurid taxa used in Fabbri *et al*. [15] are marked with solid black markers. The confidence and prediction intervals for the mean provide a simple one-dimensional view of the overlap in distributions between the *F0D0* and *F0D2* groups. In the femoral dataset, the 95% confidence interval of the mean of *F0D2* lies entirely within the prediction interval of *F0D0*, showing that even the mean *Cg* in *F0D2* would be plausible as a member of *F0D0*. In the rib dataset, the 95% CI for the mean of *F0D2* is mostly within the prediction interval for *F0D0*. The 95% CI for the mean of *F0D0* overlaps with the prediction interval of *F0D2* for femoral data and falls entirely within the interval for rib data. In each case we see that the average value of the *Cg* distribution of one group (say, *F0D2* divers) is plausible as a member of the opposite group (the *F0D0* nondivers) and vice versa. The overlap in *Cg* for the groups occurs not only at the edge cases of a group but also extends to the group averages. (B) Linear regressions (performed without phylogenetic bias adjustment) of (*Cg*, log_{10}(*MD*)) are plotted with their 95% prediction intervals for the *F0D0* and *F0D2* groups of femoral and rib datasets. Outputs for *R*^{2} from the lm() function in R are reported. The two-dimensional intervals show that the overlap evident in the *Cg* plots of (A) is also present when diameter is considered. 
The regression results show that these two-dimensional regressions have extremely weak correlation and have somewhat minor impact on our interpretations, although they often produce *F0D0* 95% prediction intervals even closer to *Spinosaurus* values in the bivariate space. The weak correlations support the conclusion by both Fabbri *et al*. and ourselves that including bone diameter likely does not improve the predictive ability of the model.

https://doi.org/10.1371/journal.pone.0298957.s008

(TIF)
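For reference, the two kinds of interval compared in panel A of S8 Fig differ in a single term. The caption does not state which formulas were used, so the expressions below assume the standard normal-theory forms for a group of *n* values with sample mean $\bar{x}$ and sample standard deviation $s$:

$$
\text{95\% CI for the mean:}\quad \bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}
\qquad\qquad
\text{95\% prediction interval:}\quad \bar{x} \pm t_{0.975,\,n-1}\; s\,\sqrt{1+\frac{1}{n}}
$$

Under these forms the prediction interval is wider than the confidence interval by a factor of $\sqrt{n+1}$. Adding specimens narrows the confidence interval toward the mean, but the prediction interval stays close to $\pm\, t\,s$, which is why the mean of one group can fall inside the prediction interval of the other even when the two means differ.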

### S9 Fig. Quantile-quantile plots of pFDA discriminants from dataset ds1 subsets *F0D0* and *F0D2*.

In these panels, the quantiles of the discriminant distributions versus those of a normal or uniform distribution (heavy black points) can be compared to plots of the normal or uniform distribution with itself (thin dotted lines).

https://doi.org/10.1371/journal.pone.0298957.s009

(TIF)

### S10 Fig. Quantile-quantile plots of pFDA discriminants from dataset ds2 subsets *F0D0* and *F0D2*.

In these panels, the quantiles of the discriminant distributions versus those of a normal or uniform distribution (heavy black points) can be compared to plots of the normal or uniform distribution with itself (thin dotted lines).

https://doi.org/10.1371/journal.pone.0298957.s010

(TIF)

### S1 Table. Settings for computed-tomographic scans of each of the specimens described.

Links are provided to Morphosource records containing CT scans created for this study.

https://doi.org/10.1371/journal.pone.0298957.s011

(DOCX)

### S1 File. Femur compactness all.

This data file, which is processed by the R script of Fabbri *et al*., contains the full femoral dataset for that study. We denote this dataset ds1 in our study.

https://doi.org/10.1371/journal.pone.0298957.s012

(CSV)

### S2 File. Rib compactness all.

This data file, which is processed by the R script of Fabbri *et al*., contains the full rib dataset for that study. We denote this dataset ds2 in our study.

https://doi.org/10.1371/journal.pone.0298957.s013

(CSV)

### S3 File. Femur compactness no graviportals no pelagics.

This data file, which is processed by the R script of Fabbri *et al*., contains a femoral dataset that was reduced by elimination of selected taxa deemed graviportal or deep-diving. We denote this dataset ds3 in our study.

https://doi.org/10.1371/journal.pone.0298957.s014

(CSV)

### S4 File. Rib compactness no graviportals_no pelagics.

This data file, which is processed by the R script of Fabbri *et al*., contains a rib dataset that was reduced by elimination of selected taxa deemed graviportal or deep-diving. We denote this dataset ds4 in our study.

https://doi.org/10.1371/journal.pone.0298957.s015

(CSV)

### S1 Appendix. Supporting information on methodological issues.

The Appendix provides additional details, figures, and equations elaborating on our methods and results, in five sections: (1) the ecological fallacy; (2) ROC curves and whether *P*_{2} > 0.5 is the best threshold; (3) *P* values and the *p* < 0.05 threshold; (4) classification performance metrics; and (5) a bug or misunderstanding in pFDA codes.

https://doi.org/10.1371/journal.pone.0298957.s016

(DOCX)

## Acknowledgments

We thank Wayt Gibbs for editorial assistance, Lauren Conroy for assistance with several figures, Jordan Mallon for assistance in obtaining images and CT scans of specimens in his care, Nicole Klein for providing thin-section images, Cem Ozen for programming assistance, and 3ric Johanson for assistance in publishing our code and data. We thank David Hone and two anonymous reviewers for their helpful suggestions.

## References

- 1. Stromer E. Ergebnisse der Forschungreisen Prof. E. Stromers in den Wüsten Ägyptens II. Wirbeltierreste der Baharîje-Stufe (unterstes Cenoman) 3. Das Original de Theropoden *Spinosaurus aegyptiacus* nov. gen., nov. spec. Abh Konglich Bayer Akad Wiss Math-Phys Cl. 1915;28: 1–32.
- 2. Charig AJ, Milner AC. *Baryonyx*, a remarkable new theropod dinosaur. Nature. 1986;324: 359–361. pmid:3785404
- 3. Sereno PC, Beck AL, Dutheil DB, Gado B, Larsson HCE, Lyon GH, et al. A long-snouted predatory dinosaur from Africa and the evolution of spinosaurids. Science. 1998;282: 1298–1302. pmid:9812890
- 4. Charig AJ, Milner AC. Baryonyx walkeri, a fish-eating dinosaur from the Wealden of Surrey. Bull-Nat Hist Mus Geol Ser. 1997;53: 11–70.
- 5. Ibrahim N, Sereno PC, Dal Sasso C, Maganuco S, Fabbri M, Martill DM, et al. Semiaquatic adaptations in a giant predatory dinosaur. Science. 2014;345: 1613–1616. pmid:25213375
- 6. Hone DWE, Holtz TR Jr. A century of spinosaurs—a review and revision of the Spinosauridae with comments on their ecology. Acta Geol Sin—Engl Ed. 2017;91: 1120–1132.
- 7. Gimsa J, Sleigh R, Gimsa U. The riddle of *Spinosaurus aegyptiacus*’s dorsal sail. Geol Mag. 2016;153: 544–547.
- 8. Henderson DM. A buoyancy, balance and stability challenge to the hypothesis of a semi-aquatic *Spinosaurus* Stromer, 1915 (Dinosauria: Theropoda). PeerJ. 2018;6: e5409. pmid:30128195
- 9. Arden TMS, Klein CG, Zouhri S, Longrich NR. Aquatic adaptation in the skull of carnivorous dinosaurs (Theropoda: Spinosauridae) and the evolution of aquatic habits in spinosaurids. Cretac Res. 2019;93: 275–284.
- 10. Hone DWE, Holtz TR Jr. Comment on: Aquatic adaptation in the skull of carnivorous dinosaurs (Theropoda: Spinosauridae) and the evolution of aquatic habits in spinosaurids. 93: 275–284. Cretac Res. 2022;134: 104152.
- 11. Ibrahim N, Maganuco S, Dal Sasso C, Fabbri M, Auditore M, Bindellini G, et al. Tail-propelled aquatic locomotion in a theropod dinosaur. Nature. 2020;581: 67–70. pmid:32376955
- 12. Gimsa J, Gimsa U. Contributions to a discussion of *Spinosaurus aegyptiacus* as a capable swimmer and deep-water predator. Life. 2021;11: 889. pmid:34575038
- 13. Hone D, Holtz TR Jr. Evaluating the ecology of *Spinosaurus*: shoreline generalist or aquatic pursuit specialist? Palaeontol Electron. 2021;24: a03.
- 14. Sereno PC, Myhrvold N, Henderson DM, Fish FE, Vidal D, Baumgart SL, et al. *Spinosaurus* is not an aquatic dinosaur. eLife. 2022;11: e80092. pmid:36448670
- 15. Fabbri M, Navalón G, Benson RBJ, Pol D, O’Connor J, Bhullar B-AS, et al. Subaqueous foraging among carnivorous dinosaurs. Nature. 2022;603: 852–857. pmid:35322229
- 16. Fabbri M, Navalón G, Benson RBJ, Pol D, O’Connor J, Bhullar B-AS, et al. Sinking a giant: quantitative macroevolutionary comparative methods debunk qualitative assumptions. bioRxiv; 2022.
- 17. Gônet J, Laurin M, Girondot M. BoneProfileR: The next step to quantify, model, and statistically compare bone section compactness profiles. Palaeontol Electron. 2022.
- 18. Girondot M, Laurin M. Bone profiler: a tool to quantify, model, and statistically compare bone-section compactness profiles. J Vertebr Paleontol. 2003;23: 458–461.
- 19. Fabbri M, Navalón G, Benson RBJ, Pol D, O’Connor J, Bhullar B-AS, et al. Supplementary Dataset to Subaqueous foraging among carnivorous dinosaurs. 2022. pmid:35322229
- 20. Schmitz L. Phylogenetic flexible discriminant analysis (Motani and Schmitz 2011, Evolution). 2022. Available: https://github.com/lschmitz/phylo.fda.
- 21. Motani R, Schmitz L. Phylogenetic versus functional signals in the evolution of form–function relationships in terrestrial vision. Evolution. 2011;65: 2245–2257. pmid:21790572
- 22. Schmitz L, Motani R. Nocturnality in dinosaurs inferred from scleral ring and orbit morphology. Science. 2011;332: 705–708. pmid:21493820
- 23. Tanaka K, Zelenitsky DK, Therrien F. Eggshell porosity provides insight on evolution of nesting in dinosaurs. Shawkey M, editor. PLOS ONE. 2015;10: e0142829. pmid:26605799
- 24. Smith SM, Angielczyk KD, Schmitz L, Wang SC. Do bony orbit dimensions predict diel activity pattern in sciurid rodents? Anat Rec. 2018;301: 1774–1787. pmid:30369077
- 25. De Mendoza RS, Gómez RO. Ecomorphology of the tarsometatarsus of waterfowl (Anseriformes) based on geometric morphometrics and its application to fossils. Anat Rec. 2022;305: 3243–3253. pmid:35132811
- 26. Pérez-Ben CM, Lires AI, Gómez RO. Frog limbs in deep time: is jumping locomotion at the roots of the anuran Bauplan? Paleobiology. 2023; 1–12.
- 27. Efron B, Tibshirani R. An introduction to the bootstrap. New York: Chapman & Hall; 1993.
- 28. Efron B, Narasimhan B. The automatic construction of bootstrap confidence intervals. J Comput Graph Stat. 2020;29: 608–619. pmid:33727780
- 29. Puth M-T, Neuhäuser M, Ruxton GD. On the variety of methods for calculating confidence intervals by bootstrapping. J Anim Ecol. 2015;84: 892–897. pmid:26074184
- 30. Polly PD. Phylogenetics-for-Mathematica. 2023. Available: https://github.com/pdpolly/Phylogenetics-for-Mathematica.
- 31. Myhrvold NP. Diving-dinosaurs. 2023. Available: https://github.com/intvenlab/Diving-dinosaurs.
- 32. Good P. Permutation tests: a practical guide to resampling methods for testing hypotheses. New York: Springer; 1994.
- 33. Mathematica. Wolfram; 2021. Available: https://www.wolfram.com/mathematica/.
- 34. Burnham KP, Anderson DR, editors. Model selection and multimodel inference. 2nd ed. New York, NY: Springer New York; 2004. https://doi.org/10.1007/b97636
- 35. Fernández Pierna JA, Massart DL. Improved algorithm for clustering tendency. Anal Chim Acta. 2000;408: 13–20.
- 36. Lawson RG, Jurs PC. New index for clustering tendency and its application to chemical problems. J Chem Inf Comput Sci. 1990;30: 36–41.
- 37. Hastie T, Tibshirani R, Buja A. Flexible discriminant analysis by optimal scoring. J Am Stat Assoc. 1994;89: 1255–1270.
- 38. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugen. 1936;7: 179–188.
- 39. Hastie T, Friedman J, Tibshirani R. The Elements of Statistical Learning. New York, NY: Springer New York; 2001. https://doi.org/10.1007/978-0-387-21606-5
- 40. Canoville A, Laurin M. Microanatomical diversity of the humerus and lifestyle in lissamphibians. Acta Zool. 2009;90: 110–122.
- 41. Germain D, Laurin M. Microanatomy of the radius and lifestyle in amniotes (Vertebrata, Tetrapoda). Zool Scr. 2005;34: 335–350.
- 42. Kriloff A, Germain D, Canoville A, Vincent P, Sache M, Laurin M. Evolution of bone microanatomy of the tetrapod tibia and its use in palaeobiological inference. J Evol Biol. 2008;21: 807–826. pmid:18312321
- 43. Hayashi S, Houssaye A, Nakajima Y, Chiba K, Ando T, Sawamura H, et al. Bone inner structure suggests increasing aquatic adaptations in Desmostylia (Mammalia, Afrotheria). Viriot L, editor. PLOS ONE. 2013;8: e59146. pmid:23565143
- 44. Canoville A, Laurin M. Evolution of humeral microanatomy and lifestyle in amniotes, and some comments on palaeobiological inferences. Biol J Linn Soc. 2010;100: 384–406.
- 45. Sun D, Zhou X, Yu Z, Xu S, Seim I, Yang G. Accelerated evolution and diversifying selection drove the adaptation of cetacean bone microstructure. BMC Evol Biol. 2019;19: 194. pmid:31651232
- 46. Canoville A, de Buffrénil V, Laurin M. Microanatomical diversity of amniote ribs: an exploratory quantitative study. Biol J Linn Soc. 2016;118: 706–733.
- 47. Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567: 305–307. pmid:30894741
- 48. Aggarwal CC. Data mining: the textbook. 1st ed. 2015. Cham: Springer; 2015. https://doi.org/10.1007/978-3-319-14142-8
- 49. Zaki MJ, Meira W. Data mining and analysis: fundamental concepts and algorithms. New York, NY: Cambridge University Press; 2014.
- 50. Houssaye A. “Pachyostosis” in aquatic amniotes: a review. Integr Zool. 2009;4: 325–340. pmid:21392306
- 51. Dumont ER. Bone density and the lightweight skeletons of birds. Proc R Soc B Biol Sci. 2010;277: 2193–2198. pmid:20236981
- 52. Houssaye A, Martin Sander P, Klein N. Adaptive patterns in aquatic amniote bone microanatomy—more complex than previously thought. Integr Comp Biol. 2016;56: 1349–1369. pmid:27794536
- 53. Houssaye A, Waskow K, Hayashi S, Cornette R, Lee AH, Hutchinson JR. Biomechanical evolution of solid bones in large animals: a microanatomical investigation. Biol J Linn Soc. 2016;117: 350–371.
- 54. Taylor MA. Functional significance of bone ballast in the evolution of buoyancy control strategies by aquatic tetrapods. Hist Biol. 2000;14: 15–31.
- 55. Quemeneur S, de Buffrénil V, Laurin M. Microanatomy of the amniote femur and inference of lifestyle in limbed vertebrates: Femoral Microanatomy and Lifestyle. Biol J Linn Soc. 2013;109: 644–655.
- 56. Huttenlocker AK, Shelton CD. Bone histology of varanopids (Synapsida) from Richards Spur, Oklahoma, sheds light on growth patterns and lifestyle in early terrestrial colonizers. Philos Trans R Soc B Biol Sci. 2020;375: 20190142. pmid:31928198
- 57. Meier PS, Bickelmann C, Scheyer TM, Koyabu D, Sánchez-Villagra MR. Evolution of bone compactness in extant and extinct moles (Talpidae): exploring humeral microstructure in small fossorial mammals. BMC Evol Biol. 2013;13: 55. pmid:23442022
- 58. Scotcher JSB, Stewart DRM, Breen CM. The diet of the hippopotamus in Ndumu Game Reserve Natal, as determined by faecal analysis. South Afr J Wildl Res. 1978;8: 1–11.
- 59. Hendier A, Chatelain C, Du Pasquier P, Paris M, Ouattara K, Koné I, et al. A new method to determine the diet of pygmy hippopotamus in Taï National Park, Côte d’Ivoire. Afr J Ecol. 2021;59: 809–825.
- 60. Henry O, Feer F, Sabatier D. Diet of the lowland tapir (*Tapirus terrestris L*.) in French Guiana. Biotropica. 2000;32: 364–368.
- 61. Mohamed NZ, Traeholt C. A preliminary study of habitat selection by Malayan Tapir, *Tapirus indicus*, in Krau Wildlife Reserve, Malaysia. Tapir Conserv News. 2010;19/2: 32–35.
- 62. Haarberg O, Rosell F. Selective foraging on woody plant species by the Eurasian beaver (*Castor fiber*) in Telemark, Norway. J Zool. 2006;270: 201–208.
- 63. Neyland PJ. Habitat, home range, diet and demography of the water vole (*Arvicola amphibius*): Patch-use in a complex wetland landscape. Swansea University. 2011.
- 64. Myhrvold N, Sereno PC, Baumgart SL, Formoso KK, Vidal D, Fish FE, et al. Spinosaurids as ‘subaqueous foragers’ undermined by selective sampling and problematic statistical inference. bioRxiv; 2022.
- 65. Harper LR, Watson HV, Donnelly R, Hampshire R, Sayer CD, Breithaupt T, et al. Using DNA metabarcoding to investigate diet and niche partitioning in the native European otter (*Lutra lutra*) and invasive American mink (*Neovison vison*). Metabarcoding Metagenomics. 2020; e56087.
- 66. Biffi M, Gillet F, Laffaille P, Colas F, Aulagnier S, Blanc F, et al. Novel insights into the diet of the Pyrenean desman (*Galemys pyrenaicus*) using next-generation sequencing molecular analyses. J Mammal. 2017.
- 67. Wolfe JL, Bradshaw DK, Chabreck RH. Alligator feeding habits: new data and a review. Northeast Gulf Sci. 1987;9.
- 68. Shoop CR, Ruckdeschel CA. Alligators as predators on terrestrial mammals. Am Midl Nat. 1990;124: 407.
- 69. Woodborne S, Botha H, Huchzermeyer D, Myburgh J, Hall G, Myburgh A. Ontogenetic dependence of Nile crocodile (*Crocodylus niloticus*) isotope diet‐to‐tissue discrimination factors. Rapid Commun Mass Spectrom. 2021;35. pmid:34224610
- 70. Adame MF, Jardine TD, Fry B, Valdez D, Lindner G, Nadji J, et al. Estuarine crocodiles in a tropical coastal floodplain obtain nutrition from terrestrial prey. PLOS ONE. 2018;13: e0197159. pmid:29874276
- 71. Nifong JC, Nifong RL, Silliman BR, Lowers RH, Guillette LJ, Ferguson JM, et al. Animal-borne imaging reveals novel insights into the foraging behaviors and diel activity of a large-bodied apex predator, the American Alligator (*Alligator mississippiensis*). Marshall CD, editor. PLOS ONE. 2014;9: e83953. pmid:24454711
- 72. Wallace KM, Leslie AJ. Diet of the Nile crocodile (*Crocodylus niloticus*) in the Okavango Delta, Botswana. J Herpetol. 2008;42: 361–368.
- 73. Platt SG, Rainwater TR, Finger AG, Thorbjarnarson JB, Anderson TA, McMurry ST. Food habits, ontogenetic dietary partitioning and observations of foraging behaviour of Morelet’s crocodile (*Crocodylus moreletii*) in northern Belize. Herpetol J. 2006;16: 281–290.
- 74. Magnusson WE, da Silva EV, Lima AP. Diets of Amazonian crocodilians. J Herpetol. 1987;21: 85.
- 75. Webb G, Manolis S, Buckworth R. *Crocodylus johnstoni* in the McKinlay River area, N.T. I. Variation in the diet, and a new method of assessing the relative importance of prey. Aust J Zool. 1982;30: 877.
- 76. Hussain SA. Basking site and water depth selection by gharial *Gavialis gangeticus* Gmelin 1789 (Crocodylia, Reptilia) in National Chambal Sanctuary, India and its implication for river conservation. Aquat Conserv Mar Freshw Ecosyst. 2009;19: 127–133.
- 77. Tucker AD, Limpus CJ, McCallum HI, McDonald KR. Ontogenetic dietary partitioning by *Crocodylus johnstoni* during the dry season. Copeia. 1996;1996: 978.
- 78. Radloff FGT, Hobson KA, Leslie AJ. Characterising ontogenetic niche shifts in Nile crocodile using stable isotope (δ^{13}C, δ^{15}N) analyses of scute keratin. Isotopes Environ Health Stud. 2012;48: 439–456. pmid:22462522
- 79. Fish FE. Kinematics of undulatory swimming in the American alligator. Copeia. 1984;1984: 839.
- 80. Holtz TR. Theropod guild structure and the tyrannosaurid niche assimilation hypothesis: implications for predatory dinosaur macroecology and ontogeny in later Late Cretaceous Asiamerica. Can J Earth Sci. 2021;58: 778–795.
- 81. Schroeder K, Lyons SK, Smith FA. The influence of juvenile dinosaurs on community structure and diversity. Science. 2021;371: 941–944. pmid:33632845
- 82. Woodward HN, Tremaine K, Williams SA, Zanno LE, Horner JR, Myhrvold N. Growing up *Tyrannosaurus rex*: osteohistology refutes the pygmy “*Nanotyrannus*” and supports ontogenetic niche partitioning in juvenile *Tyrannosaurus*. Sci Adv. 2020;6: eaax6250. pmid:31911944
- 83. Gard R. Brown bear predation on sockeye salmon at Karluk Lake, Alaska. J Wildl Manag. 1971;35: 193.
- 84. Reimchen TE. Some ecological and evolutionary aspects of bear-salmon interactions in coastal British Columbia. Can J Zool. 2000;78: 448–457.
- 85. Darimont CT, Reimchen TE, Paquet PC. Foraging behaviour by gray wolves on salmon streams in coastal British Columbia. Can J Zool. 2003;81: 349–353.
- 86. Stanek AE, Wolf N, Hilderbrand GV, Mangipane B, Causey D, Welker JM. Seasonal foraging strategies of Alaskan gray wolves (*Canis lupus*) in an ecosystem subsidized by Pacific salmon (*Oncorhynchus spp*.). Can J Zool. 2017;95: 555–563.
- 87. Da Silveira R, Ramalho EE, Thorbjarnarson JB, Magnusson WE. Depredation by jaguars on caimans and importance of reptiles in the diet of jaguar. J Herpetol. 2010;44: 418–424.
- 88. Azevedo FCC, Verdade LM. Predator–prey interactions: jaguar predation on caiman in a floodplain forest. Braae A, editor. J Zool. 2012;286: 200–207.
- 89. Miranda EBP, Menezes JFS de, Rheingantz ML. Reptiles as principal prey? Adaptations for durophagy and prey selection by jaguar (*Panthera onca*). J Nat Hist. 2016;50: 2021–2035.
- 90. Hilderbrand GV, Schwartz CC, Robbins CT, Jacoby ME, Hanley TA, Arthur SM, et al. The importance of meat, particularly salmon, to body size, population productivity, and conservation of North American brown bears. Can J Zool. 1999;77: 132–138.
- 91. Thompson CM, Nye PE, Schmidt GA, Garcelon DK. Foraging ecology of bald eagles in a freshwater tidal system. Boal, editor. J Wildl Manag. 2005;69: 609–617.
- 92. Schaadt CP, Rymon LM. Innate fishing behavior of ospreys. Raptor Res. 1982;16: 61–62.
- 93. Martin GR, Mcneil R, Rojas LM. Vision and the foraging technique of skimmers (Rynchopidae): Foraging in skimmers. Ibis. 2007;149: 750–757.
- 94. Kasner AC, Dixon TP. Aerial foraging over open water by great egrets and snowy egrets on schooling freshwater fish. Wilson Bull. 2003;115: 199–200.
- 95. Kushlan JA. Resource use strategies of wading birds. Wilson Bull. 1981; 145–163.
- 96. Guillet A. Aspects of the foraging behaviour of the shoebill. Ostrich. 1979;50: 252–255.
- 97. Recher HF, Holmes RT, Davis WE, Morton S. Foraging behavior of Australian herons. Colon Waterbirds. 1983;6: 1.
- 98. Coughlin BL, Fish FE. Hippopotamus underwater locomotion: reduced-gravity movements for a massive mammal. J Mammal. 2009;90: 675–679.
- 99. O’Connor PM. Evolution of archosaurian body plans: skeletal adaptations of an air-sac-based breathing apparatus in birds and other archosaurs. J Exp Zool Part Ecol Genet Physiol. 2009;311A: 629–646. pmid:19810215
- 100. Burton MGP, Benson RBJ, Field DJ. Direct quantification of skeletal pneumaticity illuminates ecological drivers of a key avian trait. Proc R Soc B Biol Sci. 2023;290: 20230160. pmid:36919426
- 101. Smith ND. Body mass and foraging ecology predict evolutionary patterns of skeletal pneumaticity in the diverse “waterbird” clade: Phylogenetic patterns in waterbird pneumaticity. Evolution. 2012;66: 1059–1078. pmid:22486689
- 102. Evers SW, Rauhut OWM, Milner AC, McFeeters B, Allain R. A reappraisal of the morphology and systematic position of the theropod dinosaur *Sigilmassasaurus* from the “middle” Cretaceous of Morocco. PeerJ. 2015;3: e1323. pmid:26500829
- 103. Schachner ER, Hedrick BP, Richbourg HA, Hutchinson JR, Farmer C. Anatomy, ontogeny, and evolution of the archosaurian respiratory system: A case study on *Alligator mississippiensis* and *Struthio camelus*. J Anat. 2021;238: 845–873. pmid:33345301
- 104. Schachner ER, Lawson AB, Martinez A, Grand Pre CA, Sabottke C, Abou-Issa F, et al. Perspectives on lung visualization: Three-dimensional anatomical modeling of computed and micro-computed tomographic data in comparative evolutionary morphology and medicine with applications for COVID-19. Anat Rec. 2023;n/a. pmid:37528640
- 105. Legendre LJ, Botha-Brink J. Digging the compromise: Investigating the link between limb bone histology and fossoriality in the aardvark (*Orycteropus afer*). PeerJ. 2018;6: e5216. pmid:30018860
- 106. Fish FE, Stein BR. Functional correlates of differences in bone density among terrestrial and aquatic genera in the family Mustelidae (Mammalia). Zoomorphology. 1991;110: 339–345.
- 107. Mallet C, Cornette R, Billet G, Houssaye A. Interspecific variation in the limb long bones among modern rhinoceroses—extent and drivers. PeerJ. 2019;7: e7647. pmid:31579585
- 108. Hutchinson JR. The evolutionary biomechanics of locomotor function in giant land animals. J Exp Biol. 2021;224: jeb217463. pmid:34100541
- 109. Chinsamy A, Angst D, Canoville A, Göhlich UB. Bone histology yields insights into the biology of the extinct elephant birds (Aepyornithidae) from Madagascar. Biol J Linn Soc. 2020;130: 268–295.
- 110. Canoville A, Chinsamy A, Angst D. New comparative data on the long bone microstructure of large extant and extinct flightless birds. Diversity. 2022;14: 298.
- 111. Mallet C, Cornette R, Billet G, Houssaye A. Are rhinoceros graviportal? Morphofunctional 3D-analysis of modern rhinoceros limb long bones. ICVM 2019: International Congress of Vertebrate Morphology. Prague, Czech Republic: Wiley; 2019. pp. S171–172. https://doi.org/10.1002/jmor.21003
- 112. Taylor MA. Stomach stones for feeding or buoyancy? The occurrence and function of gastroliths in marine tetrapods. Philos Trans R Soc Lond B Biol Sci. 1997;341: 163–175.
- 113. Henderson DM. Floating point: a computational study of buoyancy, equilibrium, and gastroliths in plesiosaurs. Lethaia. 2006;39: 227–244.
- 114. Henderson DM. Effects of stomach stones on the buoyancy and equilibrium of a floating crocodilian: a computational analysis. Can J Zool. 2003;81: 1346–1357.
- 115. Mateus O. Lourinhanosaurus antunesi, a new upper Jurassic allosauroid (Dinosauria: Theropoda) from Lourinhã, Portugal. Mem Acad Ciênc Lisb. 1998;37: 111–124.
- 116. Wings O. A review of gastrolith function with implications for fossil vertebrates and a revised classification. Acta Palaeontol Pol. 2007;52: 1.
- 117. Ruff C, Holt B, Trinkaus E. Who’s afraid of the big bad Wolff?: “Wolff’s law” and bone functional adaptation. Am J Phys Anthropol. 2006;129: 484–498. pmid:16425178
- 118. Frost HM. Skeletal structural adaptations to mechanical usage (SATMU): 1. Redefining Wolff’s Law: The bone modeling problem. Anat Rec. 1990;226: 403–413. pmid:2184695
- 119. Frost HM. Skeletal structural adaptations to mechanical usage (SATMU): 2. Redefining Wolff’s Law: The remodeling problem. Anat Rec. 1990;226: 414–422. pmid:2184696
- 120. D’Angelo JS, Garcia Marsà JA, Agnolín FL, Novas FE. Biological implications of the bone microstructure of a new elasmosaurid (Sauropterygia, Plesiosauroidea) from the uppermost Cretaceous of Patagonia. Hist Biol. 2023; 1–9.
- 121. Ksepka DT, Werning S, Sclafani M, Boles ZM. Bone histology in extant and fossil penguins (Aves: Sphenisciformes). J Anat. 2015;227: 611–630. pmid:26360700
- 122. Kooyman GL, Goetz K, Williams CL, Ponganis PJ, Sato K, Eckert S, et al. Crary bank: a deep foraging habitat for emperor penguins in the western Ross Sea. Polar Biol. 2020;43: 801–811.
- 123. Alexander RMcN, Pond CM. Locomotion and bone strength of the white rhinoceros, *Ceratotherium simum*. J Zool. 1992;227: 63–69.
- 124. Etienne C, Houssaye A, Hutchinson JR. Limb myology and muscle architecture of the Indian rhinoceros *Rhinoceros unicornis* and the white rhinoceros *Ceratotherium simum* (Mammalia: Rhinocerotidae). PeerJ. 2021;9: e11314. pmid:34026351
- 125. Gregory WK. Notes on the principles of quadrupedal locomotion and on the mechanism of the limbs in hoofed animals. Ann N Y Acad Sci. 1912;22: 267–294.
- 126. Carrano MT. What, if anything, is a cursor? Categories versus continua for determining locomotor habit in mammals and dinosaurs. J Zool. 1999;247: 29–42.
- 127. Christiansen P. Strength indicator values of theropod long bones, with comments on limb proportions and cursorial potential. Gaia. 1998;15: 241–255.
- 128. Storer RW. Adaptive radiation of birds. In: Marshall AJ, editor. Biology and comparative physiology of birds. Academic Press; 1971. pp. 15–55.
- 129. Angst D, Buffetaut E, Lecuyer C, Amiot R. A new method for estimating locomotion type in large ground birds. Benson R, editor. Palaeontology. 2016;59: 217–223.
- 130. Buffetaut E, Angst D. An introduction to evolution and palaeobiology of flightless birds. Diversity. 2022;14: 296.
- 131. Waskow K, Martin Sander P. Growth record and histological variation in the dorsal ribs of Camarasaurus sp. (Sauropoda). J Vertebr Paleontol. 2014;34: 852–869.
- 132. Houssaye A, Martin F, Boisserie J-R, Lihoreau F. Paleoecological inferences from long bone microanatomical specializations in Hippopotamoidea (Mammalia, Artiodactyla). J Mamm Evol. 2021;28: 847–870.
- 133. Amson E, de Muizon C, Laurin M, Argot C, de Buffrénil V. Gradual adaptation of bone structure to aquatic lifestyle in extinct sloths from Peru. Proc R Soc B Biol Sci. 2014;281: 20140192. pmid:24621950
- 134. Houssaye A, Botton-Divet L. From land to water: evolutionary changes in long bone microanatomy of otters (Mammalia: Mustelidae). Biol J Linn Soc. 2018;125: 240–249.
- 135. Klein N, Sander PM, Krahl A, Scheyer TM, Houssaye A. Diverse aquatic adaptations in *Nothosaurus* spp. (Sauropterygia)—inferences from humeral histology and microanatomy. PLOS ONE. 2016;11: e0158448. pmid:27391607
- 136. Scheyer TM, Klein N, Houssaye A. Sauropterygia: Nothosauria and Pachypleurosauria. 1st ed. Vertebrate Skeletal Histology and Paleohistology. 1st ed. Boca Raton: CRC Press; 2021. pp. 435–443. https://doi.org/10.1201/9781351189590-21
- 137. Nakajima Y, Hirayama R, Endo H. Turtle humeral microanatomy and its relationship to lifestyle. Biol J Linn Soc. 2014;112: 719–734.
- 138. Amson E. Humeral diaphysis structure across mammals. Evolution. 2021;75: 748–755. pmid:33433007
- 139. Lefebvre R, Allain R, Houssaye A. What’s inside a sauropod limb? First three‐dimensional investigation of the limb long bone microanatomy of a sauropod dinosaur, *Nigersaurus taqueti* (Neosauropoda, Rebbachisauridae), and implications for the weight‐bearing function. Palaeontology. 2023;66: e12670.
- 140. Domning DP, de Buffrénil V. Hydrostasis in the Sirenia: quantitative data and functional interpretations. Mar Mammal Sci. 1991;7: 331–368.
- 141. Buffrénil V de, Laurin M, Jouve S. Archosauromorpha: the Crocodylomorpha. Vertebrate skeletal histology and paleohistology. New York: CRC Press; 2021. pp. 486–510.
- 142. Klein N. Long bone histology of sauropterygia from the lower Muschelkalk of the Germanic basin provides unexpected implications for phylogeny. PLOS ONE. 2010;5: 1–25. pmid:20657768
- 143. Krahl A, Klein N, Sander P. Evolutionary implications of the divergent long bone histologies of *Nothosaurus* and *Pistosaurus* (Sauropterygia, Triassic). BMC Evol Biol. 2013;13: 123. pmid:23773234
- 144. Klein N, Griebeler EM. Bone histology, microanatomy, and growth of the nothosauroid *Simosaurus gaillardoti* (Sauropterygia) from the Upper Muschelkalk of southern Germany/Baden-Württemberg. Comptes Rendus Palevol. 2016;15: 142–162.
- 145. Russell DA. Isolated dinosaur bones from the middle Cretaceous of the Tafilalt, Morocco. Bull Muséum Natl Hist Nat Sect C Sci Terre Paléontol Géologie Minéralogie. 1996;18: 349–402.
- 146. Canoville A, Schweitzer MH, Zanno LE. Systemic distribution of medullary bone in the avian skeleton: ground truthing criteria for the identification of reproductive tissues in extinct Avemetatarsalia. BMC Evol Biol. 2019;19: 71. pmid:30845911
- 147. Klein N, Canoville A, Houssaye A. Microstructure of vertebrae, ribs, and gastralia of Triassic Sauropterygians—new insights into the microanatomical processes involved in aquatic adaptations of marine reptiles. Anat Rec. 2019;302: 1770–1791. pmid:30989828
- 148. Houssaye A, Scheyer TM, Kolb C, Fischer V, Sander PM. A new look at ichthyosaur long bone microanatomy and histology: implications for their adaptation to an aquatic life. Farke AA, editor. PLOS ONE. 2014;9: e95637. pmid:24752508
- 149. Houssaye A, Tafforeau P, de Muizon C, Gingerich PD. Transition of Eocene whales from land to sea: evidence from bone microstructure. Beatty BL, editor. PLOS ONE. 2015;10: e0118409. pmid:25714394
- 150. Amson E, Kolb C. Scaling effect on the mid-diaphysis properties of long bones—the case of the Cervidae (deer). Sci Nat. 2016;103: 58. pmid:27350329
- 151. Dechaume-Moncharmont F-X, Monceau K, Cezilly F. Sexing birds using discriminant function analysis: a critical appraisal. The Auk. 2011;128: 78–86.
- 152. Wagenmakers E-J, Farrell S. AIC model selection using Akaike weights. Psychon Bull Rev. 2004;11: 192–196. pmid:15117008
- 153. Adolfsson A, Ackerman M, Brownstein NC. To cluster, or not to cluster: an analysis of clusterability methods. Pattern Recognit. 2019;88: 13–26.
- 154. Redelstorff R, Sander PM. Long and girdle bone histology of *Stegosaurus*: implications for growth and life history. J Vertebr Paleontol. 2009;29: 1087–1099.
- 155. Balanoff AM, Xu X, Kobayashi Y, Matsufune Y, Norell MA. Cranial osteology of the theropod dinosaur *Incisivosaurus gauthieri* (Theropoda: Oviraptorosauria). Am Mus Novit. 2009;2009: 1–35.
- 156. Woodward HN, Lehman TM. Bone histology and microanatomy of *Alamosaurus sanjuanensis* (Sauropoda: Titanosauria) from the Maastrichtian of Big Bend National Park, Texas. J Vertebr Paleontol. 2009;29: 807–821.
- 157. Laurin M, Girondot M, Loth M-M. The evolution of long bone microstructure and lifestyle in lissamphibians. Paleobiology. 2004;30: 589–613.
- 158. Brown JH. Macroecology. University of Chicago Press; 1995.
- 159. Gaston K, Blackburn T. Pattern and process in macroecology. John Wiley & Sons; 2008.
- 160. Connolly SR, Keith SA, Colwell RK, Rahbek C. Process, mechanism, and modeling in macroecology. Trends Ecol Evol. 2017;32: 835–844. pmid:28919203