Natural variability in bee brain size and symmetry revealed by micro-CT imaging and deep learning

Analysing large numbers of brain samples can reveal minor but statistically and biologically relevant variations in brain morphology that provide critical insights into animal behaviour, ecology and evolution. So far, however, such analyses have required extensive manual effort, which considerably limits the scope for comparative research. Here we used micro-CT imaging and deep learning to perform automated analyses of 3D image data from 187 honey bee and bumblebee brains. We revealed strong inter-individual variations in total brain size that are consistent across colonies and species, and may underpin behavioural variability central to complex social organisations. In addition, the bumblebee dataset showed a significant level of lateralization in optic and antennal lobes, providing a potential explanation for reported variations in visual and olfactory learning. Our fast, robust and user-friendly approach holds considerable promise for carrying out large-scale quantitative neuroanatomical comparisons across a wider range of animals. Ultimately, this will help address fundamental unresolved questions related to the evolution of animal brains and cognition.

Response. The statistical analysis between colonies is described in lines 545−548, which read: "To analyse the inter-colonial variations of brain volume, we conducted linear mixed models (LMMs) using the lme4 package [686], with hive as fixed effect and population as random factor. LMMs were followed by F-tests to test the significance of fixed categorical variables using the anova function in the car package [697]." See also lines 362−364: "In our study, honey bees in the nine colonies exhibited overall similar average brain volumes (Fig 6A and Table 1); hence, inter-colony differences do not explain the substantial (32%) inter-individual variability."

Reviewer Comment C1.6. It is not very clear when the authors talk about sets of images, 3D images, 2D images, or scans; e.g., in line 134 they talk about different numbers of images, in line 141 about scans, in line 163 about 3D images. Even though they are all the same, it is better to keep one description for the readers' benefit.
Response. Thank you for your comment. To enhance clarity, we now use the terms "3D image", or "three-dimensional image" when it follows a numeric value, throughout the manuscript. In rare cases where we refer to a slice (i.e., a 2D image) extracted from a 3D image, we now explicitly indicate it.
Reviewer Comment C1.7. The authors mention in line 178 that their automatic segmentation is highly accurate, which is too subjective a statement. 12.5% of an error greater than 4% is higher than the 9.4% with very few errors in bumblebees. That, I believe, comes as a result of an overfitted network due to the lack of training data. I would suggest the authors used transfer learning algorithms to retain the weights from honeybees and refine them for the bumblebees. In that way their network will be more generalized.
Response. Thank you for this important comment. We have weakened the statement of the headline (lines 218−219) as follows: "Accurate automatic segmentation of bee brains with noticeable performance variations in bumblebees". While the segmentation accuracy is indeed lower in bumblebees compared to honeybees, it is important to note that this discrepancy arises from the lower image quality of the bumblebee data, rather than overfitting. To illustrate this, we have included S2 Fig, which showcases the segmentation results and errors of specific bumblebee images. Regarding the size of the training data, it is important to consider that we utilize 3D patches extracted from the 3D volumes, resulting in a substantial number of training patches. Each 3D image, scaled to 256x256x256 voxels, yields 343 patches of 64x64x64 voxels based on a stride size of 32 voxels. For the 13 volumetric training images of bumblebee brains, this amounts to a total of 4,459 individual training patches. These 3D patches are then utilized to train the neural network.
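The patch arithmetic quoted above is easy to verify (a minimal sketch of the counting only, not Biomedisa's actual patch extraction; the volume size, patch size and stride are the ones given in the response):

```python
def count_patches(volume=256, patch=64, stride=32):
    """Number of overlapping 3D patch positions in a cubic volume."""
    per_axis = (volume - patch) // stride + 1  # 7 positions per axis
    return per_axis ** 3

print(count_patches())       # 343 patches per 256^3 volume
print(13 * count_patches())  # 4459 patches for 13 training volumes
```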
During inference, overlapping 3D patches of 64x64x64 voxels are employed to generate the 3D segmentation result. Both the standard accuracy and the Dice score stay close to their maximums (maximum standard accuracy after 154 epochs for bumblebees and after 197 epochs for honey bees; maximum Dice score after 129 epochs for bumblebees and after 123 epochs for honey bees) even up to 200 epochs (S4−7 Figs), indicating that overfitting is not a concern. It is true, though, that the loss function starts to increase slightly from around 100 epochs (S5 Fig).

Reviewer. The training set is small considering the number of epochs. As stated in my previous comment, the presented training is not optimal.
Reviewer Comment C1.9. Also, why is the validation set larger than the training set? How did the authors choose to split their data?
Response. The decision to have a larger validation set compared to the training set was made to align with the process used in analyzing our bee brains and what a typical Biomedisa user would do. We added the following paragraph to the "Increasing the number of 3D training images" section (lines 606−611): "While the conventional practice often involves allocating approximately 80% of the data for training and 20% for testing, or using a split of 60% for training and 20% each for validation and testing, we opted for a training dataset size that aligns with the analysis procedure for our specific bee brain dataset. Thus, we used up to 26 CT scans for training, selecting a similar amount for validation, considering the compute-intensive nature of validation during training, and assigned the remaining 3D images for testing."

Reviewer Comment C1.10. It is common to choose the values of hyperparameters based on the accuracy and performance of the network during training and validation. Can the authors provide such data?
Response. Biomedisa follows the idea of a simple, out-of-the-box tool. Therefore, we chose the standard configuration of Biomedisa as described in the manuscript. These settings have proven to perform well on diverse image datasets. Although researchers can adapt the parameters to their own needs, we wanted to demonstrate the process based on the standard configuration without optimization, making it simple to use. Nevertheless, we checked the accuracy for varying learning rate, batch size, and network size and included this information in the new S2 Table. The accuracy for varying epochs can now be found in S4 & S6 Figs, which have been added to the Supporting Information and discussed in detail in the section "Accurate automatic segmentation of bee brains with noticeable performance variations in bumblebees". Most interestingly, choosing a smaller network size (up to 5 blocks with a maximum of 512 channels for the deepest layer) does not give statistically significantly worse results but reduces computation time from 13 to 10 hours using 4 NVIDIA V100 GPUs.
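The "5 blocks with a maximum of 512 channels" remark matches the usual doubling of channels per encoder block. A sketch, assuming a U-Net-style progression starting from 32 channels (the base width is our assumption for illustration, not stated in the response):

```python
def encoder_channels(blocks, base=32):
    """Channel count per encoder block, doubling at each level: 32, 64, 128, ..."""
    return [base * 2 ** i for i in range(blocks)]

print(encoder_channels(5))  # [32, 64, 128, 256, 512] -- deepest layer at 512 channels
```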
Reviewer. Even though I understand the Biomedisa contribution to this study, the focus of the specific paper is the use of deep learning methods for the segmentation of micro-CT images of bee brains. As such, Biomedisa should be a tool for pre-processing that can be skipped if someone chooses other manual segmentation methods for the creation of the masks. I think the authors should focus on the network and its optimization rather than choose the standard configuration of Biomedisa.

Reviewer Comment C1.12. An ROC curve is essential to check the overfitting or underfitting of the network.

Response. The Receiver Operating Characteristic (ROC) curve is not typically used for measuring accuracy in image segmentation tasks. The ROC curve is commonly employed in binary classification problems where the goal is to distinguish between two classes, such as detecting whether an image contains a specific object or not. Image segmentation, on the other hand, involves dividing an image into meaningful regions or segments. Accuracy in image segmentation is typically evaluated using metrics that assess the quality of the segmentation masks, such as Intersection over Union (IoU), also known as the Jaccard index, or the Dice coefficient. These metrics measure the overlap between the predicted segmentation and the ground truth segmentation. The ROC curve is not directly applicable to segmentation because it requires a binary classification output. In image segmentation, the output is a pixel-wise mask or a set of regions, rather than a binary classification label. To evaluate the performance of an image segmentation model, it is more appropriate to use metrics like IoU, the Dice coefficient, or pixel-wise accuracy, which focus on the quality of the segmentation itself rather than classifying individual pixels as positive or negative. The basic problem with applying ROC to image segmentation is that the background is taken into account, and there is always a huge amount of correctly labelled background, often resulting in an AUC of 1.0. In contrast, the Dice score does not consider the background but rather the overlap of a brain area with its ground truth, making it much more reliable.
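The background-dominance argument can be made concrete with a toy example (illustrative numbers only, not the manuscript's data): on a volume where a brain area covers 1% of the voxels, a prediction that misses 4 of its 10 voxels still scores near-perfect pixel-wise accuracy, while the Dice score reveals the error.

```python
def dice(pred, truth, label):
    """Dice coefficient for one label, ignoring the background."""
    p = {i for i, v in enumerate(pred) if v == label}
    t = {i for i, v in enumerate(truth) if v == label}
    return 2 * len(p & t) / (len(p) + len(t))

def pixel_accuracy(pred, truth):
    """Fraction of voxels labelled correctly, background included."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# 1000 voxels; a brain area (label 1) covers only 10 of them,
# and the prediction recovers just 6 of those 10
truth = [0] * 990 + [1] * 10
pred = [0] * 994 + [1] * 6

print(pixel_accuracy(pred, truth))  # 0.996 -- looks near-perfect
print(dice(pred, truth, 1))         # 0.75  -- exposes the missed voxels
```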

Reviewer. Image segmentation is an assignment of labels to pixels to create subgroups of the same important elements in the pixels. The output is a matrix of elements that specify the object class. Even though in the specific segmentation the authors chose multi-class segmentation, the ROC curve could be calculated using the different segmentation thresholds as in binary segmentations.

Reviewer Conclusion.
Based on all these points, I think the manuscript needs to be significantly revised to warrant publication in PLOS Computational Biology.
It is important to note that we always used Biomedisa's standard configuration to demonstrate what users can expect without optimizing parameters for their specific dataset. While transfer learning is particularly valuable when training data is limited, for the bumblebee dataset we conducted a comparison between fine-tuning the network trained on the honeybee dataset, following Amiri et al. 2020 (keeping the decoder part fixed and only training the encoder part), and training from scratch using different numbers of training images (S3 Fig). As expected, fine-tuning was better when the number of training images was small (3 to 7 training images) but not when more than 12 training images were utilized. It is worth noting that despite the similarity in brain shape, the bumblebee CT scans differ significantly from the honeybee images, in particular in terms of contrast between brain areas, especially concerning the CX (S2 Fig).

Reviewer. From Fig S4 we can see that after almost 100 epochs the training had no meaning, as the accuracy stayed the same and there is no significant difference in the test accuracy. Moreover, the figure is an indicator of overfitting, as the gap in accuracy between the testing and training sets is big. I would suggest the authors stopped the training after 100 epochs. Fig S5 also shows that after 100 epochs the network is overfitting, as the gap in loss is expanding.

Reviewer Comment C1.8. It is unclear why the authors chose 200 training epochs in such a small sample.

Response. The sample size, as mentioned earlier, is not small. Although the accuracy remains mostly unaffected, employing 200 epochs might prove inefficient. The decision to use 200 epochs was made to ensure thorough training of the network, as now mentioned in lines 270−271. However, Biomedisa provides the option of early stopping to prevent unnecessarily prolonged training.
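Early stopping of the kind mentioned at the end is usually a patience rule on the validation loss. A generic sketch (the patience value and loss curve are made up for illustration; this is not Biomedisa's implementation):

```python
def early_stop_epoch(val_losses, patience=10):
    """Epoch at which training stops: no new loss minimum for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stop: validation loss has not improved recently
    return len(val_losses) - 1  # ran to completion

# a loss that improves for 60 epochs, then slowly drifts upward
losses = [1.0 - 0.01 * e for e in range(60)] + [0.4 + 0.002 * e for e in range(140)]
print(early_stop_epoch(losses))  # 70 -- far short of 200 epochs
```

With a rule like this, a run whose loss minimum falls early (e.g. after 34 epochs, as reported for the bumblebee dataset) would terminate long before epoch 200.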
Reviewer Comment C1.11. How was the loss changing during training?

Response. The loss for the bumblebee dataset can be found in S5 Fig and the loss for the honey bee dataset in S7 Fig. The minima are after 34 epochs and 62 epochs, respectively.
Reviewer Comment C1.13. In S1 Fig it is unclear whether the network overpredicted the areas or mislabelled them. It would be much better if the authors provided the ground truth image too.

Response. We added the outline of the ground truth data to the right column of S1 Fig and added a similar figure for the bumblebee dataset (S2 Fig). While there is some overprediction, the errors occurring are most often mislabelling.
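The distinction between overprediction and mislabelling can be quantified by splitting a prediction's erroneous voxels according to the ground truth (a toy sketch with made-up labels, where 0 denotes background; voxels the network misses entirely are not counted here):

```python
def error_breakdown(pred, truth, background=0):
    """Split wrong foreground predictions into overprediction and mislabelling."""
    over = mis = 0
    for p, t in zip(pred, truth):
        if p == t or p == background:
            continue
        if t == background:
            over += 1  # predicted a brain area where there is none
        else:
            mis += 1   # predicted the wrong brain area
    return over, mis

# made-up labels for 8 voxels (0 = background, 1 and 2 = brain areas)
truth = [0, 0, 1, 1, 2, 2, 2, 0]
pred = [0, 1, 1, 2, 2, 2, 0, 0]
print(error_breakdown(pred, truth))  # (1, 1): one overprediction, one mislabelling
```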