Deep learning approaches to landmark detection in tsetse wing images

Morphometric analysis of wings has been suggested for identifying and controlling isolated populations of tsetse (Glossina spp.), vectors of human and animal trypanosomiasis in Africa. Single-wing images were captured from an extensive data set of field-collected tsetse wings of species Glossina pallidipes and G. m. morsitans. Morphometric analysis required locating 11 anatomical landmarks on each wing. The manual location of landmarks is time-consuming, prone to error, and infeasible for large data sets. We developed a two-tier method using deep learning architectures to classify images and make accurate landmark predictions. The first tier used a classification convolutional neural network to remove most wings that were missing landmarks. The second tier provided landmark coordinates for the remaining wings. We compared direct coordinate regression using a convolutional neural network and segmentation using a fully convolutional network for the second tier. For the resulting landmark predictions, we evaluated shape bias using Procrustes analysis. We paid particular attention to consistent labelling to improve model performance. For an image size of 1024 × 1280, data augmentation reduced the mean pixel distance error from 8.3 (95% confidence interval [4.4,10.3]) to 5.34 (95% confidence interval [3.0,7.0]) for the regression model. For the segmentation model, data augmentation did not alter the mean pixel distance error of 3.43 (95% confidence interval [1.9,4.4]). Segmentation had a higher computational complexity and some large outliers. Both models showed minimal shape bias. We deployed the regression model on the complete unannotated data consisting of 14,354 pairs of wing images, since this model had a lower computational cost and more stable predictions than the segmentation model. The resulting landmark data set was provided for future morphometric analysis.
The methods we have developed could provide a starting point for studying the wings of other insect species. All the code used in this study is written in Python and has been open sourced.
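As an illustration of the evaluation metric reported in the abstract, a minimal sketch of a mean pixel distance error computation is given below. The coordinates are toy values for illustration only, not taken from the paper's data or code.

```python
import numpy as np

def mean_pixel_distance_error(pred, true):
    """Mean Euclidean distance (in pixels) between predicted and
    annotated landmarks. Arrays have shape (n_wings, n_landmarks, 2)."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    per_landmark = np.linalg.norm(pred - true, axis=-1)  # (n_wings, n_landmarks)
    return per_landmark.mean()

# Toy example: 2 wings, 3 landmarks each (hypothetical coordinates).
pred = [[[10, 10], [20, 20], [30, 30]], [[0, 0], [5, 5], [9, 12]]]
true = [[[10, 13], [24, 20], [30, 30]], [[0, 0], [5, 5], [9, 12]]]
print(mean_pixel_distance_error(pred, true))  # (3 + 4 + 0 + 0 + 0 + 0) / 6 ≈ 1.167
```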

Reviewer 1
The authors now briefly touch on the point of "signal-to-noise ratio and data quality", but again address this very superficially with a computational mindset. My initial point was more with respect to the question of "How good do your data need to be?"

Dear reviewer, thank you for your comment, and we apologize for misunderstanding your initial comment. After doing some research, we are still uncertain about your question. Could we ask you to elaborate further on how we might find the signal-to-noise ratio of our data set, or how we might define how good our data are? We provide a response below to the best of our knowledge.
We provide all the specifications used to capture the images, including the parameters (resolution, zoom), for reproducibility. The images are in general of high resolution and can be considered to have a high signal-to-noise ratio. Visual inspection and subjective assessment indicate that other measures of image quality, such as contrast and sharpness, are generally very good, while we very rarely observe distortion or artifacts. In this paper we did not experiment with adding white noise to understand how the model performs with increased noise; this may be studied in future research.

Data Description section: "Nonetheless, to further explore the variation in the whole data set, we stress the desirability of applying the technique developed here to all wings available from all 27 volumes of data. This is, however, a major undertaking and was beyond the scope of the present study." This paragraph needs to go into the discussion.
Thank you for the suggestion. The paragraph has been incorporated into the Discussion (lines 518-520).

Line 335 — typo: "squared" should be "squared".

Thank you, corrected.

Reviewer 2
I do not understand why the authors did not perform the suggested, very basic, evaluation of a baseline (mean position prediction for landmark detection, on the same data and protocol as line 230), without giving a convincing argument for not implementing it. Indeed, it should take less than one day on tetse_data.csv: for each landmark, compute the mean coordinate values on the learning set, use these mean values as the predictor, compute metrics on the test set, and report them in the same way as for the deep learning models. I still believe such a baseline is important when releasing a new dataset. This basic evaluation could strengthen the interest of using their deep learning approaches and let readers better assess how difficult the task is. Instead, the authors only replied: "The suggestion for trying a model that predicts the mean of landmark positions from the training set is a neat way of comparing how much better the models perform compared to the most straightforward solution. We suggest that the models we present are baseline models, i.e. a good starting point for further development".
We are thankful for the suggestion. We performed the baseline evaluation and added the results to the main text, see lines 228-229, 334-337, Figure 5 and corresponding caption.
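The baseline the reviewer describes can be sketched as follows. The toy arrays and variable names are illustrative only; they are not the layout of tetse_data.csv or the paper's code.

```python
import numpy as np

def mean_position_baseline(train_landmarks, test_landmarks):
    """Baseline predictor: for every test wing, predict the per-landmark
    mean coordinates computed on the training set, then report the mean
    pixel distance error. Arrays have shape (n_wings, n_landmarks, 2)."""
    train = np.asarray(train_landmarks, dtype=float)
    test = np.asarray(test_landmarks, dtype=float)
    mean_pred = train.mean(axis=0)                      # (n_landmarks, 2)
    errors = np.linalg.norm(test - mean_pred, axis=-1)  # distance per landmark
    return mean_pred, errors.mean()

# Toy data: 4 training wings, 2 test wings, 2 landmarks each (hypothetical).
train = np.array([[[0, 0], [10, 10]],
                  [[2, 0], [10, 12]],
                  [[0, 2], [12, 10]],
                  [[2, 2], [12, 12]]])
test = np.array([[[1, 1], [11, 11]],
                 [[1, 1], [11, 11]]])
pred, err = mean_position_baseline(train, test)
print(pred)  # per-landmark training means: (1, 1) and (11, 11)
print(err)   # zero error on this (deliberately symmetric) toy test set
```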
Their answer concerning some parameter value choices (Section training implementation, data augmentation) is OK, but this could be translated to an additional sentence in the main text stating that for these parameters "we chose them based on a visual inspection of what degree of augmentation seemed reasonable" (line +/-280).
Thank you for your suggestion. We added the following sentence to the manuscript at lines 280-281: "We chose these intervals based on a visual inspection of what degree of augmentation seemed reasonable."

I don't think the abstract and author's summary should mention "The methods we have developed should apply to studying the wings of any insect species" (abstract) nor "Our method applies to the study of the wings of any insect species" (author's summary), as this has not been tested (as recognized later in the main text, lines 570-571). Moreover, although the source code is provided, it is in a state that will not facilitate direct reuse on other data: it is provided as Jupyter Notebooks without library versioning (a pip requirements file would be welcome), with specific filenames explicitly mentioned in the source code, and with very little documentation. So I think applying the methodology to other data would not be straightforward.
We agree with the Reviewer and have made the corresponding changes to our wording throughout the manuscript.
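For concreteness, the kind of rotation-and-translation augmentation discussed in the exchange above might be sketched as below. The interval bounds are hypothetical placeholders, since the actual values in the paper were chosen by visual inspection; in practice the same transform would also be applied to the image.

```python
import numpy as np

def augment_landmarks(landmarks, image_shape, max_angle_deg=5.0,
                      max_shift=10.0, rng=None):
    """Apply one random rotation + translation, drawn from hypothetical
    intervals, to landmark coordinates given as (x, y) pairs.
    image_shape is (height, width); rotation is about the image centre."""
    if rng is None:
        rng = np.random.default_rng()
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    shift = rng.uniform(-max_shift, max_shift, size=2)
    centre = np.array([image_shape[1] / 2, image_shape[0] / 2])  # (cx, cy)
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return (np.asarray(landmarks, dtype=float) - centre) @ rot.T + centre + shift

# Two hypothetical landmarks on a 1024 x 1280 image.
landmarks = np.array([[100.0, 200.0], [500.0, 640.0]])
print(augment_landmarks(landmarks, image_shape=(1024, 1280)))
```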

Reviewer 3
"We agree that despite our best efforts, we might not have caught all misalignments. As the reviewer mentions, the alignment method is less robust in the specific case where pages have a small number of misalignments; hence, we are likely to detect the majority of misalignments. Furthermore, our sample statistics (lines 378-382) suggest that we could detect the expected amount of misaligned pages."

I agree with the authors that having a method that performs accurately for the current task is a good solution. However, I would suggest including further information about the algorithm's limitations in the discussion so other researchers can take it into account.

We incorporated and clarified the limitations of the algorithm in the Discussion (lines 478-487) as follows: "Secondly, we cannot ensure that our R^2 threshold method achieves perfect alignment between the images and the biological data; therefore, some misalignments might still be present. This is because the threshold method is less robust in the specific case where pages have a small number of misalignments, i.e. pages with only a few misalignments are less likely to be detected since they will most likely still have a high R^2 value. We are more likely to pick up misalignments due to photographing mistakes that happened at the beginning of a page, since this will mean the subsequent flies will not be in the correct order, resulting in a low R^2. However, the sample statistics for misaligned pages indicate that the expected amount was detected, thus the majority of misalignments were found and corrected."

Figure 2: It provides supportive graphical information for understanding the pipeline. If possible, I would suggest the authors switch the order of the tasks (segmentation and classification) to align with the order of the tiers.
Thank you for the suggestion. It has been done in the revised manuscript.

Wing reorientation: While it is true that this task is not a milestone of the proposed pipeline, the following statement could be slightly too strong, particularly given the lack of a numerical comparison between the non-reorientation and reorientation approaches: "The orientation of features (veins) for each semantic class would only be consistent if we trained it by first converting all images to face the same direction. If we apply the regression network on an image facing the wrong direction, it won't perform very well." CNNs can learn to be invariant to translation/rotation quite easily, and this is indeed a typical feature when analyzing bio-images. Namely, in this case the network sees the entire wing, so it should be even easier for the CNN to identify all the landmarks. For this reason, without strong evidence, I would recommend the authors soften such statements.
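For reference, the R^2 threshold check for page alignment discussed in the Reviewer 3 exchange above might be sketched as below. The fly numbers, function name, and threshold value are illustrative assumptions, not the paper's actual code or cut-off.

```python
import numpy as np

def page_r_squared(recorded_order, photographed_order):
    """R^2 of the relationship between the fly numbers in the biological
    records and the order in which wings appear on a photographed page.
    A page in the recorded order gives R^2 near 1; a scrambled page
    gives a much lower value."""
    r = np.corrcoef(recorded_order, photographed_order)[0, 1]
    return r ** 2

aligned = [101, 102, 103, 104, 105]   # page photographed in recorded order
shuffled = [103, 101, 105, 102, 104]  # page with several misalignments

print(page_r_squared(aligned, aligned))   # near 1.0
print(page_r_squared(aligned, shuffled))  # well below 1.0

R2_THRESHOLD = 0.9  # hypothetical cut-off for flagging a page
is_misaligned = page_r_squared(aligned, shuffled) < R2_THRESHOLD
print(is_misaligned)
```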