Identifying geographically differentiated features of Ethiopian Nile tilapia (Oreochromis niloticus) morphology with machine learning

doi:10.1371/journal.pone.0249593

Table 1.

The six Ethiopian lakes from which the 209 Nile tilapia specimens originate.

Fig 1.

Data analysis of Nile tilapia images.

The objective of our approach is to provide robust conclusions by randomly reshuffling the original data and repeating the analysis ten times. The input features considered in our analysis are 1) established landmarks of fish morphology, which underwent a generalized Procrustes analysis (GPA); 2) Gaussian process latent variable representations of fish images (GP-LVM); and 3) features extracted by deep convolutional neural networks (CNN). Nile tilapia origin is classified from GPA and selected GP-LVM features with Gaussian process classifiers (GPC) and with Bayesian multilayer perceptrons (MLP), which were inferred with a hybrid Monte Carlo algorithm. Combining feature extraction and classification, the CNN is applied directly to Nile tilapia images. Unbiased predictions were obtained with tenfold cross testing and assessed by generalization accuracy (Acc) and by mutual information between features and labels (MI). McNemar's test was used to assess whether the differences between the worst-performing classifier and all improved results are statistically significant (Sig). Feature ranks based on automatic relevance determination (ARD) levels allow us to reason about which visual features of Nile tilapia adapt to habitat.
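
The significance assessment mentioned in this caption can be illustrated with a small sketch. The caption does not state which variant of McNemar's test was used, so the exact (binomial) two-sided form below is an assumption, and the function name `mcnemar_exact` is hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for two classifiers scored on the same samples.

    correct_a, correct_b: sequences of booleans, True where the respective
    classifier predicted the sample correctly.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both classifiers must be scored on the same samples")
    # Only discordant pairs (exactly one classifier correct) carry information.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the classifiers never disagree
    # Under the null hypothesis each discordant pair favors either classifier
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

Passing the per-sample correctness of the reference pipeline and of a competing pipeline returns a p-value; small values indicate that the two pipelines differ in their error patterns.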

Fig 2.

Landmark positions on an image of a Nile tilapia fish.

Classical morphometric analysis of Ethiopian Nile tilapia is based on the 14 landmark positions illustrated in this image of a Nile tilapia fish.

Table 2.

Characterization of the 14 landmark positions in Fig 2.

Fig 3.

Sketch of the CNN architecture we use for predicting the lake of origin from 224 × 224 gray scale Nile tilapia images.

Our architecture has 16 layers. We apply the Keras version of VGG-16 by [24], which was trained on ImageNet data [52], and replace only its fully connected layers: ours use 1024 nodes and are followed by a 6-class softmax output layer.
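
A minimal sketch of how such a model might be assembled with the Keras `VGG16` application is given below. Several details are assumptions rather than facts from the caption: a 224 × 224 single-channel input that is replicated to three channels, two fully connected layers of 1024 nodes, and ReLU activations. The function name `build_lake_classifier` is hypothetical.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16


def build_lake_classifier(weights="imagenet", n_lakes=6, fc_units=1024):
    """VGG-16 convolutional base with a custom fully connected head (sketch)."""
    # Assumption: gray scale input, copied to three channels so that the
    # ImageNet-trained convolutional filters can be reused.
    inputs = keras.Input(shape=(224, 224, 1))
    x = layers.Concatenate(axis=-1)([inputs, inputs, inputs])
    # Convolutional part of VGG-16 without its original classifier head.
    base = VGG16(include_top=False, weights=weights, input_shape=(224, 224, 3))
    x = base(x)
    x = layers.Flatten()(x)
    # Assumption: two hidden layers of 1024 nodes each.
    x = layers.Dense(fc_units, activation="relu")(x)
    x = layers.Dense(fc_units, activation="relu")(x)
    # One softmax output per lake of origin.
    outputs = layers.Dense(n_lakes, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```

Calling `build_lake_classifier(weights=None)` builds the same structure with random initialization, which is convenient for checking shapes without downloading the ImageNet weights.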

Fig 4.

Scatter plot of GPA transformed landmark positions.

The coordinates of the landmarks listed in Table 2 are processed with a generalized Procrustes analysis. After color coding by lake, the 14 two dimensional landmark coordinates of all Nile tilapia samples are visualized as dots. The solid line connects the center locations of all landmarks.
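
For readers unfamiliar with generalized Procrustes analysis, the following sketch shows the usual steps for two-dimensional landmarks: centering, scaling to unit centroid size, and iterative rotation onto the mean shape. It is an illustration under stated assumptions, not the implementation used for this figure: landmarks are encoded as complex numbers (x + iy), reflections are not handled, and the first sample serves as the initial reference.

```python
import math


def normalize(shape):
    """Center a shape (list of complex landmarks) and scale it to unit centroid size."""
    centroid = sum(shape) / len(shape)
    centered = [z - centroid for z in shape]
    size = math.sqrt(sum(abs(z) ** 2 for z in centered))
    return [z / size for z in centered]


def rotate_onto(shape, reference):
    """Rotate a shape so that its squared distance to the reference is minimal."""
    s = sum(z * w.conjugate() for z, w in zip(shape, reference))
    if abs(s) == 0:
        return list(shape)
    rotation = s.conjugate() / abs(s)  # unit complex number encoding the optimal angle
    return [z * rotation for z in shape]


def generalized_procrustes(shapes, tol=1e-12, max_iter=100):
    """Align all shapes to their iteratively re-estimated mean shape."""
    shapes = [normalize(s) for s in shapes]
    mean = shapes[0]  # assumption: the first sample is the initial reference
    for _ in range(max_iter):
        shapes = [rotate_onto(s, mean) for s in shapes]
        new_mean = normalize([sum(col) / len(shapes) for col in zip(*shapes)])
        change = sum(abs(a - b) ** 2 for a, b in zip(new_mean, mean))
        mean = new_mean
        if change < tol:
            break
    return shapes, mean
```

The returned aligned shapes correspond to the dots of such a scatter plot, and the returned mean corresponds to the center locations of the landmarks.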

Fig 5.

Visualization of GP-LVM projections.

A scatter plot matrix provides pairwise impressions of the latent dimensions F0 to F3. Coloring by lake shows that dimension F0 separates lake Chamo from the other lakes while dimensions F1 and F2 show distinct values for lakes Koka and Tana.

Table 3.

Average performance metrics of all tested classifiers.

Fig 6.

Box plots of generalization metrics that were obtained by resampling.

The metric values underlying these box plots were obtained by tenfold cross testing after reshuffling the samples ten times. Every box illustrates the performance characteristics of a distinct combination of input feature and classifier. A description of the acronyms is provided in the legend of Table 3 and in the text. Panel a) illustrates generalization accuracies in percent, b) illustrates the distribution of mutual information, and c) visualizes the McNemar p-values on a logit scale when comparing GPA+GPC against the other analysis pipelines.
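
The resampling scheme described here, ten reshuffles with tenfold cross testing each, can be sketched as follows. The generator name `repeated_kfold` and the seed are hypothetical, and the folds are not stratified by lake because the caption does not say whether stratification was used.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for reshuffled k-fold cross testing."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        # Reshuffle the sample order before every repetition.
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(n_folds):
            # Every n_folds-th sample of the shuffled order is held out for testing.
            test_idx = order[fold::n_folds]
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield repeat, fold, train_idx, test_idx
```

With 209 specimens this yields 100 train/test splits; within each repetition every specimen is tested exactly once, so each repetition contributes one value per metric to a box.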

Fig 7.

Variation of GP-LVM feature dimensions in image space.

Image color is obtained by mapping the variation of individual GP-LVM dimensions to image coordinates. Blue represents low variation, yellow represents intermediate variation, and red represents image regions which correspond to large variation in GP-LVM space. Odd rows display the entire saliency map. Even rows retain only highly relevant pixels which are assessed as significant by Eq (5) (p-value < 0.001). Panel a) illustrates 12 GP-LVM dimensions which represent genuine fish morphology or skin markings and allow us to draw valid conclusions about location-specific adaptation. Panel b) illustrates 4 rogue GP-LVM feature dimensions which represent technical artifacts. The improved performance we observe in Table 3 for Top GP-LVM inputs is thus aided by technical artifacts which are correlated with the class label lake.

Fig 8.

LPR diagnostic plots for highly representative Nile tilapia samples.

This figure shows fish images, LPR saliency maps and diagnostic plots which overlay a red color mask on the fish images to highlight image features that contribute significantly to the prediction. The chosen samples are highly representative of the respective lake of origin. While the predictions for the samples from lakes Hawassa, Koka and Ziway rely on genuine fish features, the predictions for lakes Chamo and Langano use information in the image background. The prediction for the lake Tana sample is aided by the presence of the fixation pin.

Fig 9.

Selected LPR diagnostic plots with visible contamination by technical artifacts.

This figure shows fish images, LPR saliency maps and diagnostic plots which overlay red color masks on the fish images to highlight significant image features. All images contain indications that genuine Nile tilapia body features are considered relevant for predicting the lake of origin. It is, however, also evident that features in the image background, dirt particles and the presence of the fixation pin provide rogue information which aids in predicting the respective lake of origin.

Fig 10.

ARD level based ranking of GPA-shape features.

For ranks one to five, the pie charts visualize how often, relative to the number of repetitions, each landmark was ranked at the respective position when repeating inference ten times on reshuffled data. The graph in a) illustrates the ranks which we obtain with GPC based ARD levels. The graph in b) illustrates the ranks which we obtain with HMC-MLP based ARD levels. For improved reproducibility we look for agreement between both rankings and conclude that the upper tip of the snout (UTP), the posterior end of the mouth (EMO), the anterior insertion of the dorsal fin (AOD), and the posterior insertion of the dorsal fin (POD) show signs of differentiation. This suggests that dorsum and snout are most affected by adaptation.
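
The tallying behind such pie charts can be sketched in a few lines. The sketch assumes that every repetition provides one relevance score per landmark and that larger scores mean higher relevance; how ARD levels map to such scores is not specified in the caption, and the function name `rank_frequencies` is hypothetical.

```python
from collections import Counter


def rank_frequencies(relevance_runs, top_k=5):
    """Relative frequency of each landmark at rank positions 1..top_k.

    relevance_runs: one dict per repetition, mapping landmark name to a
    relevance score (assumption: larger means more relevant).
    Returns a list with one {landmark: fraction} dict per rank position.
    """
    n_runs = len(relevance_runs)
    counts = [Counter() for _ in range(top_k)]
    for relevance in relevance_runs:
        # Sort landmarks from most to least relevant for this repetition.
        ranked = sorted(relevance, key=relevance.get, reverse=True)
        for position, landmark in enumerate(ranked[:top_k]):
            counts[position][landmark] += 1
    return [{lm: n / n_runs for lm, n in c.items()} for c in counts]
```

Each returned dict holds the slice sizes of one pie chart. Computing the frequencies separately for the GPC and HMC-MLP rankings allows the agreement check described in the caption.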

Fig 11.

ARD level based visualizations of GP-LVM features in image space.

This figure illustrates Nile tilapia body regions which are indicative of sample origin. Regional importance is visualized as a color transition, with blue indicating little importance, yellow indicating intermediate importance and red indicating high importance. The importance of image regions combines the GP-LVM variation maps in Fig 7 with the ARD level based ranks which we obtain by classifying the lake of origin from 14 selected GP-LVM dimensions with Bayes inferred GPC and HMC-MLP. Odd rows in Fig 7 show the importance levels of the top two rank positions. Even rows focus attention on image regions which are assessed as significant by Eq (5) (p-value < 0.001). Visualizations are tagged by the classification procedure which provides the ARD levels for ranking. With red indicating important image regions, the visualizations allow the conclusion that the anterior dorsal region, the belly region, the posterior dorsal region and the caudal fin of Nile tilapia are indicative of the origin of fish specimens.
