Fig 1.
One of the two settings for photography.
A high-end camera phone (Vivo V21) equipped with a clip-on macro lens and an LED ring light was mounted on a holder. A mosquito was placed approximately 7 cm (the shortest focusable distance) below the phone's camera. The smartphone was triggered by remote control to minimize vibration and ensure high-quality photos. In the other setting, mosquitoes were photographed from a variety of angles using a midrange (Samsung Galaxy A52s) or low-end (Vivo Y21) camera phone without the aid of a holder or lighting. Photos taken in this way are of low quality because of blurring caused by hand movement, shading caused by flash lighting, or higher noise caused by a low-quality sensor.
Fig 2.
Cropped and resized mosquito images.
With the correct setting and a high-quality camera, the 512 × 512 images are of excellent quality (left). Without a holder, adequate lighting, or a camera of sufficient quality, the resulting image may be blurry, with high levels of shading and noise (middle). Blurry photographs reduce the accuracy of all classification models; the image on the right was deliberately blurred from the one in the middle.
Table 1.
Biological mosquito species, collection sites, and the numbers of female mosquitoes and of images taken under various lighting conditions with three different smartphones.
Fig 3.
The network is composed of five convolution blocks, each followed by a max-pooling layer, and a final block of three fully connected layers.
Fig 4.
Conventional dropout (left) randomly blocks individual nodes (gray squares) anywhere in the 3-D feature volume; spatial dropout (right) first selects some channels at random and then blocks every node in the selected channels.
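The difference between the two dropout variants can be illustrated with a minimal, standard-library sketch. The function names, the C x H x W nested-list layout, and the rescaling of surviving nodes by 1/(1 - rate) (the common "inverted dropout" convention) are assumptions made for illustration and are not taken from the paper.

```python
import random

def conventional_dropout(x, rate, rng):
    """Drop each node of a C x H x W volume independently with probability `rate`."""
    keep = 1.0 - rate
    # One random draw per node; survivors are rescaled (assumed convention).
    return [[[v / keep if rng.random() >= rate else 0.0 for v in row]
             for row in channel]
            for channel in x]

def spatial_dropout(x, rate, rng):
    """Drop whole channels: one random draw per channel, shared by all its nodes."""
    keep = 1.0 - rate
    out = []
    for channel in x:
        dropped = rng.random() < rate  # single decision for the entire channel
        out.append([[0.0 if dropped else v / keep for v in row] for row in channel])
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical feature volume: 4 channels of 2 x 2 ones.
    volume = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(4)]
    print(conventional_dropout(volume, 0.5, rng))
    print(spatial_dropout(volume, 0.5, rng))
```

In the spatial variant every channel of the output is either entirely zero or entirely kept, which is the behavior shown in the right-hand panel.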
Fig 5.
The model is modified from VGG16 by inserting a spatial dropout layer after each of the five max-pooling layers.
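The layer ordering described in this caption can be written out as a simple plan. The sketch below is illustrative only: it uses the standard VGG16 block configuration (2, 2, 3, 3, 3 convolution layers with 64 to 512 channels) and the standard 4096-unit fully connected layers, which may differ from the head used in the paper; `layer_plan` and the tuple labels are hypothetical names.

```python
# Standard VGG16 configuration: (number of 3x3 conv layers, output channels) per block.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def layer_plan(spatial_dropout_rate=None):
    """Return the ordered layer list; add spatial dropout after each pool if a rate is given."""
    layers = []
    for n_convs, channels in VGG16_BLOCKS:
        layers += [("conv3x3", channels)] * n_convs
        layers.append(("maxpool2x2", None))
        if spatial_dropout_rate is not None:
            # The modification described in the caption: dropout right after pooling.
            layers.append(("spatial_dropout", spatial_dropout_rate))
    # Fully connected head; widths are the standard VGG16 values (an assumption here).
    layers += [("fc", 4096), ("fc", 4096), ("fc", "num_classes")]
    return layers

if __name__ == "__main__":
    # 0.1 is an arbitrary example rate, not a recommended setting.
    for layer in layer_plan(spatial_dropout_rate=0.1):
        print(layer)
```

Calling `layer_plan()` without a rate gives the unmodified layout, so the two plans differ only by the five inserted dropout entries.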
Fig 6.
Adapting the idea of spatial dropout, channels from two images are combined by interleaving the odd channels of one image with the even channels of the other.
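A minimal sketch of this channel interleaving follows. It assumes that channels are counted from 1, so that the "odd" channels are the first, third, and so on, and that the combined output has the same number of channels as each input; the function name `interleave_channels` is hypothetical.

```python
def interleave_channels(a, b):
    """Combine two equally long lists of channels into one list of the same length."""
    if len(a) != len(b):
        raise ValueError("both inputs need the same number of channels")
    # Channels 1, 3, 5, ... (indices 0, 2, 4, ...) come from `a`;
    # channels 2, 4, 6, ... (indices 1, 3, 5, ...) come from `b`.
    return [a[i] if i % 2 == 0 else b[i] for i in range(len(a))]

if __name__ == "__main__":
    # Strings stand in for H x W feature maps to keep the example readable.
    first = ["a1", "a2", "a3", "a4"]
    second = ["b1", "b2", "b3", "b4"]
    print(interleave_channels(first, second))  # ['a1', 'b2', 'a3', 'b4']
```

Each input contributes half of its channels, in the same way that spatial dropout with a rate of one half would keep half of the channels of a single image.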
Fig 7.
Architecture of early-combined model (Model (A)).
Two branches from two image inputs are combined just after the first convolution block.
Fig 8.
Architecture of middle-combined model (Model (B)).
Two branches from two image inputs are combined after the third convolution block.
Fig 9.
Architecture of late-combined model (Model (C)).
Two branches from two image inputs are combined after the fifth convolution block.
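The three combination points in Models (A), (B) and (C) can be summarized in a short sketch that lists which convolution blocks run separately on each input and which run on the combined features. It assumes five convolution blocks in total, as in VGG16; `COMBINE_AFTER` and `block_layout` are hypothetical names.

```python
# Block after which the two branches are combined, per the captions of Figs 7-9.
COMBINE_AFTER = {"A": 1, "B": 3, "C": 5}
NUM_BLOCKS = 5  # assumed: five convolution blocks, as in VGG16

def block_layout(model):
    """Return (blocks applied to each input separately, blocks applied after combining)."""
    k = COMBINE_AFTER[model]
    per_branch = list(range(1, k + 1))
    combined = list(range(k + 1, NUM_BLOCKS + 1))
    return per_branch, combined

if __name__ == "__main__":
    for name in ("A", "B", "C"):
        per_branch, combined = block_layout(name)
        print(f"Model ({name}): per-branch blocks {per_branch}, combined blocks {combined}")
```

Under this reading, the late-combined model has no convolution blocks after the combination point, so the combined features go directly to the fully connected layers.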
Fig 10.
The system is composed of three deep learning models, Models (A), (B) and (C), which predict the species from the three image pairs formed from three input images, and an ensemble model (Model (D)), which combines their predictions to yield the final classification result.
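The pairing of three input images and the combination step can be sketched as follows. The caption does not state how Model (D) combines the predictions, so the averaging of class probabilities used here is only a stand-in assumption; `image_pairs` and `average_ensemble` are hypothetical names, and the probability vectors are made-up placeholders.

```python
from itertools import combinations

def image_pairs(images):
    """All unordered pairs of the input images; three images give three pairs."""
    return list(combinations(images, 2))

def average_ensemble(prob_vectors):
    """Average class-probability vectors and return (best class index, mean vector).

    Plain averaging is an assumed stand-in for the combination rule of Model (D).
    """
    n = len(prob_vectors)
    num_classes = len(prob_vectors[0])
    mean = [sum(p[k] for p in prob_vectors) / n for k in range(num_classes)]
    best = max(range(num_classes), key=mean.__getitem__)
    return best, mean

if __name__ == "__main__":
    print(image_pairs(["view1", "view2", "view3"]))
    # Placeholder outputs of three models over three classes.
    outputs = [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0], [0.25, 0.75, 0.0]]
    print(average_ensemble(outputs))
```

Other combination rules, such as majority voting or a small trained classifier on the concatenated outputs, would fit the same interface.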
Table 2.
Mean and standard deviation of the classification accuracy of the reference model in comparison to those of the SDVGG16 model with dropout rates from 0.1 to 0.6 on the reference dataset.
Table 3.
Mean and standard deviation of classification accuracy of the reference model in comparison to those of our multi-view models on the reference dataset.
Table 4.
Mean and standard deviation of the classification accuracy of our multi-view models when trained on 80% of the high-quality images and tested on the remaining 20%.
Table 5.
Mean and standard deviation of the classification accuracy of our proposed models, which were trained on all high-quality images but tested on 90% of the low-quality images.
Table 6.
Mean and standard deviation of the classification accuracy of our proposed models, which were retrained on 80% of the high- and low-quality images and tested on the remaining 20%.
Table 7.
Mean and standard deviation of the classification accuracy of our proposed models, which were trained on all high-quality images but tested on 90% of the low-quality images, with one input intentionally blurred.
Table 8.
Mean and standard deviation of the classification accuracy of our proposed models, which were trained on all high-quality images but tested on 90% of the low-quality images, with two inputs intentionally blurred.