Deep learning to enable color vision in the dark

Humans perceive light in the visible spectrum (400-700 nm). Some night vision systems use infrared light that is not perceptible to humans and the images rendered are transposed to a digital display presenting a monochromatic image in the visible spectrum. We sought to develop an imaging algorithm powered by optimized deep learning architectures whereby infrared spectral illumination of a scene could be used to predict a visible spectrum rendering of the scene as if it were perceived by a human with visible spectrum light. This would make it possible to digitally render a visible spectrum scene to humans when they are otherwise in complete “darkness” and only illuminated with infrared light. To achieve this goal, we used a monochromatic camera sensitive to visible and near infrared light to acquire an image dataset of printed images of faces under multispectral illumination spanning standard visible red (604 nm), green (529 nm) and blue (447 nm) as well as infrared wavelengths (718, 777, and 807 nm). We then optimized a convolutional neural network with a U-Net-like architecture to predict visible spectrum images from only near-infrared images. This study serves as a first step towards predicting human visible spectrum scenes from imperceptible near-infrared illumination. Further work can profoundly contribute to a variety of applications including night vision and studies of biological samples sensitive to visible light.


Supplementary training details and model evaluation
For all the experiments, we divided the dataset into 3 parts and reserved 140 images for training, 40 for validation and 20 for testing. To compare performances between different models, we evaluated several common metrics for image reconstruction including Mean Square Error (MSE), Structural Similarity Index Measure (SSIM), Peak Signal-to-Noise Ratio (PSNR), Angular Error (AE), DeltaE and Frechet Inception Distance (FID). FID is a metric that determines how distant real and generated images are in terms of feature vectors calculated using the Inception v3 classification model [1]. Lower FID scores usually indicate higher image quality.
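As an illustration of the simpler pixel-wise metrics listed above, the following is a minimal NumPy sketch of MSE, PSNR and angular error. It is not the evaluation code used in the study: the function names, the assumption that images are floating-point arrays of shape (H, W, 3) scaled to [0, 1], and the `eps` guard are all choices made for this example. SSIM, DeltaE and FID are omitted because they require more machinery (local windows, a color space conversion and a pretrained Inception v3 network, respectively).

```python
import numpy as np

def mse(target, pred):
    """Mean squared error over all pixels and channels."""
    target = np.asarray(target, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    return float(np.mean((target - pred) ** 2))

def psnr(target, pred, data_range=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images.

    data_range is the span of valid pixel values (1.0 is an assumption).
    """
    err = mse(target, pred)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / err))

def mean_angular_error(target, pred, eps=1e-12):
    """Mean angle in degrees between corresponding RGB pixel vectors.

    eps only avoids division by zero; the angle is undefined for
    all-black pixels, which this sketch reports as 90 degrees.
    """
    t = np.asarray(target, dtype=np.float64).reshape(-1, 3)
    p = np.asarray(pred, dtype=np.float64).reshape(-1, 3)
    norms = np.linalg.norm(t, axis=1) * np.linalg.norm(p, axis=1) + eps
    cos = np.sum(t * p, axis=1) / norms
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(angles.mean())
```

MSE and PSNR are sensitive to intensity differences, whereas angular error depends only on the direction of each RGB vector, so it isolates errors in hue from errors in brightness.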
We aimed to select the metric that best reflects human perception for our task. To this end, we calculated all of the above-mentioned scores and visually inspected the results. All models in our experiments were trained for 100,000 iterations with a learning rate starting at 1 × 10^−4 and cosine learning rate decay, using randomly cropped patches of size 256 × 256 and normalization to [−1, 1]. Given the fully convolutional nature of the proposed architectures, entire images of size 2048 × 2048 were fed to the network for prediction at inference time. As the loss function for the neural networks, i.e. U-Net and the U-Net-GAN generator, we used mean absolute error (MAE).
In the following sections, we describe the experimental settings and provide the graphs of the metrics that helped us identify the best image evaluation metric and the best model.
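The training settings described above (cosine decay from an initial learning rate of 1 × 10^−4 over 100,000 iterations, random 256 × 256 crops, normalization to [−1, 1] and an MAE loss) can be sketched in plain Python as follows. This is a hedged illustration rather than the study's training code: the final learning rate of zero, the 8-bit input range and all function names are assumptions made for this example.

```python
import math
import random

BASE_LR = 1e-4          # initial learning rate stated in the text
TOTAL_ITERS = 100_000   # number of training iterations stated in the text
PATCH = 256             # patch size stated in the text

def cosine_lr(iteration, base_lr=BASE_LR, total=TOTAL_ITERS, min_lr=0.0):
    """Cosine decay from base_lr to min_lr (min_lr = 0 is an assumption)."""
    progress = min(max(iteration / total, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def random_crop_origin(height, width, patch=PATCH, rng=random):
    """Top-left corner of a random patch lying fully inside the image."""
    top = rng.randint(0, height - patch)    # randint is inclusive at both ends
    left = rng.randint(0, width - patch)
    return top, left

def normalize(pixel, max_value=255.0):
    """Map an intensity in [0, max_value] to [-1, 1] (8-bit input assumed)."""
    return pixel / max_value * 2.0 - 1.0

def mae(target, pred):
    """Mean absolute error between two equal-length sequences of values."""
    return sum(abs(t - p) for t, p in zip(target, pred)) / len(target)
```

With these defaults the schedule starts at 1 × 10^−4, passes through half that value at iteration 50,000 and reaches zero at iteration 100,000. For a 2048 × 2048 image, valid crop origins range from 0 to 1792 along each axis.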

Model selection
To reconstruct RGB images from individual near-infrared illuminations or combinations of them, we tried the following four variants of U-Net-like models: a U-Net-inspired CNN, a U-Net-inspired CNN with ImageNet pretrained weights, a U-Net augmented with adversarial loss (a model similar to Pix2Pix [2]), and a U-Net augmented with adversarial loss and ImageNet pretrained weights.
The following graphs were obtained using the validation dataset. Initializing with ImageNet pretrained weights did not significantly alter performance relative to the models without pretraining. A possible explanation is the difference in domain: the ImageNet dataset does not contain many human images, while human portraits predominated in the current study's dataset. Therefore, we elected to use the simpler training settings without ImageNet pretrained weights. However, it was not clear which model, UNet or UNet-GAN, performed better, as the metrics gave contradictory results. Therefore, we visually inspected patches from UNet and UNet-GAN and compared them with the ground truth (Fig. 7). Fig. 7 demonstrates that UNet produced a blurry result, whereas the patch from UNet-GAN almost perfectly reconstructed the ground truth. The metric that correlated best with our conclusions was FID; therefore, we used it as the primary metric and reported it in the main paper. Taking everything together, we picked UNet-GAN without ImageNet weights and used FID as the guiding metric for the quality of image reconstruction. Moreover, the minimum FID was reached when the model was trained for 80K iterations.

Wavelength selection
For our experiments we had three infrared images with illumination wavelengths of 718, 777, and 807 nm. To determine the optimal infrared inputs for visible-wavelength image reconstruction, we evaluated image reconstruction for all single wavelengths, their pairwise combinations, and the combination of all three infrared wavelengths. The evaluation was performed using the validation dataset. We also wanted to verify that the best-performing model was the one at 80K iterations. FID indicated that three wavelengths gave the best result when training ran for 80K iterations. We used these parameters for our final evaluations on the test dataset and for the comparison with the baseline linear regression model.
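The set of input configurations compared in this section (single wavelengths, pairs and the full triple) can be enumerated as in the sketch below, which also shows how the configuration with the lowest validation FID would be picked. The function names and the dictionary layout keyed by (wavelength subset, iteration) are hypothetical and serve only to illustrate the selection procedure.

```python
from itertools import combinations

IR_WAVELENGTHS_NM = (718, 777, 807)  # infrared wavelengths stated in the text

def wavelength_subsets(wavelengths=IR_WAVELENGTHS_NM):
    """All non-empty subsets, ordered by size: singles, pairs, then the triple."""
    return [
        subset
        for size in range(1, len(wavelengths) + 1)
        for subset in combinations(wavelengths, size)
    ]

def best_configuration(fid_by_config):
    """Return the key with the lowest FID, since lower FID is better.

    fid_by_config is a hypothetical mapping of the form
    {(wavelength_subset, iteration): fid_score}.
    """
    return min(fid_by_config, key=fid_by_config.get)
```

For three wavelengths this yields seven configurations: three single-wavelength inputs, three pairs and one triple.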

Evaluation on the test dataset and comparison with the baseline
Our final step was to evaluate the models on the test dataset and compare them with the baseline linear regression model. We picked our best model at 80K iterations, its counterpart without adversarial loss, i.e. UNet, again at 80K iterations, and linear regression. The following figures report the scores for all metrics, but we focused only on FID. To demonstrate model performance in relation to the quantitative results, Fig. 20 provides a representative example of trained model output when using three infrared wavelength inputs to predict the ground truth visible spectrum image. It is evident that the simple linear regression model produces images whose color features are not similar to the ground truth. In contrast, the deep architectures better captured the colors of the target RGB ground truth image. While the UNet and UNet-GAN reconstructions appear similar to each other when viewed as whole images, the patch analysis shown in Fig. 7 demonstrates the superiority of the adversarial network, which is also reflected by the lower FID score (Fig. 19). Fig. 21 shows arithmetic differences between ground truth and predicted images, which further supports the quality of our predictions. The arithmetic difference was computed using ImageJ's image calculator function to subtract an array of predicted images from an array of ground truth images, producing an array of images in which the difference between the two image sets is visualized.
Fig. 21. Arithmetic differences between ground truth and predicted images. The prevalence of dark colors, i.e. values close to zero, means that the predictions are very close to the ground truth.
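To make the baseline and the difference images concrete, the sketch below fits a per-pixel linear regression from infrared channels to RGB by least squares and computes a difference image. The text does not specify how the linear regression baseline was formulated, so the per-pixel affine model used here is an assumption, as are the function names. The difference is taken as an absolute value in NumPy, which may not match the exact behavior of ImageJ's image calculator.

```python
import numpy as np

def _design_matrix(ir):
    """Flatten (H, W, C) to (H*W, C) and append a constant bias column."""
    x = ir.reshape(-1, ir.shape[-1]).astype(np.float64)
    return np.hstack([x, np.ones((x.shape[0], 1))])

def fit_linear_baseline(ir, rgb):
    """Least-squares fit of rgb ~ [ir, 1] @ weights over all pixels.

    ir:  array of shape (H, W, C), e.g. C = 3 infrared channels
    rgb: array of shape (H, W, 3)
    Returns weights of shape (C + 1, 3); the last row is the bias.
    """
    y = rgb.reshape(-1, 3).astype(np.float64)
    weights, *_ = np.linalg.lstsq(_design_matrix(ir), y, rcond=None)
    return weights

def predict_linear_baseline(ir, weights):
    """Apply the fitted affine map to every pixel; returns shape (H, W, 3)."""
    return (_design_matrix(ir) @ weights).reshape(ir.shape[:-1] + (3,))

def difference_image(target, pred):
    """Per-pixel absolute difference; values near zero render as dark."""
    return np.abs(target.astype(np.float64) - pred.astype(np.float64))
```

Such a model applies the same color mapping to every pixel regardless of its surroundings. The convolutional models can additionally use spatial context, which a per-pixel map of this kind cannot.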

Inference time
To decrease inference time, we tried varying the backbone of the generator and also explored different generator architectures. Table 1 shows the MSE scores and inference times for several combinations of architectures and backbones. It is evident that replacing the VGG16 encoder with MobileNet did not significantly worsen the result; however, the inference time did not improve either. Although FPN, PSPNet, and LinkNet improved (i.e. decreased) the inference time, the quality of reconstructions (as measured by the MSE) also dropped.
Table 1. Replacing the VGG16 network with MobileNet did not significantly change the MSE score or the inference time. However, the number of parameters decreased by almost a factor of 4. Training and testing were performed on the Natural Images dataset, and MSE scores and inference times were evaluated for three held-out samples.
MobileNet 4.320 0.0177 ± 0.0028 193 ± 22
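Inference times reported as mean ± standard deviation, as in Table 1, can be measured along the lines of the sketch below. The helper name, the number of repeats, the warm-up runs and the use of milliseconds are assumptions for this example; the text does not describe the timing protocol or its units.

```python
import statistics
import time

def time_inference(predict, sample, repeats=10, warmup=2):
    """Return (mean, std) of wall-clock milliseconds for predict(sample).

    predict is any callable standing in for a model's forward pass.
    Warm-up calls are excluded so one-time setup costs do not skew the mean.
    """
    for _ in range(warmup):
        predict(sample)
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings_ms), statistics.stdev(timings_ms)
```

When the model runs on a GPU, the device typically executes work asynchronously, so the framework's synchronization call has to be made before reading the clock; otherwise the measured time reflects only the kernel launch.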