Abstract
Prosthetic vision is being applied to partially restore vision in visually impaired people through retinal stimulation. However, the phosphene images produced by the implants have a very limited information bandwidth due to their poor resolution and lack of color and contrast. Object recognition and scene understanding in real environments are therefore severely restricted for prosthetic users. Computer vision can play a key role in overcoming these limitations and in optimizing the visual information presented in prosthetic vision, improving the amount of information conveyed. We present a new approach to build a schematic representation of indoor environments for simulated phosphene images. The proposed method combines a variety of convolutional neural networks to extract and convey relevant information about the scene, such as the structural informative edges of the environment and the silhouettes of segmented objects. Experiments were conducted with normally sighted subjects using a Simulated Prosthetic Vision system. The results show good accuracy for object recognition and room identification tasks in indoor scenes using the proposed approach, compared to other image processing methods.
Citation: Sanchez-Garcia M, Martinez-Cantin R, Guerrero JJ (2020) Semantic and structural image segmentation for prosthetic vision. PLoS ONE 15(1): e0227677. https://doi.org/10.1371/journal.pone.0227677
Editor: Yuanquan Wang, Beijing University of Technology, CHINA
Received: June 12, 2019; Accepted: December 24, 2019; Published: January 29, 2020
Copyright: © 2020 Sanchez-Garcia et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are available from figshare at https://doi.org/10.6084/m9.figshare.11493249.v4.
Funding: This work was supported by projects DPI2015-65962-R, RTI2018-096903-B-I00 (MINECO/FEDER, UE) and BES-2016-078426 (MINECO). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Retinal degenerative diseases such as retinitis pigmentosa and age-related macular degeneration cause loss of vision due to the gradual degeneration of the sensory cells in the retina [1, 2]. Retinal prostheses are currently the most promising technology to improve vision in patients with such advanced degenerative diseases [3–6]. These devices elicit visual perception by electrically stimulating retinal cells. As a result, implanted patients are able to see patterns of spots of light, called phosphenes, that the brain interprets as visual information [7–9]. Current retinal prosthetic devices are limited to hundreds of electrical receptors, which produce a very limited visual elicitation [10–12]. Among current technologies for retinal implants [13], one of the most active lines of research is based on implants with a micro camera that captures external stimuli and a processor that converts the visual information into microstimulations in the implant, as can be seen in Fig 1. Following the computer image paradigm, we can say that the visual information evoked by the implants has very low spatial resolution and a very limited dynamic range (only a few levels of stimulus intensity are perceived as different) [14–16]. Intuitively, from an information theory perspective, the process from the external sensor input to the implant stimuli is analogous to taking a high-definition image and converting it to a low-resolution, grayscale image with just a few gray levels. Thus, a large amount of visual information is lost. Prosthetic vision allows users to recognize objects with simple shapes, to see people's silhouettes in bright light or to detect motion [17], but high-level tasks require more precise visual cues and a deeper interpretation of the information.
The external and internal components include a micro camera, a transmitter, an external processing unit and an implanted electrode array. First, the external camera acquires an image. Then, the external processor converts the image into a suitable pattern of electrical stimulation of the retina through the electrode array.
Recent developments in implants might result in improved resolution and performance of the visual elicitation [18], but the quality would still be several orders of magnitude lower than that of a current micro camera. Alternatively, the visual information gathered by the external camera can be processed prior to being transferred to the retinal electrodes. Image processing can be used to extract and highlight relevant information from the external camera. This information can be presented with visual cues that help the implanted subject understand the perceived scene. Several studies have already been conducted testing specific cues for object recognition [19–25], reading [26–29], facial recognition [30, 31] or navigation [32–36] in the context of prosthetic vision.
One of the most basic image processing tasks, from the cognitive but also from the computational level, is the segmentation of the image into different regions [37–39]. From a statistical point of view, this corresponds to the problem of clustering. Rooted in the Aristotelian laws of association, early research in perception by Gestalt psychologists established the importance of the principles of grouping [40]. These principles state that our brain tends to group image elements based on proximity, color, shape or other similarities. Although some of the Gestalt ideas are controversial, the principles of grouping have been supported by subsequent empirical research [41, 42]. From a computational perspective, image segmentation dates back to the seminal work of Minsky and Papert [43], followed by several works in the 60s and 70s [44, 45]. At that time, segmentation was based on grouping elements as belonging to the same object. Adelson [46] proposed to group elements based on abstract textures and materials, advocating the idea of seeing stuff rather than things. This was the stepping stone for modern semantic segmentation, where the objective is to group image regions based on labels with semantic meaning, without relying on individual objects [47]. Furthermore, the use of semantic labels transforms the clustering problem into a classification problem. Recent research using deep learning has gone one step further to produce instance-aware semantic segmentation [48]. In this case, we are back to the concept of seeing things by grouping the pixels of single objects, but including a semantic label for the object.
For visually impaired people, basic scene understanding is essential for many everyday tasks and it also facilitates subsequent tasks of finer perception. In this work, we use segmentation to provide a basic visual representation of indoor scenes for prosthetic users. We combine both semantic segmentation and instance segmentation. We use instance segmentation to highlight relevant objects in the scene. This has a double purpose: on one hand, we are able to reduce visual clutter, which becomes indistinguishable noise in a low-resolution implant array; on the other hand, the grouping highlights the silhouette of the object, making it more distinguishable. One of the main problems of using object silhouettes for recognition is the lack of a sense of scale or perspective. Thus, we rely on a second, semantic segmentation component to extract structural informative edges of the scenes, such as wall and ceiling intersections. Those edges provide an intuitive representation of the 3D structure of the room, as concluded in [49], where scene recognition results with structural edges were significantly better than the results obtained without them. The idea of combining instance and semantic segmentation has been previously studied in the computer vision literature with different approaches [50, 51] and it has been shown to be of great benefit for holistic scene understanding [52]. The limiting case where every pixel of the image has a semantic label and an instance id is called panoptic segmentation [53].
Current state-of-the-art methods for image segmentation are mostly based on deep neural networks [47]. The most recent developments in semantic and instance-aware segmentation are based on Fully Convolutional Networks (FCNs) [47, 48, 54–56]. An FCN is an architecture based on convolutional layers with added upsampling layers and skip connections that allow detailed pixel prediction on arbitrary-sized inputs [57]. Similar approaches, like the U-Net architecture, are also able to provide accurate pixel predictions [58]. In this work, we use two different types of FCN-based segmentation to highlight the information available in the image and to present the most useful information to the user: PanoRoom [54] for semantic segmentation of structural elements and Mask R-CNN [55] for instance segmentation of relevant objects.
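To make the FCN idea concrete, the sketch below shows a minimal fully convolutional segmenter in PyTorch, with a convolutional encoder, transposed-convolution upsampling and one skip connection. The layer sizes and the `TinyFCN` name are purely illustrative and do not correspond to the networks used in this work.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Minimal FCN-style segmenter: conv encoder, upsampling decoder, one skip connection."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(2))                 # 1/2 resolution
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(2))                 # 1/4 resolution
        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)         # back to 1/2 resolution
        self.up2 = nn.ConvTranspose2d(16, n_classes, 2, stride=2)  # back to full resolution

    def forward(self, x):
        f1 = self.enc1(x)                  # encoder features kept for the skip connection
        f2 = self.enc2(f1)
        y = self.up1(f2) + f1              # skip connection: fuse upsampled and encoder features
        return self.up2(y)                 # per-pixel class scores, same H x W as the input

scores = TinyFCN()(torch.randn(1, 3, 128, 128))   # output shape: [1, 2, 128, 128]
```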
We evaluate and compare the proposed semantic and structural image segmentation with baseline methods through a Simulated Prosthetic Vision (SPV) experiment, which is a standard procedure for non-invasive evaluation using normal vision subjects [19–36]. The experiments included two tasks: object recognition and room identification.
Methods
Subjects
Eighteen subjects with normal vision volunteered for the formal experiment. The subjects (four females and fourteen males) were between 20 and 57 years old.
Ethics statement.
The research process was conducted according to the ethical recommendations of the Declaration of Helsinki and was approved by the Aragon Autonomous Community Research Ethics Committee (CEICA), which evaluates human research projects, human biological samples and personal data. The research protocol used for this study is non-invasive, purely observational, with absolutely no risk for any participant. There was no personal data collection or processing, and all subjects were volunteers. Subjects gave their written informed consent after the purpose of the study and its possible consequences were explained to them. The consent allowed them to abandon the study at any time. All data were analyzed anonymously.
Stimuli
We use a two-step process to generate the stimuli used in the experiments. First, we process the original color image with the different methods stated in the following subsections. This generates a grayscale or binary image which corresponds to the signal used to activate the electrodes of the retinal implant. Then, we use simulated prosthetic vision by generating the phosphene pattern as an image on a computer screen. The phosphene simulation has been designed to represent the descriptions of phosphene perception reported by retinal prosthesis patients.
The following sections describe the three segmentation methods used in the experiments to process the input images and generate the activation of the phosphenes. The first is our proposal based on semantic segmentation with artificial neural networks (SIE-OMS). To our knowledge, this is the first work that has used deep learning models in this setup. Therefore, we have used two standard processing methods as baselines: a) detecting the silhouettes and structure within the scene with a standard edge detector (Edge), and b) generating the stimulus directly from the input image luminance (Direct). Examples of the resulting effect are shown in Fig 2. For reproducibility purposes, all the stimulus images used in the experiments can be found in the dataset available online: https://doi.org/10.6084/m9.figshare.11493249.v4.
Top row: Example of a bathroom scene with the three processing methods used in this work (a) Direct image, (b) Edge image and (c) SIE-OMS image. Bottom row: the three processing methods in the SPV.
SIE-OMS.
We propose to combine two FCNs to select and highlight informative elements in indoor scenes as an intelligent way of activating the phosphenes. Specifically, we extract structural informative edges (SIE) and object masks and silhouettes (OMS), and later combine both, SIE and OMS, to build our proposed schematic representation of the scene (SIE-OMS), as can be seen in Fig 3. This idea comes from our previous study, which concluded that including the SIE in the schematic representation of the scene produces significantly better results in object and scene recognition for SPV than the schematic representation without edges [49].
Structural informative edges (SIE).
One of the main problems in the recognition of scene elements based on silhouettes is the lack of a sense of scale or perspective. The scale and structure of the scene can be recovered by detecting the structural informative edges (SIE), that is, the main edges formed by the intersections of the walls, floor and ceiling of the room. These edges can be seen in Fig 4. Our approach is based on the model by Fernandez-Labrador et al. [54] for indoor scenes. Similarly to the object masks network described below, this method is also based on an FCN for pixel classification. In this case, the network was trained to estimate probability maps representing the room structural edges, even in the presence of clutter and occlusions. The architecture of the network is an encoder-decoder structure [54]. The encoder is built from a ResNet-50 model [59], pre-trained on the ImageNet dataset, with the final layer replaced by a decoder that jointly predicts refined layout edge and corner locations. The output of the model is a single branch with two channels: corner and edge maps. In the decoder, the model employs skip-connections from the encoder, concatenating 'up-convolved' features with their corresponding features from the contracting part. In order to improve the training phase, Fernandez-Labrador et al. suggest performing preliminary predictions at different resolutions, which are concatenated and fed back to the network. The loss function for training is a pixel-wise sigmoid cross-entropy, regularized by the L1-norm of the network parameters. The loss function was minimized using Adam with an initial learning rate of 2.5e-4, exponentially decayed by a rate of 0.995 every epoch. They also applied a 0.3 dropout rate (here, dropout refers to the deep learning regularization technique, not to be confused with phosphene dropout), a 5e-6 weight decay and a batch size of 16, which allowed the learning of more complex rooms. In this work, we have used a model pre-trained on the LSUN dataset [60].
Using [54], we detect the main structure of the room by extracting the structural informative edges (SIE) (right), which are those formed by the intersections of the walls, ceiling and floor of the room (middle).
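As a rough illustration of the training recipe described above (pixel-wise sigmoid cross-entropy regularized by the L1-norm of the parameters, Adam at 2.5e-4 with a 0.995 exponential decay per epoch, 5e-6 weight decay and batch size 16), the PyTorch sketch below uses a stand-in single-layer model, and the L1 coefficient `l1_weight` is an assumed value since the paper does not report it; this is not the authors' original implementation.

```python
import torch
import torch.nn as nn

def sie_loss(edge_logits, edge_target, model, l1_weight=1e-7):
    """Pixel-wise sigmoid cross-entropy plus an L1 penalty on the network weights."""
    bce = nn.functional.binary_cross_entropy_with_logits(edge_logits, edge_target)
    l1 = sum(p.abs().sum() for p in model.parameters())
    return bce + l1_weight * l1

model = nn.Conv2d(3, 2, 3, padding=1)          # stand-in for the edge/corner decoder
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4, weight_decay=5e-6)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.995)  # decay per epoch

x = torch.randn(16, 3, 64, 64)                 # batch size 16, as in the paper
target = torch.rand(16, 2, 64, 64)             # two target channels: edge and corner maps
loss = sie_loss(model(x), target, model)
loss.backward()
optimizer.step()
scheduler.step()
```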
Object masks and silhouettes (OMS).
We perform instance segmentation of objects using the original architecture of Mask R-CNN [55], which is partially represented in Fig 5. This method is an extension of Faster R-CNN [61] with several improvements and an extra branch to segment object masks. The first part of the network, called the Region Proposal Network (RPN), proposes object bounding box candidates on the input image. These candidates are called regions of interest (ROIs). It also generates feature maps from the whole image; in our case, we used the Feature Pyramid Network (FPN) [62]. The second module is a RoIAlign layer that pools a small feature map for each object region from those extracted by the FPN and aligns each ROI to the feature map. This is the backbone architecture. Then, the model splits into two branches, as can be seen in Fig 5. The box branch is based on the classification component of Faster R-CNN. It generates two outputs for each ROI: a) the class of the object present in the ROI, and b) a refined object bounding box obtained with a regression model. The mask branch is a convolutional neural network that takes the high-probability regions selected by the ROI classifier (box branch) and generates a binary mask for each object. It then uses upsampling and deconvolution layers to scale the predicted masks to the size of the ROI bounding box, which gives the final masks, one per object. The loss function of the model is composed of the classification loss, the bounding box regression loss and the mask loss. The mask loss was defined only on positive ROIs, and the mask target was the intersection between an ROI and its associated ground-truth mask. The training was performed with a batch size of 16 for 160k iterations. The learning rate was 0.02, decreased by a factor of 10 at the 120k-th iteration. They also used a weight decay of 0.0001. For this work, we have used a model pre-trained on the COCO dataset [55, 63]. Thus, we have only considered the object classes that were already defined in the pre-trained model. In order to speed up computation and remove spurious detections, we removed the classes of clearly small objects (e.g., scissors, banana, etc.), since the scale of the scene in the image would not allow them to be identified, and of non-indoor objects (e.g., car, tree, etc.). Once the object masks have been generated by the network, we highlight the contour of each mask to avoid confusion between overlapping masks, as can be seen in Fig 6. We also perform morphological operations to reduce the aliasing effect when translated to phosphene images.
Above: box branch for classification and bounding box regression. Below: mask branch for predicting segmentation masks on each Region of Interest (ROI). Numbers denote spatial resolution and channels. Arrows denote either convolutions, deconvolutions, or fully connected layers. The x4 denotes 4 consecutive convolution layers. (Adapted from He et al. [55]).
Object masks were generated with [55] and sorted by probability score to resolve occlusions between objects. The extracted information was combined in an image highlighting the silhouettes of the objects in white and the object masks in gray.
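A minimal sketch of this instance segmentation stage is given below, using the pre-trained Mask R-CNN with a ResNet-50 FPN backbone available in torchvision as a stand-in for the model described above. The COCO class ids in `KEEP_CLASSES`, the 0.7 score threshold and the dilation radius are illustrative choices, not the exact values used in our pipeline.

```python
import numpy as np
import torch
import torchvision
from skimage.morphology import binary_dilation, disk

# Pre-trained Mask R-CNN (ResNet-50 FPN backbone, COCO weights) from torchvision,
# used here as a stand-in for the model described in the text.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")  # torchvision >= 0.13
model.eval()

# Illustrative subset of COCO class ids kept for indoor scenes
# (e.g. chair = 62, couch = 63, bed = 65, dining table = 67, toilet = 70, tv = 72).
KEEP_CLASSES = {62, 63, 65, 67, 70, 72, 73, 79, 81, 82}
SCORE_THRESHOLD = 0.7                               # illustrative confidence cut-off

image = torch.rand(3, 480, 640)                     # placeholder RGB image in [0, 1]
with torch.no_grad():
    out = model([image])[0]                         # dict with boxes, labels, scores, masks

oms = np.zeros((480, 640), dtype=np.uint8)          # object masks and silhouettes (OMS) image
for mask, label, score in zip(out["masks"], out["labels"], out["scores"]):
    if score < SCORE_THRESHOLD or int(label) not in KEEP_CLASSES:
        continue                                    # drop low-confidence and non-indoor detections
    m = mask[0].numpy() > 0.5                       # binarize the soft mask
    contour = binary_dilation(m, disk(3)) & ~m      # thick contour around the mask
    oms[m] = 128                                    # mask interior in gray
    oms[contour] = 255                              # silhouette in white
```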
Dealing with occlusions.
Although this algorithm achieves good results for object segmentation, there are more complicated cases, such as images with overlapping objects or scenes with occlusions, where the view of one object may be blocked by other objects. In such cases, we could use a depth sensor, such as an RGB-D camera, or a stereo camera to estimate the depth. Alternatively, there are works that estimate depth purely from monocular information [64]. As a proof of concept, we found that the probability score of the detection network was correlated with the level of occlusion of each object. Non-occluded objects have a complete shape and are therefore more likely to be detected with high confidence, whereas the shape of occluded objects is incomplete and their detection is less likely. That is, a high probability score is most likely to appear for objects that are in front. Thus, we stacked the instances from the lowest to the highest probability, leaving the objects with the highest score overlapping the objects with the lowest score, as can be seen in Fig 6. This was confirmed experimentally for our setup, which is a simple problem with a limited number of classes and scenarios. In the case of a more general environment with more classes, it would be necessary to use a more complex model, but that is beyond the scope of this paper.
The final representation of the SIE-OMS method is a superposition of both parts, SIE and OMS, always treating the edges as background and the object masks as foreground.
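A possible implementation of this composition step is sketched below: the SIE edge map is drawn first as background, and the object masks are then painted from the lowest to the highest detection score, so that the most confident (typically least occluded) objects end up on top. The function and variable names are illustrative.

```python
import numpy as np
from skimage.segmentation import find_boundaries

def compose_sie_oms(edge_map, masks, scores):
    """Superimpose object masks (foreground) on structural edges (background).

    Masks are drawn from the lowest to the highest detection score, so that
    high-confidence (typically non-occluded) objects end up on top.
    """
    canvas = np.zeros(edge_map.shape, dtype=np.uint8)
    canvas[edge_map] = 255                                 # SIE as background
    for i in np.argsort(scores):                           # least confident first
        m = masks[i]
        canvas[m] = 128                                    # object interior in gray
        canvas[find_boundaries(m, mode="inner")] = 255     # silhouette in white
    return canvas

# Minimal usage with toy data: two overlapping squares, the second one more confident.
edges = np.zeros((64, 64), dtype=bool); edges[32, :] = True
m1 = np.zeros_like(edges); m1[10:40, 10:40] = True
m2 = np.zeros_like(edges); m2[25:55, 25:55] = True
img = compose_sie_oms(edges, [m1, m2], scores=[0.6, 0.9])
```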
Baseline methods.
We have considered two baseline methods, which are the most used in the literature and follow a completely different structure from our SIE-OMS model [65–68]. We compared SIE-OMS with two baseline methods used in retinal prostheses: a) a direct method that converts the input image directly to the phosphene map by averaging the brightness over the region covered by each phosphene, and b) a standard edge detector to extract brightness contours (see Fig 2). The direct method has proved to be very effective in scenes where high contrast predominates [65]. Edge detectors have also been previously used for prosthetic vision and phosphene images [66–68]. Since the contours of an image hold much of its information, edge extraction is a useful method for encoding and selecting the information contained in an image. The drawback is that understanding a complete scene represented by edges in low vision may be more challenging because of the amount of clutter. For example, Sanocki et al. investigated the difficulty of edge-based object recognition by comparing performance with and without removal of background clutter in edge images [69]. The results showed that the increase in the number of edges greatly increases the complexity. For the edge detector, we used the Canny implementation from the scikit-image Python package with the default parameters [70]. In this case, we also added morphological operations (dilation) to reduce aliasing without adding clutter.
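The two baselines can be reproduced with a few lines of scikit-image, as sketched below. The Canny call uses the package defaults, as described above; the grid size, dilation radius and function names are illustrative choices.

```python
import numpy as np
from skimage import color, feature, morphology, transform

def direct_activation(rgb, grid=(32, 32)):
    """Direct baseline: average brightness over the region covered by each phosphene."""
    gray = color.rgb2gray(rgb)
    fy, fx = gray.shape[0] // grid[0], gray.shape[1] // grid[1]
    return transform.downscale_local_mean(gray, (fy, fx))   # one mean value per phosphene

def edge_activation(rgb):
    """Edge baseline: Canny with scikit-image defaults, dilated to reduce aliasing."""
    gray = color.rgb2gray(rgb)
    edges = feature.canny(gray)                              # default parameters
    return morphology.binary_dilation(edges, morphology.disk(2))

rgb = np.random.rand(480, 640, 3)                            # placeholder input image
act_direct = direct_activation(rgb)                          # 32 x 32 activation map
act_edge = edge_activation(rgb)                              # full-resolution binary edge image
```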
Phosphene simulation.
As commented before, the input images are first processed by one of the three methods (SIE-OMS, Direct, Edge), resulting in the grayscale images of Fig 2. Then, these grayscale images are used to activate the phosphene map. In this work, we have used a simulated phosphene map on a computer screen, but the same activation images could be directly applied to the retinal implant.
Based on previous studies with simulated prosthetic vision [9, 71, 72], we approximate the phosphenes as grayscale circular dots with a Gaussian luminance profile: each phosphene has maximum intensity at the center and gradually decays towards the periphery, following a Gaussian function. The intensity of a phosphene is directly extracted from the intensity of the same region in the processed image. The size and brightness are directly proportional to the quantized sampled pixel intensities. For our phosphenic images, the array of phosphenes was limited to 32 x 32 (1024 electrodes) and 8 different luminance levels, according to the number of luminance levels attainable in human trials using retinal prostheses [9, 22]. We also included a 10% dropout of electrodes, which is a standard value used in the literature [72]. Dropout has been shown to significantly affect the performance of recognition tasks, with accuracy decreasing as the dropout percentage increases [73]. The complete process of phosphene generation can be found in the Supplementary material (see S1 Appendix).
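For reference, the sketch below renders an activation map into a phosphene image following the description above (32 x 32 array, Gaussian luminance profile, 8 luminance levels, 10% dropout). The cell size and the Gaussian width `sigma` are illustrative rendering choices; the exact simulation parameters are given in S1 Appendix.

```python
import numpy as np

def render_phosphenes(activation, cell=16, levels=8, dropout=0.10, sigma=0.3, seed=0):
    """Render a 32x32 activation map as phosphenes with a Gaussian luminance profile.

    activation : (32, 32) array in [0, 1], one value per electrode.
    cell       : output pixels per phosphene (illustrative).
    levels     : number of distinguishable luminance levels (8 in our simulation).
    dropout    : fraction of electrodes randomly switched off (10% here).
    """
    rng = np.random.default_rng(seed)
    act = np.round(activation * (levels - 1)) / (levels - 1)        # quantize to 8 levels
    act[rng.random(act.shape) < dropout] = 0.0                      # phosphene dropout

    # Gaussian dot profile, maximal at the center of each cell.
    y, x = np.mgrid[0:cell, 0:cell] / (cell - 1) - 0.5
    profile = np.exp(-(x**2 + y**2) / (2 * sigma**2))

    out = np.zeros((act.shape[0] * cell, act.shape[1] * cell))
    for i in range(act.shape[0]):
        for j in range(act.shape[1]):
            out[i*cell:(i+1)*cell, j*cell:(j+1)*cell] = act[i, j] * profile
    return out

screen = render_phosphenes(np.random.rand(32, 32))                  # 512 x 512 image
```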
Experimental setup
Most SPV configurations are based on a computer screen for the presentation of static or dynamic phosphene images [29, 74, 75]. This methodology allows a controlled evaluation of the responses and task performance of normally sighted subjects, which is fundamental for understanding how humans perceive and interpret phosphenized renderings. SPV also offers the advantage of adapting implant designs to improve perceptual quality using image processing techniques without involving implanted subjects. In our case, the participants were normally sighted subjects seated on a chair facing a computer screen at 1 m distance, resulting in a 20 degree simulated field of view, as can be seen in Fig 7.
SPV setup: Subjects were seated on a chair facing a computer screen at 1 m distance. The visual field was 20 degrees, simulating the prosthetic device. Trial setup: Each gray rectangle represents the image shown on the computer monitor during the trial. Each image appeared for 10 seconds and switched to the next image automatically. The break time between image sequences was 30 seconds. The complete experiment took approximately 15 minutes.
For the formal experiment, subjects were recruited to complete two tasks: object recognition and room identification. Recognition accuracy was analyzed after the trials. Each trial consisted of a sequence of images presented in random order to the subject, generated with the proposed SIE-OMS stimulus method and the two baseline methods (Edge, Direct), as can be seen in Fig 8. At the beginning of the experiment, a white dot was displayed in the center of the screen indicating where the subjects had to maintain their fixation until the beginning of the task. Next, each phosphenic image appeared for 10 seconds and switched to the next image automatically. This procedure was repeated for the remaining test images. To avoid distractions, participants verbally indicated the type of objects seen in each image and their selection of room type while keeping their fixation on the screen. The responses for each image were annotated by the experimenter. If a subject did not respond within the 10 seconds that the image was displayed, the result for that test image was considered not answered (NA). If a subject was only able to respond to one of the two phases of the experiment, only the unanswered phase was considered as not answered. Every 15 test images we made a pause of 30 seconds. The complete experiment took approximately 15 minutes.
Six examples of indoor environments represented with 1024 phosphenes (rows: bathroom, bedroom, dining room, kitchen, living room and office, respectively). Each column shows: a) input images, b) images processed using the Edge method, c) images processed using the Direct method and d) images processed by our SIE-OMS method, respectively.
The experiments were conducted using a public database of indoor scenes [76]. All the images from the database are still-life scenes, from arbitrary scenarios, locations, clutter, cameras and lighting conditions. Some images come from old phone cameras with very poor quality and resolution, making the database more challenging as a computer vision benchmark. Thus, we replaced some images with the first results of querying Google Images with the room label, which also matched the database characteristics (e.g., still life, mid-wide view). For each of the six categories, we randomly selected 50 images. Hence, we conducted the experiment using 300 images from different indoor environments. The original images were processed using our proposed method and the two baseline approaches, resulting in 900 phosphenic images. Prior to beginning the experiment, subjects were informed about the number of images in the experiment (54 images per subject). Subjects were unaware that multiple image processing strategies were used in the experiment, although a screen with four images was shown to the subjects at the beginning of the test. These demo images were not included in the experiment, to avoid learning effects. Subjects were informed that all scenes were indoor scenarios, but they were not informed about the types of room, the object classes, or the number of objects in each image. The types of room studied were: bathroom, dining room, living room, kitchen, office and bedroom. No subject identified a type of room or scene not belonging to that list. In most of the tests, the objects identified by the subjects were: chair, table, couch, toilet, bath, sink, bed, oven/microwave, refrigerator and laptop. This coincides with the list of classes used for our SIE-OMS, which was selected without looking at the database and before conducting any trial or test. As commented before, the object classes were those already included in the pre-trained model. However, in two images with the Direct method, a couple of subjects were able to find a window that our system did not detect because the class was not included. Furthermore, in a couple of cases a subject wrongly identified a wardrobe and a door in images containing a fridge.
Results
The following section shows the results of the experiment. We analyze separately the results of the object recognition phase and the room identification phase. We show the percentage of correct responses in both tasks and include 95% confidence intervals. We also differentiate between incorrect responses and no answer.
Comparison of stimuli generation methods
Table 1 and Fig 9 show the global results for the object recognition and room identification tasks considering the proposed stimulus generator (SIE-OMS) and the two baseline methods (Edge and Direct). The analysis of the average correct responses for both tasks reveals a significant difference between methods (p < 0.001). In both tasks, the results show a considerably better performance of SIE-OMS compared to the other methods. The SIE-OMS method has the highest percentage of correctly identified objects (62.78%) compared to the Edge (19.17%) and Direct (36.83%) methods. Likewise, there is a clear increase in the percentage of success in room identification for SIE-OMS versus the Edge and Direct methods. The number of unanswered responses for our method was also smaller, indicating that there was no difficulty in the comprehension of most of the images. In contrast, it is worth noting the high percentage of unanswered responses for the Edge method, reaching more than 70% of the scenes.
Comparison of mean responses and standard deviations grouped by type of phosphenic image method (Edge, Direct and SIE-OMS), with 95% confidence intervals for the mean difference.
Percentage of correct, incorrect and not answered responses in a single trial. Higher scores in correct responses indicate that subjects were able to identify and recognize the objects and the type of room in each test image; higher ratios of not answered responses indicate that they were not. The general findings are that the SIE-OMS method improves the identification of objects, making it the most effective method. This translates into an increase in the number of correct answers in the room-type identification test for the SIE-OMS method. Results also show that the Edge method is the least effective, with the highest percentage of unanswered images for the two tasks. The test found a significant difference between the SIE-OMS and Direct methods (p<.001). The same conclusion was found between the SIE-OMS and Edge methods (p<.001). Where: *** = p<.001; ** = p<.01; * = p<.05; ns = p>.05. All t-tests paired samples, two-tailed.
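For clarity, the sketch below shows how statistics of this kind can be computed with SciPy: a two-tailed paired t-test on per-subject accuracies and a 95% confidence interval for the mean difference. The accuracy values below are synthetic placeholders; the real per-subject data are available in the figshare dataset referenced above.

```python
import numpy as np
from scipy import stats

# Synthetic per-subject accuracies (18 subjects) for two stimulus methods.
rng = np.random.default_rng(1)
acc_sie_oms = rng.normal(0.63, 0.10, 18)
acc_direct  = rng.normal(0.37, 0.10, 18)

t, p = stats.ttest_rel(acc_sie_oms, acc_direct)          # paired, two-tailed by default
diff = acc_sie_oms - acc_direct
ci = stats.t.interval(0.95, len(diff) - 1,
                      loc=diff.mean(), scale=stats.sem(diff))   # 95% CI of the mean difference
print(f"t = {t:.2f}, p = {p:.4f}, 95% CI of difference = {ci}")
```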
Figs 10 and 11 show the results of the object recognition and room identification tasks for each room type, respectively. As before, when comparing the baseline methods with our approach, the highest number of correct responses is obtained with the SIE-OMS method for all room types. Moreover, the largest difference in results was obtained when comparing the Edge method with the SIE-OMS method (p<0.001) for all room types. However, there was no significant difference for the kitchen type between Direct and SIE-OMS (p = 0.464), while the difference was significant for living room (p<0.05), office (p<0.01) and bathroom (p<0.01). On the other hand, the results of the room identification task for each room type (Fig 11) provide additional support for the SIE-OMS method, since it also has the best percentage of correct responses for each room type, exceeding 85% for bedroom and dining room. As in the object identification task, the kitchen obtained the worst results, followed by the office. Taken together, these findings indicate that the SIE-OMS method was effective in improving object recognition and room identification, and significantly more effective than the baseline methods, Edge and Direct.
Higher scores in correct responses indicate that subjects were able to recognize the objects in each room; higher ratios of non responses indicate that they were not. The SIE-OMS method obtained the highest score of the three methods in all room types compared with the Edge and Direct methods. The results also show that the most difficult room was the kitchen. *** = p<.001; ** = p<.01; * = p<.05; ns = p>.05. All t-tests paired samples, two-tailed.
Higher scores in correct responses indicate that subjects were able to recognize the type of room in each test image; higher ratios of non responses indicate that they were not. The SIE-OMS method obtained the highest score of the three methods for all room types compared with the Edge and Direct methods. As in the identification of objects, the results also showed that the most difficult room was the kitchen. *** = p<.001; ** = p<.01; * = p<.05; ns = p>.05. All t-tests paired samples, two-tailed.
Fig 12 shows four examples of failed and successful tests with the three methods. The two top rows show a bathroom and a bedroom scene where the identification of the objects and the room was successful for all methods. This is due to the location of a characteristic object with a clear silhouette in the center of the image, which also helps in the identification of the room. In contrast, the bottom rows show a kitchen and an office where the recognition of the objects and the identification of the type of room failed in all cases, as a result of the lack of distinguishable shapes (rectangular silhouettes) and the visual clutter.
Some examples of phosphenic images generated with the three methods. Successful images (top rows) and cases of images failed by the subjects (bottom rows) with the three approaches: Edge, Direct and SIE-OMS, respectively.
Performance analysis of SIE-OMS
We also analyzed the performance of the proposed SIE-OMS method. The SIE-OMS system detected all the clearly visible objects of the scenes and even most of the occluded objects that matched the selected classes. Structural edges also improve the performance of our method: recovering the main structure of the room provides a sense of scale and perspective for the objects and hence a better understanding of the 3D scene. Table 2 shows the confusion matrix of room types based on answered images (correct and incorrect responses). Table 3 shows the confusion matrix of room types based on all the images of the test (correct, incorrect and no answer).
Concerning the performance of our method, the recall and the precision are high, reaching in some cases up to 97% (Table 2). The diagonal elements show the number of correct classifications for each class. Most confusions are found in bathroom, living room and office. Office was confused with bathroom because of the similarity in shape between some chairs and toilets. In addition, office was confused with bedroom, since many bedrooms contain study desks. There were other less relevant cases where the dining room was confused with an office, since both contain chairs and tables. This confusion can be explained because the database is from North American locations, while the subjects live in Spain, where apartments commonly join the dining room and the kitchen.
Note that when the unanswered responses are taken into account (Table 3), the recall for the kitchen case decreases significantly (from 88.89% to 48.00%). This means that the kitchen is more difficult to identify. This low performance in kitchen identification is mainly because the information provided turned out to be very limited in this case. For instance, ovens, microwaves and fridges with rectangular masks were sometimes confused with windows, doors or wardrobes (which are object classes not considered by our system).
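For reference, the recall and precision per room type follow directly from the confusion matrices, as in the sketch below. The matrix values are illustrative placeholders, not the numbers reported in Tables 2 and 3.

```python
import numpy as np

rooms = ["bathroom", "bedroom", "dining", "kitchen", "living", "office"]
# Illustrative confusion matrix: rows = true room type, columns = identified room type.
cm = np.array([
    [40,  1,  0,  0,  2,  3],
    [ 1, 44,  0,  0,  1,  0],
    [ 0,  0, 43,  0,  1,  2],
    [ 2,  1,  3, 24,  2,  0],
    [ 3,  0,  1,  0, 38,  1],
    [ 4,  2,  2,  0,  1, 35],
])

recall = cm.diagonal() / cm.sum(axis=1)      # correct / all images of that true room type
precision = cm.diagonal() / cm.sum(axis=0)   # correct / all images identified as that type
for name, r, p in zip(rooms, recall, precision):
    print(f"{name:10s} recall = {r:.2f}  precision = {p:.2f}")
```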
Discussion
The interpretation of the visual information in the phosphene simulation is an important issue due to the limited capabilities of retinal implants. Low resolution, limited dynamic range and a narrow visual field are some of the limitations present in current retinal prostheses [8, 9]. Furthermore, Nanduri et al. [7] showed that phosphenes are not perfectly located in the visual field corresponding to a specific grayscale pixel. Electrically elicited phosphenes change in form and size with increasing amplitude. However, depending on the type of device, the perceptual distortions are affected differently. For example, in retinal prostheses this distortion produces visual effects called comets that might result in a substantial loss of information [77]. Other devices, such as optogenetic technologies, may suffer a loss of temporal resolution, while cortical implants suffer from crosstalk [77]. This results in a loss of visual information which affects patient perception. However, there are research groups using computer vision approaches to try to expand the perceptible visual field in implanted patients and provide useful information in the peripheral vision [78, 79].
Another important limitation of retinal implants is phosphene dropout, which has been reported in retinal prostheses trials [17, 80] as a result of very high threshold values needed to elicit phosphenes in areas with a high number of degenerating nerve cells. Clinical trials by Thompson et al. [81] indicated that the dropout rate has significant effects on the speed and accuracy of recognition tasks. Similarly, Cao et al. [82] showed that the accuracy and efficiency in writing tasks decrease as the variability of distortion and dropout percentage increase.
To overcome the limitations of implants, SPV researchers have tried to optimize the image presentation to deliver effective visual information for daily activities. For example, Vergnieux et al. [32] limited the cues in a virtual scene using different rendering methods, highlighting structural cues such as the edges of different surfaces for navigation. For the same purpose, Perez et al. [35] proposed a phosphene map coding using a ground representation of the obstacle-free space and a ceiling representation based on vanishing lines. Wang et al. [22] proposed two image representation strategies using background subtraction to segment moving elements for object recognition. Similarly, Guo et al. [23] and Li et al. [24] proposed two image processing strategies based on a saliency segmentation technique. For scene recognition, McCarthy et al. [83] presented a visual representation based on intensity augmentation in order to emphasise regions of structural change.
In terms of complex scene understanding, just a few SPV studies have been proposed [24, 84]. It is well established that in realistic environments, which are made of complex scenes, the observer is forced to select relevant elements [85]. This selection is necessary to quickly understand the meaning of a scene as well as for object search. For instance, the set of objects in the environment gives rise to a corresponding set of representations in the observer. Each representation describes the identity, location and meaning of the item it refers to, finally forming a literal representation of the environment. Some research on visual perception has shown that, because gaze fixation changes over short periods of time when an environment is observed, the content of the scene cannot be integrated into a complete and detailed representation [86, 87]. This suggests that such complete and detailed representations are not needed to obtain the meaning of the scene: a small set of object and scene elements is enough to provide access to semantic information [88].
A well-known result in psychophysics highlights that grouping elements in a scene is fundamental for scene understanding [42]. First, the grouping of pixels in a region defines a contour. In many cases, shape alone permits the recognition of objects. Biederman et al. [86, 89] demonstrated that the silhouettes of objects are generally very easy to identify and recognize. The silhouette conveys only part of the visual information needed for the interpretation of an object; concretely, cues such as convexities, concavities or inflections of contours allow the observer to infer the surface geometry [90]. Moreover, this bottom-up perception can be computed first and help any top-down search converge to the right answer. This can help to understand the visual scene through the interpretation of its content. However, in order to fully understand a scene, it is important to identify not only the individual objects comprising the scene but also their relative locations and relations [88]. Based on this idea, the segmentation of the scene into elements with semantic meaning becomes a key point in low vision.
The state of the art in semantic segmentation includes deep learning algorithms. Specifically, FCNs have proven to be successful in various recognition tasks such as the semantic segmentation of images. In this work, we use two FCNs to select and highlight useful information in indoor scenes, namely relevant object masks and silhouettes and structural edges, which recover the main structure of the scene and provide a sense of scale and perspective for the objects. Even though deep learning methods are known for being resource-hungry during training, they can achieve real-time performance for prediction even on mobile or embedded devices [91]. Thus, our method could be easily integrated into an implant device.
The performance of the proposed visual stimulus, the SIE-OMS method, was investigated for object recognition and room identification tasks. We introduced the effect of dropout, with 10% of the phosphenes omitted at random. This effect has been shown in other studies to decrease the performance of subjects in daily tasks, although it is known that performance improves with practice [7, 77, 79, 81]. Our results show that generating phosphene images by extracting specific segments of the scene, such as structural informative edges and object shapes, is effective at improving object recognition and room identification. Moreover, the SIE-OMS method produces a large improvement in object recognition and room-type identification compared to standard methods in SPV. Here, we have taken the pre-trained neural network model of He et al. [55] with its pre-defined classes, without modifying any of them. The pre-defined classes coincided with the classes detected by users with the Direct method, which does not depend on the model of [55], since both methods, SIE-OMS and Direct, are independent. Note the case of the 'window' class, which was detected by users with the Direct method but was not a pre-defined class in the model [55]. We consider large objects, since the scale of the images allowed a complete view of the room and such objects could also be identified by users with the Direct method. However, object appearance alone is not enough for accurate object recognition in certain scenes. Since the only piece of visual data that our system uses for each object is its shape, the introduction of complementary information such as the object label could make recognition easier and avoid confusion between objects with similar shapes. These factors will be considered in future studies for more realistic practice. We also note that structural edge detection is fundamental for performing tasks such as self-orientation and building a mental map of the environment. Finding such structure is crucial for personal mobility with retinal prostheses, where the bandwidth of image information that can be represented per frame is quite restricted. Overall, we can affirm that the perception and comprehension of the scene can be obtained with just a small set of elements represented in the environment.
Conclusion
We present a new visual representation of indoor environments for prosthetic vision, which emphasizes the scene structure and object shapes. By combining the outputs of two FCNs, for structural informative edges and for object masks and silhouettes, we have demonstrated how different scenes and objects can be quickly recognized even under the restricted conditions of prosthetic vision. Our results demonstrate that our method is better suited for indoor scene understanding than traditional image processing methods used in visual prostheses. The key idea of our current results is that, with only a few significant elements of the scene, it is possible to obtain a good perception of the environment, even in complex and occluded scenes. This work can be used to help visually impaired people significantly improve their ability to adapt to the surrounding environment.
References
- 1. Hartong DT, Berson EL, Dryja TP. Retinitis pigmentosa. The Lancet. 2006;368(9549):1795–1809.
- 2. Yu DY, Cringle SJ. Retinal degeneration and local oxygen metabolism. Experimental eye research. 2005;80(6):745–751. pmid:15939030
- 3. Zhou DD, Dorn JD, Greenberg RJ. The Argus® II retinal prosthesis system: An overview. In: 2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). IEEE; 2013. p. 1–6.
- 4. Cheng DL, Greenberg PB, Borton DA. Advances in retinal prosthetic research: a systematic review of engineering and clinical characteristics of current prosthetic initiatives. Current Eye Research. 2017;42(3):334–347. pmid:28362177
- 5. Lovell NH, Hallum LE, Chen S, Dokos S, Byrnes-Preston P, Green R, et al. Advances in retinal neuroprosthetics. Handbook of Neural Engineering. 2007; p. 337–356.
- 6. Luo YHL, Da Cruz L. The Argus® II retinal prosthesis system. Progress in retinal and eye research. 2016;50:89–107. pmid:26404104
- 7. Nanduri D, Humayun M, Greenberg R, McMahon M, Weiland J. Retinal prosthesis phosphene shape analysis. In: 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE; 2008. p. 1785–1788.
- 8. Sinclair NC, Shivdasani MN, Perera T, Gillespie LN, McDermott HJ, Ayton LN, et al. The appearance of phosphenes elicited using a suprachoroidal retinal prosthesis. Investigative ophthalmology & visual science. 2016;57(11):4948–4961.
- 9. Chen SC, Suaning GJ, Morley JW, Lovell NH. Simulating prosthetic vision: I. Visual models of phosphenes. Vision Research. 2009;49(12):1493–1506. pmid:19504749
- 10. Barriga-Rivera A, Bareket L, Goding J, Aregueta-Robles UA, Suaning GJ. Visual prosthesis: interfacing stimulating electrodes with retinal neurons to restore vision. Frontiers in neuroscience. 2017;11:620. pmid:29184478
- 11. Terasawa Y, Uehara A, Yonezawa E, Saitoh T, Shodo K, Ozawa M, et al. A visual prosthesis with 100 electrodes featuring wireless signals and wireless power transmission. IEICE Electronics Express. 2008;5(15):574–580.
- 12. Weiland JD, Liu W, Humayun MS. Retinal prosthesis. Annu Rev Biomed Eng. 2005;7:361–401. pmid:16004575
- 13. Kien TT, Maul T, Bargiela A. A review of retinal prosthesis approaches. In: International Journal of Modern Physics: Conference Series. vol. 9. World Scientific; 2012. p. 209–231.
- 14. Eiber CD, Lovell NH, Suaning GJ. Attaining higher resolution visual prosthetics: a review of the factors and limitations. Journal of Neural Engineering. 2013;10(1):011002. pmid:23337266
- 15. Chen S, Hallum L, Suaning G, Lovell N. A quantitative analysis of head movement behaviour during visual acuity assessment under prosthetic vision simulation. Journal of Neural Engineering. 2007;4(1):S108. pmid:17325409
- 16. Dagnelie G. Visual prosthetics 2006: assessment and expectations. Expert Review of Medical Devices. 2006;3(3):315–325. pmid:16681453
- 17. Humayun MS, Weiland JD, Fujii GY, Greenberg R, Williamson R, Little J, et al. Visual perception in a blind subject with a chronic microelectronic retinal prosthesis. Vision Research. 2003;43(24):2573–2581. pmid:13129543
- 18. Choi C, Choi MK, Liu S, Kim MS, Park OK, Im C, et al. Human eye-inspired soft optoelectronic device using high-density MoS 2-graphene curved image sensor array. Nature Communications. 2017;8(1):1664. pmid:29162854
- 19. Hayes JS, Yin VT, Piyathaisere D, Weiland JD, Humayun MS, Dagnelie G. Visually guided performance of simple tasks using simulated prosthetic vision. Artificial Organs. 2003;27(11):1016–1028. pmid:14616520
- 20. Macé MJM, Guivarch V, Denis G, Jouffrais C. Simulated prosthetic vision: the benefits of computer-based object recognition and localization. Artificial Organs. 2015;39(7):E102–E113. pmid:25900238
- 21. Qiu C, Lee KR, Jung JH, Goldstein R, Peli E. Motion Parallax Improves Object Recognition in the Presence of Clutter in Simulated Prosthetic Vision. Translational Vision Science & Technology. 2018;7(5):29–29.
- 22. Wang J, Lu Y, Gu L, Zhou C, Chai X. Moving object recognition under simulated prosthetic vision using background-subtraction-based image processing strategies. Information Sciences. 2014;277:512–524.
- 23. Guo F, Yang Y, Gao Y. Optimization of Visual Information Presentation for Visual Prosthesis. International Journal of Biomedical Imaging. 2018;2018. pmid:29731769
- 24. Li H, Su X, Wang J, Kan H, Han T, Zeng Y, et al. Image processing strategies based on saliency segmentation for object recognition under simulated prosthetic vision. Artificial Intelligence in Medicine. 2018;84:64–78. pmid:29129481
- 25. van Rheede JJ, Kennard C, Hicks SL. Simulating prosthetic vision: Optimizing the information content of a limited visual display. Journal of Vision. 2010;10(14):32–32. pmid:21191130
- 26. Bourkiza B, Vurro M, Jeffries A, Pezaris JS. Visual acuity of simulated thalamic visual prostheses in normally sighted humans. PloS one. 2013;8(9):e73592. pmid:24086286
- 27. Sommerhalder J, Oueghlani E, Bagnoud M, Leonards U, Safran AB, Pelizzone M. Simulation of artificial vision: I. Eccentric reading of isolated words, and perceptual learning. Vision Research. 2003;43(3):269–283. pmid:12535986
- 28. Dagnelie G, Barnett D, Humayun MS, Thompson RW. Paragraph text reading using a pixelized prosthetic vision simulator: parameter dependence and task learning in free-viewing conditions. Investigative Ophthalmology & Visual Science. 2006;47(3):1241–1250.
- 29. Vurro M, Crowell AM, Pezaris JS. Simulation of thalamic prosthetic vision: reading accuracy, speed, and acuity in sighted humans. Frontiers in Human Neuroscience. 2014;8:816. pmid:25408641
- 30. McKone E, Robbins RA, He X, Barnes N. Caricaturing faces to improve identity recognition in low vision simulations: How effective is current-generation automatic assignment of landmark points? PloS one. 2018;13(10):e0204361. pmid:30286112
- 31. Wang J, Wu X, Lu Y, Wu H, Kan H, Chai X. Face recognition in simulated prosthetic vision: face detection-based image processing strategies. Journal of Neural Engineering. 2014;11(4):046009. pmid:24921713
- 32. Vergnieux V, Macé MJM, Jouffrais C. Simplification of visual rendering in Simulated Prosthetic Vision facilitates navigation. Artificial Organs. 2017;41(9):852–861. pmid:28321887
- 33. Dagnelie G, Keane P, Narla V, Yang L, Weiland J, Humayun M. Real and virtual mobility performance in simulated prosthetic vision. Journal of Neural Engineering. 2007;4(1):S92. pmid:17325421
- 34. Vergnieux V, Macé MJM, Jouffrais C. Wayfinding with Simulated Prosthetic Vision: Performance comparison with regular and structure-enhanced renderings. In: 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society; 2014. p. 2585–2588.
- 35. Perez-Yus A, Bermudez-Cameo J, Lopez-Nicolas G, Guerrero JJ. Depth and Motion Cues with Phosphene Patterns for Prosthetic Vision. In: IEEE International Conference on Computer Vision; 2017. p. 1516–1525.
- 36. Parikh N, Itti L, Humayun M, Weiland J. Performance of visually guided tasks using simulated prosthetic vision and saliency-based cues. Journal of Neural Engineering. 2013;10(2):026017. pmid:23449023
- 37. Zhu H, Meng F, Cai J, Lu S. Beyond pixels: A comprehensive survey from bottom-up to semantic image segmentation and cosegmentation. Journal of Visual Communication and Image Representation. 2016;34:12–27.
- 38. Wang W, Wang Y, Wu Y, Lin T, Li S, Chen B. Quantification of Full Left Ventricular Metrics via Deep Regression Learning With Contour-Guidance. IEEE Access. 2019;7:47918–47928.
- 39. Zhang Z, Duan C, Lin T, Zhou S, Wang Y, Gao X. GVFOM: a novel external force for active contour based image segmentation. Information Sciences. 2020;506:1–18.
- 40. Wertheimer M. Laws of organization in perceptual forms. Psychologische Forschung. 1938; p. 301–350.
- 41. Bruce V, Green PR, Georgeson MA. Visual perception: Physiology, psychology, & ecology. Psychology Press; 2003.
- 42. Hoffman DD, Singh M. Salience of visual parts. Cognition. 1997;63(1):29–78. pmid:9187064
- 43. Minsky M, Papert S. Project MAC Progress Report IV. MIT Press, Cambridge, Mass.; 1967.
- 44. Brice CR, Fennema C. Scene analysis using regions. Artificial Intelligence. 1970;1:205–226.
- 45. Haralick RM, Shapiro LG. Image segmentation techniques. In: Applications of Artificial Intelligence II. vol. 548. International Society for Optics and Photonics; 1985. p. 2–10.
- 46. Adelson EH. On seeing stuff: the perception of materials by humans and machines. In: Human Vision and Electronic Imaging VI. vol. 4299. International Society for Optics and Photonics; 2001. p. 1–12.
- 47. Garcia-Garcia A, Orts-Escolano S, Oprea S, Villena-Martinez V, Garcia-Rodriguez J. A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:170406857. 2017.
- 48. Dai J, He K, Sun J. Instance-aware semantic segmentation via multi-task network cascades. In: IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 3150–3158.
- 49. Sanchez-Garcia M, Martinez-Cantin R, Guerrero JJ. Indoor Scenes Understanding for Visual Prosthesis with Fully Convolutional Networks. In: 14th International Conference on Computer Vision Theory and Applications; 2019. p. 218–225.
- 50. Sun M, Kim Bs, Kohli P, Savarese S. Relating things and stuff via object property interactions. IEEE transactions on pattern analysis and machine intelligence. 2013;36(7):1370–1383.
- 51. Tighe J, Niethammer M, Lazebnik S. Scene parsing with object instances and occlusion ordering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2014. p. 3748–3755.
- 52. Yao J, Fidler S, Urtasun R. Describing the scene as a whole: joint object detection. In: Proceedings of CVPR. Citeseer; 2012.
- 53. Kirillov A, He K, Girshick R, Rother C, Dollár P. Panoptic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019. p. 9404–9413.
- 54. Fernandez-Labrador C, Facil JM, Perez-Yus A, Demonceaux C, Guerrero JJ. PanoRoom: From the Sphere to the 3D Layout. arXiv preprint arXiv:180809879. 2018.
- 55. He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: IEEE International Conference on Computer Vision; 2017. p. 2980–2988.
- 56. Mallya A, Lazebnik S. Learning informative edge maps for indoor scene layout prediction. In: IEEE International Conference on Computer Vision; 2015. p. 936–944.
- 57. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 3431–3440.
- 58. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-assisted Intervention. Springer; 2015. p. 234–241.
- 59. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 770–778.
- 60. Yu F, Seff A, Zhang Y, Song S, Funkhouser T, Xiao J. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:150603365. 2015.
- 61. Girshick R. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision; 2015. p. 1440–1448.
- 62. Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 2117–2125.
- 63. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. Springer; 2014. p. 740–755.
- 64. Fácil JM, Concha A, Montesano L, Civera J. Single-View and Multi-View Depth Fusion. IEEE Robotics and Automation Letters. 2017;2(4):1994–2001.
- 65. Barnes N. The role of computer vision in prosthetic vision. Image and Vision Computing. 2012;30(8):478–479.
- 66. Feng D, McCarthy C. Enhancing scene structure in prosthetic vision using iso-disparity contour perturbance maps. In: 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society; 2013. p. 5283–5286.
- 67. Snaith M, Lee D, Probert P. A low-cost system using sparse vision for navigation in the urban environment. Image and Vision Computing. 1998;16(4):225–233.
- 68. Chang M, Kim H, Shin J, Park K. Facial identification in very low-resolution images simulating prosthetic vision. Journal of Neural Engineering. 2012;9(4):046012. pmid:22766585
- 69. Sanocki T, Bowyer KW, Heath MD, Sarkar S. Are edges sufficient for object recognition? Journal of Experimental Psychology: Human Perception and Performance. 1998;24(1):340.
- 70. Van der Walt S, Schönberger JL, Nunez-Iglesias J, Boulogne F, Warner JD, Yager N, et al. scikit-image: image processing in Python. PeerJ. 2014;2:e453. pmid:25024921
- 71. Irons JL, Gradden T, Zhang A, He X, Barnes N, Scott AF, et al. Face identity recognition in simulated prosthetic vision is poorer than previously reported and can be improved by caricaturing. Vision research. 2017;137:61–79. pmid:28688907
- 72. Lu Y, Wang J, Wu H, Li L, Cao X, Chai X. Recognition of objects in simulated irregular phosphene maps for an epiretinal prosthesis. Artificial organs. 2014;38(2):E10–E20. pmid:24117959
- 73. Hu J, Xia P, Gu C, Qi J, Li S, Peng Y. Recognition of similar objects using simulated prosthetic vision. Artificial organs. 2014;38(2):159–167. pmid:24033534
- 74. Fornos AP, Sommerhalder J, Rappaz B, Safran AB, Pelizzone M. Simulation of artificial vision, III: Do the spatial or temporal characteristics of stimulus pixelization really matter? Investigative Ophthalmology & Visual Science. 2005;46(10):3906–3912.
- 75. Li H, Han T, Wang J, Lu Z, Cao X, Chen Y, et al. A real-time image optimization strategy based on global saliency detection for artificial retinal prostheses. Information Sciences. 2017;415:1–18.
- 76. Quattoni A, Torralba A. Recognizing indoor scenes. In: IEEE Conference on Computer Vision and Pattern Recognition; 2009. p. 413–420.
- 77. Beyeler M, Rokem A, Boynton GM, Fine I. Learning to see again: biological constraints on cortical plasticity and the implications for sight restoration technologies. Journal of neural engineering. 2017;14(5):051003. pmid:28612755
- 78. Li H, Zeng Y, Lu Z, Cao X, Su X, Sui X, et al. An optimized content-aware image retargeting method: toward expanding the perceived visual field of the high-density retinal prosthesis recipients. Journal of neural engineering. 2018;15(2):026025. pmid:29076459
- 79. Cha K, Horch KW, Normann RA. Mobility performance with a pixelized vision system. Vision research. 1992;32(7):1367–1372. pmid:1455709
- 80. Brindley GS, Lewin W. The sensations produced by electrical stimulation of the visual cortex. The Journal of physiology. 1968;196(2):479–493. pmid:4871047
- 81. Thompson RW, Barnett GD, Humayun MS, Dagnelie G. Facial recognition using simulated prosthetic pixelized vision. Investigative ophthalmology & visual science. 2003;44(11):5035–5042.
- 82. Cao X, Li H, Lu Z, Chai X, Wang J. Eye-hand coordination using two irregular phosphene maps in simulated prosthetic vision for retinal prostheses. In: IEEE 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics; 2017. p. 1–5.
- 83. McCarthy C, Feng D, Barnes N. Augmenting intensity to enhance scene structure in prosthetic vision. In: IEEE International Conference on Multimedia and Expo Workshops; 2013. p. 1–6.
- 84. Al-Atabany W, Al Yaman M, Degenaar P. Extraspectral Imaging for Improving the Perceived Information Presented in Retinal Prosthesis. Journal of Healthcare Engineering. 2018;. pmid:29849997
- 85. Boucart M, Tran THC. Object and Scene Recognition Impairments in Patients with Macular Degeneration. In: Object Recognition. InTech; 2011.
- 86. Biederman I. Recognition-by-components: a theory of human image understanding. Psychological review. 1987;94(2):115. pmid:3575582
- 87. Peterson MA, Rhodes G. Perception of faces, objects, and scenes: Analytic and holistic processes. Oxford University Press; 2003.
- 88. Biederman I. On the semantics of a glance at a scene. In: Perceptual Organization. Routledge; 2017. p. 213–253.
- 89. Biederman I, Ju G. Surface versus edge-based determinants of visual recognition. Cognitive Psychology. 1988;20(1):38–64. pmid:3338267
- 90. Koenderink JJ. What does the occluding contour tell us about solid shape? Perception. 1984;13(3):321–330. pmid:6514517
- 91. Wang J, Cao B, Yu P, Sun L, Bao W, Zhu X. Deep learning towards mobile applications. In: IEEE 38th International Conference on Distributed Computing Systems; 2018. p. 1385–1393.