Fig 1.
Representative natural images from which fragments were extracted.
Red rectangles indicate image regions in which the best features were identified. Many other candidate fragments were cut out from each image but are not explicitly indicated (see Materials and Methods). Images with single and double asterisks include fragments in which the best features of multiple recording sites were identified. These fragments include parts of faces.
Fig 2.
Outline of procedure for searching for visual features that explain neural responses.
The physiological dataset comprises neural responses of each site to stimulus images, which are used to construct a neural response vector (right blue-shaded vertical array). Similarities between preprocessed feature candidates and stimulus images provide a table of feature-based responses in which each vertical column corresponds to a feature-based response vector (left green-shaded vertical array). We calculate the correlation coefficients between the neural response vector and all feature-based response vectors to determine the feature with the maximum correlation coefficient (lower panel). Please note that the natural images for extracting feature candidates and the stimulus images for neural recordings are completely independent sets.
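The search step described above can be sketched in a few lines. This is a minimal illustration with made-up toy numbers, not the authors' implementation; the names `pearson`, `best_feature`, `neural`, and `feature_table` are hypothetical, and each row of `feature_table` stands in for one feature-based response vector.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def best_feature(neural, feature_table):
    """Return (index, c.c.) of the candidate with the maximum c.c.

    feature_table[j] is the feature-based response vector of candidate j,
    with one entry per stimulus image (assumed layout).
    """
    ccs = [pearson(neural, column) for column in feature_table]
    best = max(range(len(ccs)), key=ccs.__getitem__)
    return best, ccs[best]


# Toy data: five stimulus images, three feature candidates (illustrative only).
neural = [0.1, 0.9, 0.4, 0.8, 0.2]
feature_table = [
    [0.5, 0.4, 0.6, 0.5, 0.5],
    [0.2, 1.0, 0.5, 0.7, 0.1],
    [0.9, 0.1, 0.6, 0.2, 0.8],
]
best_index, best_cc = best_feature(neural, feature_table)
```

In this toy case the second candidate is selected because its response profile follows the neural response vector most closely.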
Fig 3.
Preprocessing of a stimulus image (top) and an image fragment (bottom). Left panel, a representative stimulus image and a fragment. Right panel, a preprocessed image and a fragment with four orientations and three color matrices (Band 1). Magnitudes of elements in each matrix are given in a gray scale (see Materials and Methods). Please note that the stimulus images were also preprocessed with other scale bands.
Fig 4.
Frequency distribution of c.c. values across feature candidates in four representative sites.
The red line and number with an arrow in each panel indicate the c.c. for the best feature candidate. The red dotted line and number with an arrow in each panel indicate the c.c. at the significance threshold (p = 0.01).
Fig 5.
Representative fragments in five additional feature candidate sets.
(1) Gabor patch set (n = 18); (2) human full-face set (n = 570); (3) monkey full-face set (n = 764); (4) human facial-part set (n = 56,000); (5) monkey facial-part set (n = 56,000); (6) 1K Gabor patch set (n = 1,080). White rectangles in (4) and (5) show regions chosen as feature candidates. To increase the variety of feature candidates, we allowed partial overlapping of feature candidates in the facial-part sets.
Fig 6.
Identification and verification of the best feature for site cc (H1).
(A) The object response tuning curve (Insets, representative objects). The site preferred monkey to human faces on top of a general preference for faces over non-face objects. Horizontal axis shows stimulus images ranked in order of stimulus-evoked responses. Vertical axis shows stimulus-evoked responses of the site. The horizontal broken line indicates no evoked responses. (B) The best feature for site cc (right) and a natural image from which the feature was extracted (left). The feature was extracted from the outlined region in the natural image and was characterized by four edges (arrows and arrowheads) and no color information (α = 1). Please note that the arrows and arrowheads in the left and right panels point to corresponding spatial positions. The left inset shows the face stimulus that evoked the strongest neural response. The red contour indicates the region for which the feature gave the maximum response within the face stimulus. Gabor filter angles are given in the upper right corner. (C) Scattergrams between neural and feature-based responses for subsampled stimulus images (52 among 104 objects). Upper and lower panels are complementary stimulus sets. Data points represent different stimuli. There was a significant correlation between neural and feature-based responses for all four iterations (inset, c.c., r; t-test for Pearson's c.c., p < 0.0001). (D) (Upper panel) Distributions of c.c. for 800 iterations of the delete-half jackknife resampling. The distributions of the top five features among 560,000 candidates are given as representatives. The distribution of the best feature (red) has a mean value statistically significantly different from the mean values of the other distributions (blue) (mean ± s.d. of the best, 0.82 ± 0.03; Welch's t-test, p = 0.0034).
(Lower panel) Results of the stimulus-shuffling test showing distribution of c.c.’s of the best features identified for artificial response vectors generated by shuffling stimulus images against neural responses (mean ± s.d., 0.47 ± 0.03; n = 100). The broken line indicates the value of c.c. with p = 0.01 (c.c., 0.55). The red line indicates the mean c.c. of the best feature for the neural response vectors.
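The two verification steps in panel D can be illustrated with a short sketch. It assumes that each delete-half jackknife iteration keeps a random half of the stimulus images and recomputes the c.c., and that each shuffling iteration permutes the neural responses across stimuli and then repeats the best-feature search; the caption does not spell out these details, so treat them as assumptions. All names and the synthetic data are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def delete_half_jackknife(neural, feature, n_iter, rng):
    """c.c. recomputed on a random half of the stimuli, n_iter times."""
    n = len(neural)
    ccs = []
    for _ in range(n_iter):
        keep = rng.sample(range(n), n // 2)  # drop the other half
        ccs.append(pearson([neural[i] for i in keep],
                           [feature[i] for i in keep]))
    return ccs


def shuffled_best_ccs(neural, feature_table, n_iter, rng):
    """Best c.c. found for artificial (stimulus-shuffled) response vectors."""
    best = []
    for _ in range(n_iter):
        shuffled = neural[:]
        rng.shuffle(shuffled)  # break the stimulus-response pairing
        best.append(max(pearson(shuffled, col) for col in feature_table))
    return best


# Synthetic example: 104 stimuli, 21 candidates (illustrative sizes only).
rng = random.Random(0)
n_stimuli = 104
neural = [rng.gauss(0.0, 1.0) for _ in range(n_stimuli)]
# Candidate 0 tracks the neural responses; the rest are unrelated noise.
feature_table = [[r + rng.gauss(0.0, 0.3) for r in neural]]
feature_table += [[rng.gauss(0.0, 1.0) for _ in range(n_stimuli)]
                  for _ in range(20)]

jackknife_ccs = delete_half_jackknife(neural, feature_table[0], 200, rng)
null_ccs = shuffled_best_ccs(neural, feature_table, 50, rng)
jackknife_mean = sum(jackknife_ccs) / len(jackknife_ccs)
null_mean = sum(null_ccs) / len(null_ccs)
```

Comparing `jackknife_mean` with the spread of `null_ccs` mirrors the comparison between the red line and the shuffled distribution in the lower panel.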
Fig 7.
The best features for three other representative sites: P, hh, and Q (H1).
Conventions are the same as in Fig 6B and 6D. (A) The best feature for site P (mean ± s.d., 0.77 ± 0.03). The mean value of c.c. of the best feature is significantly different from the other feature candidates (Welch's t-test, p = 0.0068). The c.c. of the best feature for artificial response vectors is 0.44 ± 0.03 (mean ± s.d.; n = 100). (B) Identified features for site hh. Seven features are not significantly different in terms of mean c.c. value from the best (distributions in red; mean ± s.d., 0.74 ± 0.03). These were derived from the same region of a natural image (upper left) and are characterized by local orientations (upper right four matrices from the top) and color (fifth matrix from the top in upper right). While those in red solid lines differ over a consecutive scale range, the feature indicated with the red broken line differs in terms of color channel contribution (with α = 0.5 for the indicated feature versus 0.6 for the others). The distributions of four features that are significantly different from the best are shown in blue (Welch's t-test, p = 0.0004). Some of the distributions shown in red overlap and are indistinguishable in this panel. The c.c. of the best feature for artificial response vectors is 0.45 ± 0.03 (mean ± s.d.; n = 100). (C) Identified features for site Q. The best feature originated from a natural image in the upper left and represents a combination of local orientations (mean ± s.d., 0.70 ± 0.04; red broken line). The other three features do not differ significantly in terms of mean c.c. value from the best (red solid line). They originated from a natural image different from that of the best feature and are composed of orientation and color channel components (α = 0.8) (upper right). The c.c. distributions of the identified features differ significantly from the distributions of the others (blue) (Welch's t-test, p = 0.0022). The c.c. of the best feature for artificial response vectors is 0.51 ± 0.02 (mean ± s.d.; n = 100).
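For readers unfamiliar with Welch's t-test, the statistic used to compare two c.c. distributions with unequal variances can be sketched as follows. The numbers are made up for illustration and are not data from the study; the function name `welch_t` is hypothetical, and the sketch stops at the t statistic and the Welch-Satterthwaite degrees of freedom rather than computing a p-value.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    # Squared standard errors of the two sample means (unequal variances).
    va = statistics.variance(a) / na
    vb = statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Made-up c.c. samples for a "best" and an "other" feature candidate.
best = [0.82, 0.80, 0.85, 0.79, 0.83]
other = [0.70, 0.72, 0.69, 0.74, 0.71]
t_stat, df = welch_t(best, other)
```

The p-value would then be read from a t distribution with `df` degrees of freedom, for example with a statistics package.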
Fig 8.
Identification and verification of the best feature for a representative site from which responses to 1,000 stimulus images were recorded.
(A) The object response tuning curve shows characteristics of the neural response vector of site M. (B) The best feature for site M (H3). The image fragment from which the best feature was derived is outlined in red. The feature was matched to the entire stimulus image, where the upper-left part of the feature was matched to complex shapes around the eyes of the monkey, and the upper-right part of the feature was matched to the edge between the monkey face and background at the same area in the stimulus image. The best feature does not require color channels (α = 1.0). Conventions are the same as in Fig 6B. (C) (Upper panel) Distribution of correlation coefficients for 200 delete-half jackknife iterations for the best feature (mean ± s.d., 0.67 ± 0.03). Please note that the standard deviation of the distribution is smaller than in the cases shown in Figs 6 and 7 because of the larger number of stimulus images. (Lower panel) The distribution of the correlation coefficient for 100 artificial response vectors generated by shuffling stimulus images against neural responses (mean ± s.d., 0.00 ± 0.05). The red line indicates the average of the correlation coefficients of the 200 delete-half jackknife iterations for the best feature. (D) Scattergrams between the neural and feature-based responses for the training (upper) and test (lower) sets. (E) The c.c. values for the training and test sets for all examined sites (n = 16).
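The training and test comparison in panels D and E can be illustrated with a minimal sketch: the best feature is selected on the training stimuli only, and its c.c. is then measured on held-out test stimuli. The equal random split used here is an assumption, since the caption does not state how the sets were divided, and all names and data are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def select_on_train_score_on_test(neural, feature_table, rng):
    """Pick the best candidate on the training half, score it on the test half."""
    n = len(neural)
    order = list(range(n))
    rng.shuffle(order)
    train, test = order[: n // 2], order[n // 2:]  # assumed 50/50 split

    def take(vector, ids):
        return [vector[i] for i in ids]

    train_ccs = [pearson(take(neural, train), take(col, train))
                 for col in feature_table]
    best = max(range(len(train_ccs)), key=train_ccs.__getitem__)
    test_cc = pearson(take(neural, test), take(feature_table[best], test))
    return best, train_ccs[best], test_cc


# Synthetic example: 200 stimuli, 31 candidates (illustrative sizes only).
rng = random.Random(1)
neural = [rng.gauss(0.0, 1.0) for _ in range(200)]
feature_table = [[r + rng.gauss(0.0, 0.5) for r in neural]]  # informative
feature_table += [[rng.gauss(0.0, 1.0) for _ in range(200)]
                  for _ in range(30)]                        # noise
best_index, train_cc, test_cc = select_on_train_score_on_test(
    neural, feature_table, rng)
```

Because the test stimuli play no part in the selection, `test_cc` is not inflated by the search over many candidates.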
Fig 9.
Comparison between c.c. values for the best features extracted from the standard set and those extracted from the other sets.
Different symbols represent different fragment sets, while different points with the same symbol represent different sites (n = 16 for H1 and H3; n = 19 for H2). The red lines parallel to the horizontal and vertical axes indicate the c.c. at the significance threshold (c.c. = 0.35 for A-C, 0.12 for D; p = 0.01). (A) The result from H1. The numbers of feature candidates are 560,000, 18, 570, and 764 for the standard set, Gabor patch set, human full-face set, and monkey full-face set, respectively. (B-D) The numbers of feature candidates are 560, 1,080, 570, and 764 for the standard set, 1K Gabor patch set, human full-face set, and monkey full-face set, respectively. For the standard set, feature candidates (n = 560) were randomly subsampled from the full standard set of 560,000 candidates. The c.c. value for the best fragment (vertical axis) is the average over 10 different subsampled sets. The results from H1, H2, and H3 are given in B, C, and D, respectively. Please note that in H3 (D), the neural response vectors are divided into training and test sets, and the result from the test set is given, as in Fig 8.
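The subsampling used for panels B-D can be sketched as below: a fixed number of candidates is drawn at random from the full set, the best c.c. within that subset is found, and the result is averaged over several independent draws. The set sizes are scaled down for illustration, and the function and variable names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def mean_best_cc_over_subsamples(neural, feature_table, subset_size,
                                 n_sets, rng):
    """Average of the best c.c. over n_sets random candidate subsets."""
    best_ccs = []
    for _ in range(n_sets):
        subset = rng.sample(range(len(feature_table)), subset_size)
        best_ccs.append(max(pearson(neural, feature_table[j])
                            for j in subset))
    return sum(best_ccs) / n_sets


# Scaled-down toy sizes: 500 candidates, subsets of 50, 10 subsampled sets.
rng = random.Random(2)
neural = [rng.gauss(0.0, 1.0) for _ in range(100)]
feature_table = [[rng.gauss(0.0, 1.0) for _ in range(100)]
                 for _ in range(500)]

full_best_cc = max(pearson(neural, col) for col in feature_table)
subsampled_cc = mean_best_cc_over_subsamples(
    neural, feature_table, subset_size=50, n_sets=10, rng=rng)
```

By construction, the subsampled value can never exceed the best c.c. of the full set, which is one reason the candidate count is matched across sets before they are compared.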
Fig 10.
Columns in the face domains detect global and local structures of faces.
The best fragments and the region within the best face stimulus where the fragments had the best match (red rectangles) are indicated for 16 sites in H1 (A), 19 sites in H2 (B) and 16 sites in H3 (C). The format of the feature representation is given in the upper right (box with broken line in blue). In cases in which color information is critical, the relative weight between orientation and color channels, α, is indicated. Multiple fragments are shown for sites, such as Q, in which several fragments do not differ significantly. Fragments detecting a local part and those detecting the global structure were grouped by eye and are shown separately on the left and right. Sites for which both types of fragment were detected are shown on both the left and right.
Fig 11.
Site P detects a top-heavy configuration of faces typically consisting of two eyes and a nose below.
The pictures represent eight faces that evoked the best (left) to eighth best (right) neural responses in site P. The red contour indicates the region for which the feature gave the maximum response, and arrows and an arrowhead point to critical points of the feature (see Fig 7A).
Fig 12.
Predicted responses to upright and inverted faces provide a possible explanation of inversion effects.
Upper panel, mean feature-based responses to eight monkey (Mo) and eight human (Hu) faces in upright and inverted conditions for site cc (H1). Error bars indicate standard deviation. Responsive regions in the representative faces are given below in red. Lower panel, average of absolute differences between responses to monkey and human faces in upright (left) and inverted conditions (right). Error bars indicate the standard error of the mean. The numbers indicate p-values (Wilcoxon signed-rank test).
Fig 13.
Scattergram between the sparseness index and c.c. of the best feature for 16 sites in H1 (A) and 19 sites in H2 (B). Different points represent different sites.
Fig 14.
The mean c.c. value of the best features for all recording sites in H1 that include sites responding to non-face images.
The color indicates different sub-regions of IT cortex defined by similarity in object responses, such as red for face domain, blue for monkey-body domain, and green for anti-face domain [11]. Particular object categories could not be identified for the other domains [11].