Modeling second-order boundary perception: A machine learning approach

Visual pattern detection and discrimination are essential first steps for scene analysis. Numerous human psychophysical studies have modeled visual pattern detection and discrimination by estimating linear templates for classifying noisy stimuli defined by spatial variations in pixel intensities. However, such methods are poorly suited to understanding sensory processing mechanisms for complex visual stimuli such as second-order boundaries defined by spatial differences in contrast or texture. We introduce a novel machine learning framework for modeling human perception of second-order visual stimuli, using image-computable hierarchical neural network models fit directly to psychophysical trial data. This framework is applied to modeling visual processing of boundaries defined by differences in the contrast of a carrier texture pattern, in two different psychophysical tasks: (1) boundary orientation identification, and (2) fine orientation discrimination. Cross-validation analysis is employed to optimize model hyper-parameters, and demonstrate that these models are able to accurately predict human performance on novel stimulus sets not used for fitting model parameters. We find that, like the ideal observer, human observers take a region-based approach to the orientation identification task, while taking an edge-based approach to the fine orientation discrimination task. How observers integrate contrast modulation across orientation channels is investigated by fitting psychophysical data with two models representing competing hypotheses, revealing a preference for a model which combines multiple orientations at the earliest possible stage. Our results suggest that this machine learning approach has much potential to advance the study of second-order visual processing, and we outline future steps towards generalizing the method to modeling visual segmentation of natural texture boundaries. This study demonstrates how machine learning methodology can be fruitfully applied to psychophysical studies of second-order visual processing.


Introduction
Many of the most common functions of sensory systems involve detection, identification or discrimination of particular stimuli. For example, a critically important ability of early visual processing is to segment images into distinct regions, which may correspond to cohesive surfaces or objects [1]. To better understand the underlying mechanisms, it would be useful to fit biologically plausible models of surface segmentation to psychophysical data, and assess how well such models can account for independent datasets.
A widely used approach to modeling human psychophysics has been to assume that performance can be understood in terms of how well a visual stimulus matches an internal "template", typically modeled as a linear spatial filter. This approach uses high-dimensional regression models fit to data from tasks in which a subject detects or classifies a sensory stimulus presented with superimposed white noise [2][3][4][5], to recover an estimate of the linear filter that best accounts for the data. This system-identification (SI) approach to psychophysics, often called "psychophysical reverse correlation" or "classification image analysis", parallels neurophysiology studies that fit linear filter models to neural responses elicited by white noise stimuli [6,7]. Psychophysical SI methods have been applied in varied problem domains and yielded valuable insight into perceptual mechanisms [4,[8][9][10][11][12][13].
One limitation of most previous psychophysical system identification studies has been the use of stimuli that are typically defined by spatial or temporal variations in first-order statistics like luminance or color. However, many natural visual stimuli such as occlusion boundaries (e.g., the top image patch in Fig 1a, showing animal fur occluding a forest floor) are defined by spatial variations in higher-order image statistics, e.g. texture or contrast, as well as simple luminance differences [14][15][16][17][18]. Such "second-order" boundaries (Fig 1a, bottom) have been of particular interest because their detection cannot be explained by simple linear filters [19,20], and consequently human performance cannot be readily modeled using conventional psychophysical SI methods. More generally, there is a need for a psychophysical modeling approach that can handle a broader range of stimuli and tasks.
Here we present a novel machine learning method to extend psychophysical SI to more complex naturalistic stimuli such as second-order boundaries. Our general approach is to fit image-computable hierarchical neural network models of second-order vision (Fig 2) directly to psychophysical trial data, similar to recent studies using multi-stage models to characterize nonlinear responses in sensory neurons [21][22][23][24]. Our application of this framework generalizes a class of biologically plausible "Filter-Rectify-Filter" (FRF) models that have often been invoked to explain second-order processing [20]. However, in contrast to previous general schemes for second-order vision [25][26][27] and quantitative models [28,29], we take a machine learning approach in which we estimate the defining model parameters by direct fitting to psychophysical trial data. This permits more incisive tests of predictions on novel datasets in order to validate our models, and to evaluate competing models.
We focus on contrast-defined second-order boundaries (Fig 1a and 1b) since they are well studied psychophysically [30][31][32][33][34], and there are neurons in the early mammalian visual cortex that give selective responses to contrast modulation stimuli [35][36][37][38]. Performance on two tasks making use of contrast-defined boundaries is examined: (1) identification of boundary orientation, and (2) discrimination between two slightly off-vertical boundaries (Fig 1c). The results demonstrate that subjects utilize spatial information in markedly different ways for these two tasks, employing "region-based" processing for the orientation identification task and "edgebased" processing for the orientation discrimination task, in both cases consistent with an ideal observer.
We further demonstrate the power of this methodology for evaluating competing hypotheses regarding the nature of channel summation within the FRF scheme [29,[39][40][41] by comparing the ability of two different model architectures to fit human performance on a contrast edge detection task in which the carrier pattern contains elements with two orientations. Posthoc Bayesian model comparison [42][43][44] demonstrates a preference for a model in which carrier pattern orientation is integrated across channels by the second-stage filters, consistent with previous psychophysical studies [40].
Finally, we discuss potential applications of this methodology to future quantitative studies of first-and second-order cue integration [45][46][47] and natural boundary detection [18,48]. We suggest that the application of machine learning methodology similar to that presented here can potentially help to better quantify visual processing of natural occlusion boundaries and complex natural visual stimuli more generally.

Results
Observers performed two different psychophysical tasks making use of contrast-modulated boundary stimuli like those shown in Fig 1 (see Methods). In the first task, orientation-identification (Fig 1c, left), observers determined in which of two oblique orientations (-/+ 45 deg, L/ R) a near-threshold contrast boundary was presented, with varying amounts of contrast modulation (Experiments 1 and 3). We also employed an orientation-discrimination task (Fig 1c, right) in which observers indicated the orientation (L/R) of a supra-threshold contrast boundary, whose orientation was perturbed slightly from vertical by varying amounts (-/+ 6-7 deg.) (Experiment 2). In all experiments, stimuli were presented centrally, subtending 4 degrees of visual angle (dva), with envelope phase randomized (0 or 180 deg.) across trials. Trial-wise responses obtained from these experiments were used to fit and evaluate a model (Fig 2) implementing a "Filter-Rectifty-Filter" (FRF) operation [20] on each stimulus image, with the goal of characterizing the nature of spatial summation for perception of contrast boundaries, as well as the shape of the second-stage filter nonlinearity. We describe the model in greater depth before presenting the experimental results.

Modeling framework
The model used to fit psychophysical trial data (Fig 2) employed a first stage comprised of a bank of fine-scale Gabor filters that compute local oriented energy in a manner similar to V1 complex cells [49]. The second stage consisted of a pair of filters, representing the two task alternatives (L/R), each instantiated by a weighted sum of the normalized outputs of the firststage filters. In contrast to the first-stage filters, these second-stage filters analyze the image at a more coarse spatial scale, integrating variations in contrast across the boundary. Finally, the outputs of these two second-stage filters are passed through a static non-linearity h(u) = |u| α and then compared to decide on a trial-by-trial basis if a given stimulus boundary appeared more likely to be tilted left or right (L/R), i.e. the probability 0 < P(R) < 1 of the stimulus being classified by the observer as R. The main goal of Experiments 1 and 2 was to estimate the shapes of the second-stage filters. The filter shapes are not specified beforehand, but rather are learned by fitting actual psychophysical data, with the goal of revealing how the observer spatially integrates texture information.
To reduce model parameter space dimensionality, we performed downsampling between the first-and second-stage filters, i.e. pooling of spatially adjacent first-stage filter responses.
Here we compared three ways of implementing downsampling within each locality: (1) taking the simple average (AVG), (2) taking the largest value (MAX) [50], and (3) taking just one of the values, e.g. the center one (sub-sampling, SUB) [26]. In Experiment 1 we considered all three pooling rules; most analyses utilized 16x16 downsampling, but 3 downsampling sizes (22x22, 16x16 and 8x8) were compared. In Experiments 2 and 3 we considered only AVG pooling and a single downsampling size (Experiment 2: 22x22, Experiment 3: 12x12).
We implement the model's final (L/R) classification in two alternative ways: In the deterministic model (DET), if the output of the R filter is larger than that of the L filter (plus the bias term) so that P(R) > 0.5, the stimulus will be classified as R, otherwise it is classified as L. In the stochastic model (STO), the output P(R) specifies only the probability of classifying the stimulus as R or L, thus introducing randomness to the model without requiring an additional free parameter.
A common approach to make the estimation problem more tractable is to utilize reasonable prior constraints on the model parameter space [51][52][53]. Here we use two kinds of Bayesian priors: (1) ridge-regression [44,54] to penalize large weights (ridge), and (2) ridge-regression combined with an additional smoothness prior [51] to penalize excessively "rough" secondstage filter weight profiles (ridge + smooth). Model performance was evaluated by using different randomly selected partitions (called folds) of the data for training and testing, a standard practice in machine learning (i.e., k-fold cross-validation [51]). Additional details of the model are given in Methods.

Experiment 1: Boundary orientation identification
Stimuli and task. Observers viewed contrast-defined texture boundaries with varying degrees of modulation depth (Fig 1b), and performed forced-choice judgments of whether the boundary was oriented right-or left-oblique (45 degrees clockwise or counterclockwise, Fig 1c  left). In Experiment 1-VAR (7 observers), a method of constant stimuli was used to vary contrast modulation depth over a range spanning values near threshold but also including some levels producing near-perfect, or chance, performance. In another version (Experiment 1-FIX, 3 observers) the contrast modulation depth was held constant at a near-threshold level.
Model fit to representative observers. Fig 3a shows the results of fitting the model (16x16 downsampling, AVG pooling) to data from one representative observer (AMA) in Experiment 1-VAR. The top panels show estimated second-stage filter weights as image maps  across the spatial extent of the stimuli, with pixel color indicating weight values. The black dashed lines in the bottom plots show mean 1-D profiles, obtained by collapsing the 2-D weight maps orthogonally to the boundary orientations, whereas thick solid lines (red = ridge, green = ridge + smooth) show profiles obtained from over 30 bootstrap-resampled training sets. We see that these mean 1-D profiles are nearly indistinguishable from one another, all resembling a step-like function (see below). While many of the individual 2-D weights can be quite variable (S7 Fig), the 1-D weight profile magnitudes lie well away from zero (Fig 3, thin dashed lines show +/-1 SEM). The addition of the smoothing prior (ridge + smooth) results in weighting functions that are very similar, but less "noisy" in appearance, consistent with previous results with classification images and receptive field modeling [51,52]. Furthermore, all three pooling rules (AVG, MAX, SUB) yielded consistent 2-D and 1-D weight profiles (S8 Fig).
Fig 3b shows the model fit to data from a representative observer (JJF) in Experiment 1-FIX. In this version of the experiment, the contrast modulation depth was kept constant at approximately threshold level for each observer (S1 Table), similar to most previous work in the classification image literature [5,8]. Here we see very similar results as for the observer in Experiment 1-VAR, also showing a step-like function for this task. As before, this result was robust to the pooling rule (S9 Fig).
To assess the extent to which observers were employing the same spatial integration as an ideal observer, we fit the model to stimulus category labels rather than observer responses. The 2-D weight maps and 1-D profiles obtained for an ideal observer in Experiment 1-VAR (Fig  3c), reveal that the ideal weight map is a step-like function like that seen in the human results. Similar results for an ideal observer fit to stimulus category labels in Experiment 1-FIX are shown in S10 Fig. Edge-based vs. region-based analysis. When finding a boundary between adjacent textures, the visual system might make use only of texture elements close to the boundary ("edgebased"), or it might utilize the entirety of both adjacent texture regions ("region-based"). A number of previous studies have investigated this question for textures defined by orientation differences [55][56][57], but to the best of our knowledge this issue has not been addressed systematically for contrast-defined textures. The results from two individual observers (Fig 3a and  3b) as well as an ideal observer (Fig 3c) indicate that the entire texture regions on either side of the boundary are contributing to performance, giving step-like 1-D profiles-all indicative of "region-based" processing.
To examine the robustness of this result across observers and for different model variants, Fig 4 shows the 1-D averaged profiles of second-stage filter weights for all observers (thin colored lines) and group averages (thick black lines) in Experiment 1-VAR (Fig 4a) and Experiment 1-FIX (Fig 4b). We see that the magnitudes of the 1-D weights are approximately constant as one moves from the diagonal to the boundary, indicative of region-based processing, and that these results are also consistent for all three pooling rules (AVG, MAX, SUB). To explore whether the step-like 1-D profile result is consistent with an ideal observer, plots of the normalized (scaled to unit norm) ideal observer 1-D weight profiles (30 bootstraps) are shown in Fig 5a as thick black lines, with 95% confidence intervals (+/-1.96 SEM) plotted as dashed black lines. We see that the normalized 1-D weight profiles for individual observers (dashed color lines), and for means over observers (thick red dashed lines) generally lie within the 95% confidence intervals of the ideal observer. Therefore, the performance of the human observers is very much consistent with that of the ideal observer, both of which give similar step-like profiles and thus seem to integrate contrast information across the whole region.
To test whether our finding of region-based processing was robust to increasing the carrier spatial frequency, we ran an additional control version of the experiment on 3 naïve observers (Experiment 1-HFC) in which we reduced the micropattern size by a factor of 4, from 32 to 8 pixels (0.5 dva. to 0.125 dva), while increasing the density (to 8192 micropatterns) but holding overall RMS contrast constant (at 14%). Such a stimulus would potentially engage many more contrast-modulation tuned neurons selective for higher carrier spatial frequencies, thus having receptive fields more localized to the boundary [38] and therefore potentially revealing edgebased processing. However, even with the high-frequency carrier we still observed regionbased processing (Fig 5b), consistent with an ideal observer.
Second-stage filter nonlinearity. In Experiment 1, the best-fit expansive nonlinearity exponent α for the second-stage filters is greater than unity for all observers, for both the ridge and ridge + smooth priors.

Analysis of model accuracy
To better assess the meaningfulness of the above model-fitting results, it is important to make quantitative measures of the fitted models' accuracy in predicting observers' responses to novel stimuli. Such measures can also address the relative merit of different variants of the estimation procedure (ridge vs ridge + smooth prior) and of the model details, such as downsampling method (AVG, MAX, SUB) and classification rule (DET, STO). In this section we describe some of our analyses of model performance, and refer to other assessments in the Supplementary Material.
Psychometric function agreement. To the extent that the model architecture is appropriate and the data are sufficient to obtain accurate estimates of parameters, then the fitted model should be able to predict each observer's responses. Consequently, we constructed psychometric functions of proportion correct on the orientation identification task as a function of the modulation depth across the boundary for individuals observers in Experiment 1-VAR-three representative observers are shown in Fig 6 (AVG downsampling). The model predictions (green, red) on the test datasets (which were not used to train models or optimize hyperparameters) generally lie within or near the 95% binomial proportion confidence intervals (+/-1.96 SEM) of observer performance (Fig 6, blue dotted lines). However, the deterministic model (DET, left column) tends to over-predict performance at larger contrast modulations for some observers. Implementing the model stochastically (STO, right column) improved the model-observer agreement. Overall performance agreement. Another way to assess how well fitted models account for observer responses is to compare their overall performance (i.e. proportion correct averaged over all modulation depths tested) with observer performance. Such an analysis succinctly characterizes model accuracy for each observer and model variant with a single number (as opposed to an entire psychometric function).   28; ridge + smooth: 23/28) of test folds having significantly different (binomial proportion difference test, 95% CI) proportions correct (Fig 7b, top). Despite the systematic over-prediction of observer performance (by about 6-8%), we still observe significant correlations between predictions and actual performance across observers (ridge: r = 0.983, p < 0.001; ridge + smooth: r = 0.972, p < 0.001, N = 7) and folds (ridge: r = 0.872, p < 0.001; ridge + smooth: r = 0.806, p < 0.001, N = 28).
Implementing the classification stochastically produced much better agreement (AVG-STO: Fig 7a, middle), with relatively few (ridge: 4/28; ridge + smooth: 6/28) test folds having significantly different proportions correct (Fig 7b, middle). As before, strong correlations between model and observer performance are seen across observers (ridge: r = 0.996, p < 0.001; ridge + smooth: r = 0.997, p < 0.001, N = 7) and folds (ridge: r = 0.890, p < 0.001; ridge + smooth: r = 0.911, p < 0.001, N = 28). The over-prediction of observer performance for the DET model with AVG + MAX pooling is due to the DET mode of operation not accounting for internal noise on a trial-by-trial basis. By contrast, this noise is accounted for when these model is operating in STO mode, since the model is explicitly optimized to predict the proportion of correct responses at each level. The degradation in model performance for SUB-DET pooling (leading to better agreement with the observer) may be due to the fact that this model is getting less information from the first-stage filters than AVG or MAX pooling, which integrate over all first-stage filters instead of a relatively small subset of them. In an additional analysis of Experiment 1-FIX, we systematically varied the downsampling rate for all three pooling rules (S12 Fig). We find that when operating in DET mode, for low rates (8x8) the SUB model drastically under-predicts observer performance, while for higher rates (16x16, 22x22) the SUB model performance improves and more closely matches observer performance. By contrast, we see in S12 Fig. that performance of the DET mode AVG and MAX models consistently over-predicts performance for all down-sampling rates, since regardless of down-sampling rates both pooling rules integrate over all of the first-stage filters rather than a limited subset, making them highly robust to random variations in micropattern placement.
Double-pass analysis. To obtain an upper bound on the performance of the model, we carried out a "double-pass" analysis [4,58,59] by testing observers twice on the same set of stimuli, and comparing model-observer agreement (proportion of trials with consistent responses) on the double-pass set to observer-observer agreement between the two passes. For 5 observers in Experiment 1-VAR, a fold of 1000 stimuli was used as the double-pass set, and the model was trained on the remaining 3000 stimuli. The results (Fig 8) show that the AVG-DET or MAX-DET models agree with the observer on the two passes, indicating that the observer is self-consistent on the two passes. For AVG-DET, both priors give a small, statistically insignificant (rank-sum test) median difference (ridge: median difference = 0.022, p = 0.063; ridge + smooth: median difference = 0.009, p = 0.313, N = 5), with most or all of the (binomial proportion difference) confidence intervals bracketing zero (ridge: 4/5; ridge + smooth: 5/5). Similarly, MAX-DET gives small median differences (ridge: median difference = 0.018, p = 0.125; ridge + smooth: median difference = 0.02, p = 0.313, N = 5) with most confidence intervals (ridge: 4/5; ridge + smooth: 4/5) including zero. However, SUB-DET (Fig  8a, bottom panels) gives somewhat larger differences (ridge: median difference = 0.052, p = 0.063; ridge + smooth: median difference = 0.024, p = 0.063, N = 5), although they fail to reach statistical significance. In the SUB-DET case with the ridge prior (red), 0/5 confidence intervals include zero, while 3/5 do so for ridge + smooth (green). Implementing the decision rule stochastically (AVG-STO) yielded somewhat poorer double-pass agreement (Fig 8b), although the difference fails to reach significance across the set of observers (ridge: median difference = 0.075, p = 0.063; ridge + smooth: median difference = 0.088, p = 0.063, N = 5). For AVG-STO, we find that for both priors, 0/5 confidence intervals contain zero.
Why does the stochastic (STO) decision rule (with AVG/MAX pooling) lead to better prediction of observer performance (Fig 7), and yet worse double-pass agreement (Fig 8) than the deterministic (DET) rule? To understand why, suppose for a given right-oblique stimulus the observer gave the response (R) on the first trial, and due to internal noise is 80% likely to respond R on the second trial, yielding a self-consistent response (R, R) on both passes. Suppose that for this stimulus the output of the model is P(R) >0. 5. Then the DET model will always return R, and therefore be in agreement with the observer's choice on the second pass 80% of the time. In contrast, the STO model will return R on only a subset of the trials, which will be uncorrelated with the observer's responses. Therefore, the STO model will have a lower rate of double-pass agreement, even if it better matches the overall proportion correct.
Decision variable correlation analysis. Although a strong model-observer agreement for the percentage correct is consistent with accurate modeling of the observer's visual processing, it is not fully compelling. The model and observer might both choose category A or B the same proportion of trials, without necessarily exhibiting any correlation between observer and model responses on a trial-by-trial basis. This situation could arise, for example, if the observer and model were utilizing different stimulus features for their decisions. To more thoroughly quantify model performance, we implemented a decision variable correlation (DVC) analysis, which estimates the trial-by-trial correlation between a model's decision variable and the observer's hypothesized decision variable [60]. This method extends standard signal detection theory by hypothesizing that the joint distribution of model and observer decision variables for each stimulus category (A = left-oblique/L, B = right-oblique/R) can be modeled as bivariate Gaussian distributions having correlation parameters ρ A , ρ B . Using the proportions of trials on which the observer and model decisions agree for each stimulus category, ρ A and ρ B are estimated using a simple maximum likelihood procedure [60].
DVC analysis is most accurate (as measured by bootstrapped estimates of SEM) for large numbers of trials at a constant stimulus level sufficiently near threshold to give consistent proportions of the four logical possibilities (model says A/B, observer says a/b) [60]. Therefore we applied DVC analysis to Experiment 1-FIX, which entailed thousands of trials (6000) at a single stimulus level for each of three observers (CJD, MAK, JJF). The model was trained on 5000 trials and the DVC computed using the remaining 1000 trials, for 6 choices of hold-out fold. Since the two boundary orientations provide independent estimates of the DVC, this provided 12 estimates of DVC from each observer. Fig 9 shows the resulting DVC values (plotted in rank order within each set, for graphical clarity) measured for AVG downsampling and both For AVG downsampling, 32/36 (ridge) and 33/36 (ridge + smooth) estimates of DVC are significantly (> 1.96 SEM) larger than zero, with DVC mean +/-SD of 0.234 +/-0.094 (ridge) and 0.254 +/-0.095 for (ridge + smooth). MAX downsampling gives similar results, where 33/ 36 (ridge) and 35/36 (ridge + smooth) estimates of DVC are significantly greater than zero, with DVC mean +/-SD of 0.245 +/-0.094 (ridge) and 0.263 +/-0.086 (ridge + smooth). Interestingly, SUB downsampling gives weaker correlations with only 20/36 (ridge) and 24/36 (ridge + smooth) estimates of DVC significantly greater than zero, and DVCs have mean +/-SD of 0.147 +/-0.082 (ridge) and 0.164 +/-0.085 (ridge + smooth).
Therefore, we see that not only are the models and observers in agreement about overall proportions of each category, but they are also correlated in their trial-by-trial decisions, consistent with the interpretation that the models and human observers are performing similar kinds of processing.
Ideal-observer analysis. To compare the spatial summation and task performance of human observers to those of an ideal observer, we trained the model on the true stimulus category labels rather than observer classifications of these stimuli. Notably, the resulting ideal observer models make use of a region-based spatial summation similar to that of human observers (Fig 3c). Examining this more systematically, a comparison of human performance to that of the deterministic (16x16 AVG-DET) ideal observer in Experiment 1-VAR (S14 Fig) revealed that across 6 observers analyzed, most confidence intervals (binomial proportion difference test, 95% confidence interval) did not contain zero (ridge: 6/24; ridge + smooth: 3/24), with those containing zero coming mainly from two observers (AMA, SRS). In contrast, for the stochastic ideal observer (AVG-STO) many more confidence intervals contained zero (ridge: 11/24; ridge + smooth: 10/24), suggesting a better agreement with observer performance. Similar results were obtained for the 3 observers tested in Experiment 1-FIX (S14 Fig).
Comparing human to deterministic ideal observer (AVG-DET) performance, most confidence intervals did not contain zero (ridge: 5/18; ridge + smooth: 3/18), while for the stochastic ideal observer most did contain zero (ridge: 12/18; ridge + smooth: 15/18). On the whole, this result is similar to our findings for the models fitted to observer data-i.e. a general tendency for the AVG or MAX models operating deterministically to over-predict performance, while these models operating stochastically seemed to agree reasonably well.
Correlations between observer performance and model parameters. The different human observers exhibited substantial individual differences in overall performance as measured by proportion correct (Fig 7), raising the question whether there might be corresponding variations in the fitted model parameters. We assessed this possibility for the results in Experiment 1-VAR, using the 16x16 AVG model.
Although the nonlinearity was always expansive (exponent α > 1.0), there was no compelling evidence across the set of individuals that the magnitude of α explained individual variations in performance, based on Pearson's r (ridge: r = -0.458, p = 0.302; ridge + smooth: r = -0.787, p = 0.036, N = 7). The significant result for the ridge + smooth case was due to the presence of an outlier-computing these correlations using Spearman's rho revealed no significant effect (ridge: rho = -0.357, p = 0.444; ridge + smooth: rho = -0.214, p = 0.662). Results were similar for MAX and SUB models (S5 Table).
However, individual variability in observer performance was very strongly related to the magnitudes of the fitted second-stage filter weights. That is, correlating the sum of the norms of the filter weights with overall performance across the 7 observers yielded strong, significant positive correlations (ridge: r = 0.970, p < 0.001; ridge + smooth: r = 0.993, p < 0.001), with similar results for MAX, SUB models (S6 Table). Consistent with this, the ideal observer had a larger filter norm (ridge: 5.67; ridge + smooth: 5.37) than the human observers (ridge: median = 4.30, min = 2.36, max = 4.59; ridge + smooth: median = 4.04, min = 2.27, max = 4.34, N = 7), as revealed by a sign-rank test (ridge: p = 0.018; ridge + smooth: p = 0.018).
In our model, the inverse internal noise level is represented implicitly by the magnitudes (vector norms) of the second-stage filters (see Methods for details). Therefore, we should expect observers with larger second-stage filter norms to also have better double-pass agreement, which is a function of internal noise. We find that there are indeed strong, significant positive correlations between filter norm and double-pass agreement for all pooling rules and choices of prior (AVG-ridge: r = 0.954, p = 0.012, ridge + smooth: r = 0.942, p = 0.017; MAX-ridge: r = 0.964, p = 0.008, ridge + smooth: r = 0.956, p = 0.011; SUB-ridge: r = 0.940, p = 0.017, ridge + smooth: r = 0.943, p = 0.016). Taken together with significant positive correlations between observer performance and double-pass agreement (r = 0.949, p = 0.014), we conclude that variations in filter norm magnitudes across observers are related to variations in internal noise levels across observers, and this variability in internal noise levels contributes to difference in observer performance.

Model consistency
The perceptual filters obtained in Experiment 1 (Fig 3) seem consistent with the idea that observers are monitoring the entire stimulus region for each of the two possible boundary orientations. However, to rule out the possibility that such filters could also be observed if observers were utilizing some different kind of spatial processing, we implemented four alternative varieties of spatial summation as ideal observer models fit to category labels. We then fit the model architecture in Fig 2 to data from these ground truth ideal observers in order to assess whether our methodology is consistent in the sense that it recovers the actual set of filters generating the data.
One possible alternative spatial summation that could mediate our task would be to simply monitor two adjacent "pizza slice" shaped quadrants defined by the four compass directions (2-slice). For instance, if one is monitoring the North and East slices, a difference in contrast between these slices is diagnostic for a right-oblique boundary. Implementing an ideal observer that utilizes such a subset of the spatial information (Fig 10a, top) and fitting our model architecture shown in Fig 2 to this data reveals consistent results: A single filter which monitors two slices only. Implementing an observer who monitors three adjacent slices ("3-slice"), for instance N+E and N+W (Fig 10a, second row) also reveals consistent filters. Another possible spatial processing approach would be to observe all slices but to only monitor for one of the two potential boundary orientations ("1-filter")-simulating this model yields a consistent estimate of a single filter tuned to the monitored boundary (Fig 10a, third  row).
One can in principle observe filters somewhat similar to those we recover in Fig 3 by using a sub-optimal spatial processing model which only monitors a single pair of potentially informative slices, chosen at random on each trial. However, such a model leads to perceptual filters which are extremely "noisy" compared to those we find (Fig 10a, bottom row), so it is quite unlikely that our observers are actually using such spatial processing. The simulated ideal observer that produces filters providing the best match to those we obtain from human observers (Fig 3) makes use of two filters, each of which monitors the entire stimulus region for each of the two stimulus possibilities (Fig 10b).
Summary and interpretation. From our analysis of model accuracy, we may draw several broad conclusions. Firstly, accuracy is not greatly affected by choice of prior, with similar results obtained using the ridge and ridge + smooth priors. Also, from our analyses we cannot favor a particular choice of pooling rule among those tested (AVG, MAX, SUB). Almost identical overall results are obtained using AVG and MAX pooling, and although in some cases we see lower accuracy using SUB pooling, our results suggest this can be ameliorated by simply increasing the down-sampling rate (S12 Fig). In Experiment 1-VAR, where the stimulus levels were fixed for all observers (rather than set at their individual thresholds), we observed great variability between observers in both task performance and double-pass agreement. This variability is reflective of differences in internal noise across observers, and is captured by our model. Finally, simulations using ideal observers reveal that our modeling process is consistent, accurately recovering the perceptual filter shapes which generate the data (Fig 10). The strongest agreement with our human results (Fig 3) is obtained by an observer monitoring the whole stimulus region for each of the two possible stimulus alternatives. Simulated ground-truth observers (left columns) and filter shapes recovered (right columns) from simulated datasets generated by these observers. (a) Ideal observers implementing various sub-optimal spatial filtering models for Experiment 1. Top row: Simulated observer monitors two adjacent "pizza slice" shaped regions (2-slice). Second row: Simulated observer monitors three adjacent regions (3-slice). Third row: Simulated observer only monitors one boundary (1-filter). Bottom row: Simulated observer randomly monitors 1 of 4 pairs of informative adjacent pizza slices (2-slice-random). (b) Ideal observer implementing a spatial filtering model comprised of two perceptual filters, each monitoring one potential boundary (2-filter). The recovered filters (right) are most similar to those observed in our results (Fig 3). https://doi.org/10.1371/journal.pcbi.1006829.g010

Experiment 2: Boundary orientation discrimination
The first experiment demonstrated that human observers employ region-based processing for the orientation identification task. To see if our modeling method could also reveal situations where observers employ edge-based processing, we utilized a fine orientation discrimination task (Fig 1c, right). The orientation of a second-order boundary was perturbed slightly (JND at 32% contrast modulation-see Methods) from vertical, and observers judged the direction of perturbation (i.e. slightly left-vs right-oblique). In one variant of this task (Experiment 2-VAR), contrast modulation was varied in 9 logarithmic steps from 16% to 64% centered at 32% (plus zero modulation) to vary the task difficulty. In a second variant (Experiment 2-FIX), contrast modulation was fixed at 32%. In both variants of Experiment 2, the perturbation of orientation was fixed near its JND for each observer at 32% contrast modulation (CJD: 7 deg., JJF: 6 deg., VHB: 7 deg.). In the data analysis, a higher downsampling factor (22 x 22) was used to better reveal fine spatial details, and only AVG pooling was considered, since Experiment 1 demonstrated substantial robustness of spatial summation to pooling rules (S8 and S9 Figs). Fig 11 shows the second-stage filter weights from an observer (JJF) tested in both Experiment 2-VAR (Fig 11a) and Experiment 2-FIX (Fig 11b). Unlike in Experiment 1, here observers employ an "edge-based" summation, assigning substantially more weight to texture regions near the boundary. Bootstrapped significance plots for individual weights are shown in S15 Fig, and consistent results from other observers are shown in S16 Fig. The edge-based processing employed by these observers is very similar to that obtained from an ideal observer (S17 Fig). As with Experiment 1, we find a reasonably strong agreement between model predictions and observed performance on sets of novel data not used for model parameter estimation (S18 Fig), and find an expansive second-stage filter nonlinearity for Experiment 2 (S7 Table). Taken together, Experiments 1 and 2 show that our modeling methodology can reveal different kinds of spatial processing that observers may employ for different kinds of psychophysical tasks.
We also considered the possibility that our results in Experiment 2 (Fig 11) could also be explained by two off-orientation filters defined on the spatial scale of the whole stimulus [61]. We defined an ideal observer with two off-orientation filters (+/-20 deg.) defined on the scale of the stimulus and simulated responses of this observer to the stimuli used in Experiment 2. This analysis yielded very different filters from those we observe in Fig 11, as we simply recovered the original off-orientation filters (S22 Fig), again demonstrating the consistency of our method (as in Fig 10). Based on these results, we suggest that it is more likely that the relevant neural mechanisms employed in Experiment 2 are contrast-modulation tuned neurons with smaller receptive fields localized to the boundary (see Discussion).

Experiment 3: Integrating contrast modulation across orientation channels
In Experiments 1 and 2, we considered a particular FRF model architecture and focused on characterizing the spatial processing that observers employ in different second-order vision tasks. However, one of the most powerful applications of this methodology is model comparison-i.e. comparing the ability of alternative models, embodying different assumptions about an FRF model architecture, to account for psychophysical data. Indeed, a major focus of much of the second-order vision literature has been to design experiments whose express goal is to test competing hypotheses of FRF architecture [29,[39][40][41]62].
We applied our method to ask how second-stage filters for contrast modulation integrate orientation modulations across multiple kinds of first-stage filters or channels. In many previous computational studies of second-order processing, the FRF models were comprised of second-stage filters, each of which only analyzes a single orientation/spatial frequency channel [28,63,64], with subsequent "late summation" of their signals. However, other psychophysical studies have suggested "early summation" FRF models, in which second-stage filters integrate contrast modulation across texture channels [39,40,62].
To address this question systematically, we used a boundary orientation identification task as in Experiment 1, but with the carrier pattern containing two orthogonal orientations of micropatterns (Fig 1a, bottom), which would be processed by distinct first-stage orientationselective channels. Since the nature of spatial summation is not critical to this question, we used a coarser downsampling factor (12 x 12), and as above, only AVG pooling was used. We considered two models of how information is integrated by the second-stage filters. In a "late summation" Model 1 (Fig 12a, top), each first-stage channel is analyzed by its own pair of second-stage filters (L/R oblique). Alternatively, in an "early summation" Model 2 (Fig 12a,  bottom), each second-stage filter receives inputs from both first-order channels. As in Experiments 1 and 2, we find that both the models are in reasonably good agreement with observer performance on the task (S20 Fig). Also there was reassuring consistency in that both the fitted models exhibited similar region-based weight maps, with an expansive second-order nonlinearity (S8 Table). When comparing Models 1 and 2 using standard model comparison techniques such as the Bayes Information Criterion, or BIC [42,44], there is no need to correct for overfitting since both models have the same number of parameters. This is because when parameter space dimensionality is equal, the complexity-penalizing terms of the BIC cancel, so that Δ BIC = ln That is, when two models are of equal complexity and equally likely a priori, we simply choose the one with higher log-likelihood of the data. The difference in log-likelihoods (Δ BIC ) is identical to the natural log of the Bayes factor B 21 ¼ p 2 ðDÞ p 1 ðDÞ . Standard effect size conventions state that when 2 ln B 21 > 10 (corresponding to > 150-fold higher probability of Model 2), the preference is considered to be "very strong" [43]. We see in Fig 12b that for all three observers tested in Experiment 3, the evidence (measured by 2 ln B 21 ) in favor of the orientation opponent model (Model 2) exceeds the standard criterion (10) for a very strong preference. Ideal observer analysis also revealed a very strong preference for the orientation-opponent model fit to stimulus category labels (2 ln B 21 = 478.4). With a large number of trials, a strong preference for one binomial response model does not require a huge difference in predicting observer performance (Appendix A). We see in Fig 12d that although Model 2 very consistently does a better job of matching observer performance than Model 1, that the differences between predicted proportions correct are fairly small.
We also wanted to compare the predictive abilities of the two models using a cross-validation approach, by evaluating their agreement with observer performance on data not used for training. Therefore we performed a bootstrap analysis in which both models were fit on randomly selected samples of 4000 stimuli and tested on the remaining 1000 (50 bootstraps). To optimize the generalization performance, each model was trained using the value of the ridge regression hyperparameter λ which yielded the best generalization (S19 and S21 Figs). The results in Fig 12c show a consistent preference across observers for Model 2 (magenta line). Performing a 3 x 2 factorial ANOVA (3 observers, 2 models, 50 observations each) on the test set log-likelihoods revealed significant main effects of both observer (F 2,294 = 201.085, p < 0.001) and model (F 1,294 = 39.811, p < 0.001), but no significant interaction (F 2,294 = 3.001, p = 0.051). In summary, both analysis approaches (Fig 12b and 12c) show a strong and consistent preference for the early summation version (Model 2).

Discussion
Second-order stimuli present a challenge for computational models of visual processing, as they do not contain spatial variations in stimulus intensity and thus cannot be detected by simple linear filtering operations [19]. Consequently, standard "classification image" techniques [2,5], which assume a linear spatial summation model, are not readily applicable to studies of second-order visual processing. Here we define a simple hierarchical neural network implementing a Filter-Rectify-Filter (FRF) model of second-order visual processing [20,25,26,65,66]. We estimate its parameters directly from psychophysical trial data, and show that it can accurately predict observer performance on novel test data not used for training. We demonstrate the utility of this modeling approach by testing whether observers segment contrastdefined boundaries using an edge-or region-based processing, finding support for regionbased segmentation in a boundary orientation-identification task (Experiment 1), and edgebased processing for a boundary discrimination task (Experiment 2). In addition, we show how this methodology can adjudicate between alternative FRF model architectures (Experiment 3).

Model complexity vs. biological realism
Psychophysical system identification is critically limited by the modest amount of data that can be collected in an experiment. Although constraints or priors can help alleviate the demand for data [51,67,68], it is also often necessary to fit simplified models requiring fewer parameters, which omit many known biological details. Here we discuss several modeling simplifications which may need to be addressed when extending this methodology to more complicated second-order stimuli and tasks.
First-stage filters. In FRF models like those considered here, the initial stage of filtering and rectification is meant to model computations most commonly believed to be taking place in visual area V1 [69]. We modeled the first-stage filters using a standard energy model of V1 complex cells with normalization [49,70]. Although for our stimuli a front-end comprised entirely of oriented energy filters was perfectly adequate, for many texture segmentation tasks it may be necessary to model half-wave rectified filters (both even and odd phase) as well. For instance, human observers can segment second-order images using cues of carrier contrast phase and polarity [25,71], which is not possible using only energy filters, whose responses are invariant to phase and polarity. Likewise, it may also be desirable to include center-surround "difference-of-Gaussian" (DOG) filters as well, since early visual cortex has many neurons with non-oriented receptive fields [72][73][74]. Human psychophysics has indicated that non-oriented filters may potentially serve as a mechanism for polarity-based segmentation [41], and computer vision work on natural image segmentation found it advantageous to include nonoriented as well as oriented filters [17].
In Experiments 1 and 2, we implemented only a single spatial frequency and orientation channel matched to the carrier, and in Experiment 3 we only considered two such channels. This simplification neglects the possible contributions of off-orientation and off-spatial frequency channels to task performance. Indeed, previous psychophysical studies have challenged the notion that second-order mechanisms are narrowly tuned for a single first-order channel, with many studies suggesting broad bandwidth for carrier spatial frequency [29,75,76]. On the other hand, single unit neurophysiology has demonstrated neurons that are selective for the carrier spatial frequency of CM stimuli, both in macaque V2 [38] and in cat Area 18 [77].
Second-stage filters. The main focus of our study was demonstrating an approach to estimate the properties (shapes and nonlinearities) of a set of second-stage filters which detect higher-order visual features. For our task involving contrast-modulated boundaries (Fig 1c), the simplest possible model involves two second-stage filters representing the two potential stimulus categories, much like the model shown in Fig 2. However, for psychophysical tasks involving more complex second-order stimuli such as natural occlusion boundaries (Fig 1a, top) defined by multiple cues which differ on opposite sides of a boundary, it may be necessary to have both a wider variety of first-stage filters as well as multiple kinds of second-stage filters sensitive to different feature combinations (Fig 13). In addition, it may be pertinent to allow direct connections between the first stage of filtering (albeit for much lower spatial frequencies) and the output (decision) stage, since natural region boundaries contain both first-and second-order cues [14,[16][17][18] which can be integrated in psychophysical tasks [46,78].
An additional limitation of our model (and many other FRF variants) is the assumption of only a single stage of higher-order filtering beyond the first stage. Recent neurophysiological and fMRI studies have suggested that texture selectivity may develop progressively at successive stages of the visual hierarchy from V1 to V2 to V4 [79][80][81][82][83][84]. In studies using convolutional neural networks to solve image recognition tasks, texture-sensitive units often develop gradually over several processing layers [85][86][87][88]. Furthermore, computer vision work on boundary detection has also found it useful to employ additional "texton" layers of processing [17]. Therefore, it may be of interest for future work to consider models having additional intermediate layers of processing.
A number of previous studies have considered how information from an array of firstorder visual filters may be pooled for perceptual decisions [11,89]. In our analysis, we considered three different pooling rules inspired by previous work in computational vision science.
We found that the spatial processing for the second-order vision tasks we examined was highly robust to choice of pooling rule (S8 and S9 Figs), and that we could model our data accurately with different choices of pooling rule (Fig 7, S12 and S13 Figs). Therefore, we remain agnostic about the "best" model of pooling between the first and second stage filters, and suggest that this may be an interesting topic for future investigation.
Nevertheless, despite these limitations and our numerous modeling simplifications, in practice we found that for the tasks and the stimuli used here our simple FRF model (Fig 2) was able to capture human performance quite well (Figs 6-9).

Comparison with previous psychophysical modeling
Despite the simplifications made in the present study, it represents a substantial methodological advance over previous efforts to model second-order visual perception. Although previous studies have developed image-computable FRF models, they have generally fixed not only the model architecture but also the shapes of the second-stage filters. The present study demonstrates how the second-stage filter shapes can be "learned" directly from experimental data in order to reveal the nature of spatial integration used in different tasks (Experiments 1 and 2).
Most previous FRF models in human psychophysics studies were conceptual or qualitative, without quantitative fitting of model parameters to data. Two recent studies [28,90] estimated free parameters of an FRF model by fitting threshold data as functions of stimulus parameters. Since thresholds are typically determined at only a few levels of a one-dimensional independent variable, this yields relatively few data points, greatly limiting the number of model parameters one can estimate. In our work, we directly estimate hundreds of model parameters by fitting stimulus-response data from thousands of psychophysical trials, enabling us to characterize the shapes of the second-stage filters. In addition to extending the FRF modeling literature, our work also complements and extends previous classification image studies, which have generally been applied in the context of first-order vision [4,5]. Although classification images yield image-computable predictive models of observer responses, relatively few studies validate these models by testing their predictive performance on a set of stimuli not used for model estimation. We tested model performance on novel stimuli not used for model estimation using both standard measures, e.g. percentage correct, as well as more advanced methods such as double-pass analysis [58,59] and decision variable correlation [60].
Finally, in contrast to most classification image studies, here we systematically address the issue of model comparison. Psychophysical system identification can be formulated in the framework of estimating a generalized linear model [51,67,68], which is unique up to choice of link function. Therefore, issues of model comparison (beyond coefficient shrinkage) do not naturally arise in this context. However, the FRF models relevant to second-order vision can vary greatly in architectural details (e.g., number of filtering layers, nonlinearities, connectivity, etc.), and model comparison is of great interest. In Experiment 3, we compare two competing FRF architectures making different assumptions about how the second-stage filters integrate across orientation channels, finding a preference for a model with early summation of first-order channels by the second-stage filters, consistent with previous psychophysical work [39,40,76].

Edge-versus region-based segmentation
We applied our methodology to the question of whether or not observers take a region-based or an edge-based approach to segmenting a contrast boundary (Experiment 1). To the best of our knowledge, our experiment is the first time the issue of edge/region-based processing has been addressed for boundaries defined by differences in contrast, as it has for other secondorder boundaries [55][56][57].
This question is of particular interest in light of several studies that have revealed possible neurophysiological mechanisms for contrast-boundary detection [36][37][38]91]. This work has demonstrated a population of contrast-modulation (CM) tuned neurons in early visual cortex, which exhibit band-pass spatial frequency and orientation tuning to second-order modulation envelope frequency, consistent with a Gabor-shaped second-stage filter applied to the outputs of a bank of first-stage filters selectively tuned to the carrier pattern [19].
If such CM-responsive neurons were the mechanism responsible for psychophysical detection of CM boundaries, one might expect the 1-D weighting profiles to have a Gabor-like shape, placing greater weight on regions near the boundary. However, we find that this was not the case and that all areas of the image are weighted equally for the orientation-identification task (Experiment 1, Figs 3 and 4). Furthermore, in an additional control manipulation where we increased the carrier frequency in order to make sure CM neurons with smaller RFs would be relatively more engaged (Experiment 1-HFC), we still did not see any change in the type of spatial integration (Fig 5). These findings are consistent with at least three distinct possibilities (not mutually exclusive) for the underlying neural mechanisms. One possibility is that task performance reflects CM-responsive neurons, like those described in V2 with large receptive fields (> 4 dva) encompassing the entire stimulus used in our experiments. Such neurons would prefer a carrier frequency much higher than their envelope frequency, which is plausible since primate V2 CM-responsive neurons can have preferred carrier:envelope ratios of up to 40:1 [38]. A second possibility is that psychophysical performance is mediated by a range of Gabor-like second-order mechanisms covering a range of spatial scales, and that our measured 1-d profiles reflect an ensemble of their contributions. Another possibility is that an important role is played by other neurons sensitive to texture patterns, such as those described in ventral stream areas [81][82][83]92].
Clearly, the fact that we can readily distinguish non-juxtaposed textures implies that there must be mechanisms for texture representation that operate in the absence of a boundary. However, in some cases texture segmentation can benefit from boundary information, as in the textural Craik-Cornsweet illusion [55]. For example, [57] demonstrated that direct juxtaposition of textures having different orientations gives rise to local junction or corner cues that can support segmentation. Such enhanced segmentation might be mediated by neurons such as those described by [93,94], which are selective for such cues.
However, other work on texture segmentation has shown a minimal contribution of information along the boundary. A study using natural image patches in a task of discriminating occlusion boundaries from uniform textures found that removing the boundary region had little effect on performance [18]. However, performance on natural occlusion edge and junction detection tasks can be seriously degraded when textural information, which is by its nature defined over a region, is removed [18,48]. Similarly, segmentation of complex textures defined by letters shows little effect of boundary information [56].
In general, the answer to this question is most likely dependent on the particular stimulus and the nature of the psychophysical task, as indicated by the very different results of our first two experiments. In our view, the question of "edge" versus "region"-based processing should be treated with some caution, since clearly for a very large stimulus extending into the far periphery there will be reduced weighting at the largest eccentricities, since visual acuity falls off sharply outside the fovea. In such a case the real question of interest would be the shape of the fall-off in weight as a function of stimulus eccentricity, and how this fall-off may relate to putative neural codes. Our method is well suited to address such issues for many different kinds of second-order boundaries and tasks, much as standard classification images can be similarly used for first-order boundaries.
We further demonstrate that our modeling methodology is consistent, meaning that it can accurately recover the actual perceptual filters employed in a task, provided the fitted and generating models have corresponding architectures. Simulation of ideal observers implementing sub-optimal spatial processing, e.g. utilizing only limited subsets of the texture, revealed filters which do not match those we observed experimentally (Fig 10a). In contrast, a strong agreement was found with results from simulating an observer monitoring the whole stimulus for each of two possible boundary orientations (Fig 10b), strengthening our interpretation that this is the kind of spatial summation actually being employed.
In Experiment 1 our results showed that human observers use every part of the image to perform the task (Fig 3), consistent with an ideal observer that integrates information from all image regions (Fig 10b). In terms of possible mechanisms, we might therefore expect CMtuned neurons with larger receptive fields that integrate over the expanse of the stimulus image to be engaged in Experiment 1. The results in Experiment 2 showed that only the texture information near the boundary was being utilized by humans (Fig 11), again consistent with an ideal observer that is indifferent to the stimulus elsewhere. This suggests that the most useful neural substrate in Experiment 2 might be CM-responsive neurons with smaller receptive fields localized near the boundary, that integrate texture only over small extents to either side of the boundary.
We demonstrate using simulation that one cannot explain the filters we observe in Experiment 2 by assuming two large off-orientation receptive fields (S22 Fig), since otherwise we would have recovered filters with slightly non-vertical orientation rather than filters localized to the vertical (Fig 11). We conjecture that our results are more consistent with the possibility that observers are making use of contrast-modulation tuned neurons having smaller receptive fields localized to the boundary.
In both Experiments 1 and 2 we found that humans behaved as ideal observers with regard to their spatial integration of texture. Such a result should not be regarded as inevitable or having little relevance for questions of neural mechanism. Previous psychophysical studies (including several classification image studies) have demonstrated that humans often do behave ideally in perceptual tasks [12,[95][96][97][98] and furthermore such ideal behavior can be suggestive of neuronal mechanisms underlying the behavior [99][100][101]. However in many other cases, human psychophysical performance does not match that of an ideal observer [2,3,8,96,98,102,103]. Interestingly, [96] demonstrated that even with the same stimulus, changes in the psychophysical task could alter whether or not human observers behaved ideally. We feel that the demonstration of ideal behavior in Experiments 1 and 2 is significant, and strongly suggests possible neural mechanisms, for example suggesting a greater involvement of neurons with larger CM-responsive receptive fields in the task of Experiment 1, and those with smaller receptive fields in Experiment 2.
In general, the relationship between neural response properties and the perceptual filters obtained in classification image studies is complex [4,104]. Perceptual filters do not generally correspond to fixed, "hard-wired" neural populations, but are most likely actively constructed from available neuronal populations through perceptual learning. Consistent with this idea, previous work has demonstrated that observers' perceptual summation gets successively closer to that of the ideal observer with practice [10,105]. It would be of great interest for future research to explore the dynamics of the perceptual filtering strategies that observers employ in our tasks.

Integration of heterogeneous first-stage channels
A common ingredient to the various FRF models proposed to account for psychophysical texture segmentation data is a set of fine-scale first-stage filters summed by one or more secondstage filters defined on a much larger spatial scale. However one aspect in which such models differ is whether the individual second-stage filters analyze first-stage filter responses at a single orientation/spatial frequency, or whether they integrate across heterogeneous first-stage channels. In one standard model of FRF processing, each second-stage filter only analyzes the outputs of a single first-stage orientation/spatial frequency channel [25,63,64]. However a number of psychophysical results have called this model into question. One study using a subthreshold summation paradigm found support for a model utilizing linear integration across carrier orientation for detecting contrast modulations [40]. Another study making use of energy-frequency analysis revealed the existence of orientation-opponent mechanisms in second-order vision [62]. Other work has suggested at least partly independent mechanisms for detecting second-order modulations defined by contrast and spatial/frequency orientation, with the suggestion that contrast modulations may be pooled across first-stage channels [39].
In Experiment 3 we find support for a model with second-stage filters that integrate across first-stage orientation channels, consistent with the latter psychophysical studies. Although CM-responsive neurons described so far exhibit narrow tuning for carrier orientation [38], this discrepancy could be due to the use of grating carriers rather than multi-orientation micropatterns. This suggests that that possible effects of carrier orientation bandwidth might be worth future investigation, both in human psychophysics and for CM-responsive neurons.
In our opinion, Experiment 3 best demonstrates the power of this methodology compared to traditional linear classification image approaches: The ability to explore a much larger space of non-linear perceptual models embodying various assumptions about neural mechanisms.
Rather than simply comparing recovered filter shapes to ideal behavior as in most previous classification image studies [5], here we directly compare two biologically plausible hypotheses of how contrast modulation is integrated across orientation channels. Our finding of clear support for an early summation model [40] suggests that an interesting direction for future neurophysiological investigation would be to examine neural responses to contrast modulations of carrier patterns containing multiple orientations. Such a study would be quite novel, as previous neurophysiological investigations of contrast modulation tuning have only considered a single carrier orientation [35,38].

Conclusions
Statistical machine learning has in recent years become a powerful and widely applied methodology in vision and in computational neuroscience. Our work demonstrates that the same machine learning methodology used to characterize neural coding [21,52,88,106,107] and first-order visual processing [5,51] can be fruitfully applied to extend psychophysical system identification approaches to complex visual stimuli, including second-order boundaries. We anticipate that future applications of this approach may lead to a better understanding of how multiple first-and second-order cues are combined by human observers to detect natural boundaries, and provide insights into neural and computational mechanisms of image region segmentation.

Experimental paradigms
Stimuli and psychophysical task. Contrast-modulated texture boundary stimuli were created similarly to those of [28], by multiplying a large (predominantly low spatial frequency) contrast modulation envelope by a high spatial frequency carrier (texture) pattern (Fig 1b). The modulation envelope was a circular region, half of which had a larger value (giving higher texture contrast) and the other half had a smaller value (producing lower texture contrast). This pair of half-discs was oriented either right oblique (clockwise) or left-oblique (counterclockwise). The boundary between the two regions was tapered smoothly, as was the outer edge of the circular region, so that it smoothly decayed into the background (see stimulus images in Fig 1). The functional form of the tapering was a half-cosine going from 1 to -1 over a proportion (0.2) of the stimulus width, as used previously [28,90,108]. Such tapering was a precaution against first-order artifacts.
The carrier patterns were produced by adding odd-phase Gabor micropatterns at random locations in an over-size larger image, which was subsequently cropped to the desired size. Only texture patterns having sufficient uniformity in root-mean-squared (rms) contrast were utilized, following the uniformity criterion of [28]. The rms contrast of the carrier patterns was set to 14%, which is 6 dB above previously observed average carrier detection thresholds [108]. Examples of right-oblique boundaries with varying carrier mircopattern densities and depths of contrast modulation are shown in Fig 1b. Boundaries between micropattern textures like these have been used in previous psychophysics studies [28,90], and offer flexibility to produce a wide variety of naturalistic texture patterns amenable to parametric variation.
In all experiments, observers performed a single-interval 2AFC task to determine whether a briefly displayed (100 msec) boundary stimulus was contrast-modulated in a right-or left-oblique orientation. Envelope phase was randomized across trials, to preclude its use as a cue. Visual feedback was provided after each trial to maximize observer performance. Data was collected over multiple (4)(5) sessions of 30-60 minutes, conducted over several days. An initial training session was dedicated to learning the orientation identification task (+/-45 degree) by performing a short version (300 trials) at various micropattern densities (256, 1024, 2048 patterns or 512, 1024, 2048 patterns). Consistent with previous studies on second-order edge detection [28,90], we found in preliminary experiments that task performance was best for carrier texture patterns having a high density of micropatterns. Therefore we used relatively high micropattern densities (2048 per 256x256 image for Experiments 1, 2 and 512 for Experiment 3), higher than in previous studies [28], with the goal of optimizing observer performance.
Psychophysical observers. Observers were author CJD and 7 members of his research group (adult undergraduate students) who were initially naïve to the experimental purposes. For Experiment 1, 4 additional observers were recruited from two Psychology classes (EXP-3202, PSB-4002) at Florida Gulf Coast University (FGCU). All observers had normal or corrected-to-normal acuity. All observers provided written informed consent and all procedures were approved beforehand by the IRBs at FGCU (Protocol 2014-01) and McGill University (Protocol MED-B-96-279), in accordance with the Declaration of Helsinki.
Displays and experimental control. FGCU experiments were run in one of two windowless vision psychophysics chambers with no light sources other than the computer. Chamber 1 contained a factory-calibrated (gamma-corrected) LCD (Cambridge Research Systems, Dis-play++, 120Hz, 1920x1080 pixels, 100 cd/m 2 ). Chamber 2 contained a gamma-corrected CRT (SONY FD-500 Trinitron, 75 Hz, 1024x768 pixels, 59.2 cd/m 2 ) driven by a Bits# stimulus processor (Cambridge Research Systems). In each chamber the display was controlled by a Dell OptiPlex 9020 with an NVIDIA GeForce GTX 645 graphics card. Observers viewed the stimuli binocularly, with a chin rest, at a distance of 133 cm, so that the 256x256 stimulus image subtended approximately 4 deg. visual angle (dva). McGill data was collected from author CJD for Experiments 2 and 3. Stimuli were displayed on a CRT (SONY Multiscan G400, 75 Hz, 1024x768 pixels, 53 cd/m 2 ) with gamma correction through the color lookup tables. This monitor was driven by an Apple Mac Pro 4,4 (2.26 GHz Quad Core) hosting an NVIDIA GeForce GT 120 graphics card. Binocular viewing was at a distance of 125 cm, so that the stimulus subtended 4 deg, and experiments were conducted in a room with normal lighting. We observed remarkable consistency in our results from the three setups, despite differences in monitor characteristics and room lighting conditions, so all data was pooled for analysis. Experiments were controlled using custom-authored software written in the MATLAB programming language (MathWorks, Inc.), making use of routines from the PsychToolbox Version 3.0.12 [109][110][111]. To speed up experiments, carrier patterns were pre-computed and stored at 8 bit precision on the host computer. A unique set of pre-computed carrier patterns was used for each experiment, so that a given observer never saw the same carrier pattern more than once. Experiment 1: Orientation identification. Stimuli were 256 x 256 pixels in size with a carrier pattern comprised of 2048 randomly placed, vertically oriented (0 degrees) Gabor micropattern stimuli (32 pixels or 0.5 dva., 1:1 aspect ratio). In this task, the two stimulus alternatives had right/left-oblique (+/-45 deg.) boundaries (Fig 1b and 1c). In the main version of Experiment 1, for 7 observers, contrast modulation depths were varied across 9 logarithmically spaced levels from 8% to 32%, centered at 16%, plus 0% modulation (Experiment 1-VAR), using a method of constant stimuli. This set of levels resulted in trials at or near threshold for most observers, while still including some relatively easy trials. After an initial training session, observers each performed a total of 4000 trials (4 blocks of 1000 trials, 100 trials per level) over the course of multiple sessions spaced over several days. For one observer (CJD), 2975 micropatterns were used instead of 2048.
For 3 observers (CJD and naïve observers MAK and JJF), contrast modulation depths were fixed at a value slightly above each observer's measured threshold (Experiment 1-FIX), with the goal of obtaining near-threshold task performance of approximately 80% correct (S12 Fig). This is more typical of most classification image literature [5], which often fixes stimulus strengths at or near task performance thresholds. These observers performed 6000 trials (6 blocks of 1000 trials) over multiple sessions spaced across several days.
As an additional variant of Experiment 1, we used a carrier texture composed of micropatterns with a much higher spatial frequency (8 pixels or 0.125 dva.) and a higher density (8192), with rms contrast remaining at 14%. Experiment 1-HFC was run on three naïve observers (LJW, JMM, VHB) tested at variable modulation contrasts (same levels as Experiment 1-VAR). Experiment 2: Fine orientation discrimination. In order to show the general applicability of our methodology to tasks where observers may employ an edge-based spatial summation, we ran a fine orientation discrimination task in which observers determined whether a clearly visible contrast-modulated boundary (same carrier parameters as in Experiment 1) was tilted slightly left-vs right-oblique relative to vertical. Experiment 2 is complementary to Experiment 1, in which the orientation difference is very large (+/-45 degrees from vertical), while the contrast modulation is near detection threshold, whereas in Experiment 2 the contrast modulation is well above detection threshold, but the orientation difference is very small. This experiment was run on 2 observers, one of whom (JJF) was naïve to the purpose of the experiment. In a preliminary experiment at 32% contrast modulation, a method of constant stimuli was used to measure a just-noticeable-difference (JND) for orientation discrimination (approximately 75% performance) for each observer (JJF: 6 deg., CJD: 7 deg.). Then in subsequent sessions, observers performed the fine orientation discrimination at this value for a set of stimuli whose contrast modulation was varied from 16% to 64% (center = 32%) with a method of constant stimuli. Observers performed 6000 trials of the task over multiple sessions spaced over multiple days (6 blocks of 1000 trials, 100 trials per level). We refer to this experiment as Experiment 2-VAR. For two naïve observers (VHB, JJF) we also ran Experiment 2-FIX, where the orientation difference was fixed at each observer's JND. Experiment 3: Orientation identification with a more complex carrier texture. Experiment 3 was the same as Experiment 1-VAR except using a carrier texture containing micropatterns with two orthogonal orientations, as in Fig 1a (bottom). For this experiment the micropatterns were elongated (2:1 aspect ratio instead of 1:1 as in Experiments 1, 2), to give a narrower orientation bandwidth. In addition the density was reduced (512 micropatterns instead of 2048), to create a stronger percept of oriented structure in the carrier texture.

Modeling approach
Modeling framework. Most previous psychophysical SI methods assume that the decision variable arises from a weighted linear sum of the values (e.g. pixel luminances) of the sensory stimulus [3,5], essentially modeling the entire chain of brain mechanisms involved in stimulus-processing and decision-making by a single linear filter followed by a threshold. Here we propose a more general approach which we refer to as psychophysical network modeling. In this scheme, the decision variable is a function of the stimulus-dependent activities of a set of "hidden units", each of which may be sensitive to different aspects of the stimulus. This idea closely parallels recent modeling of neural responses by combining the outputs of multiple nonlinear subunits selective for different stimulus features [21][22][23][24].
Given a classification task for a stimulus x 2 R d which elicits one of two behavioral alternatives (b = 1, b = −1), one may define a generic psychophysical model as where σ is a psychometric function, u(x, θ) denotes the stimulus-dependent input to the psychometric function whose parameters θ characterize the sensory processing model, and γ denotes the lapse rate. The stimulus inputs x may be either the stimulus parameterization itself (e.g. pixel luminances), I, or a transformed version of the stimulus parameterization (x = ϕ(I)).
We define a psychophysical network model by making the additional assumption that the input u(x, θ) to the psychometric function is some function of the activities s 1 , � � �, s K of a set of K hidden units, i.e.
uðx; θÞ ¼ Fðs 1 ; � � � ; s K ; vÞ; ðM:3Þ where v is a set of parameters characterizing how the s i are combined. Perhaps the simplest combination rule is a weighted linear summation, i.e.
Note that we can have more than 2 hidden units (K > 2) even though there are only two behavioral responses (e.g. model of Fig 12a). We assume that the hidden unit activities are a nonlinear function of the projection of the features x onto some filter w i , Various standard models used in machine learning are special cases of (M.6). For instance, standard binary logistic regression [44] used for simple yes/no detection corresponds to the case of γ = 0 and a single hidden unit whose nonlinearity is a simple linear pass-through (h(u) = u) with fixed output weight v 1 = 1, so that (M.6) reduces to pðb ¼ 1jθ; xÞ ¼ sðv 0 þ w T i xÞ: ðM:7Þ Likewise, with γ = 0 and the sigmoidal hidden unit nonlinearity h(u) = 1/(1 + e −u ), this model (M.6) becomes a three-layer connectionist neural network [44,112].
Filter-rectify-filter model. A psychophysical network model for detecting whether a second-order edge is oriented left-or right-oblique is illustrated in Fig 2. In this model, the inputs x are the down-sampled outputs of a bank of first-stage energy filters resembling V1 complex cells applied to the stimulus image I, a transformation which we represent as x = ϕ(I). There are two second-stage filters which serve as inputs to the hidden units. One hidden unit (L) learns a set of filter weights (blue lines) to detect a left-oblique edge from the pattern of activity in the first-stage subunits, while the other hidden unit (R) learns a set of filter weights to detect a right-oblique edge. The filter outputs are passed through a rectified power-law nonlinearity h (u) = |u| α (blue curve) to generate the hidden unit activities s L , s R . Note it is important for h(u) to be even-symmetric, i.e. h(u) = h(−u), to provide envelope phase-invariance-otherwise the model would require four rather than two hidden units, since in our psychophysical experiment the envelope phase is randomized on each trial.
The hidden unit activities s L , s R are combined, assuming fixed output weights (v R = 1, v L = −1), and a bias term v 0 is added to obtain a decision variable u = s R − s L + v 0 . The fixed output weights are a consequence of our choice of a power-law nonlinearity, since for all inputs x, filters w and output weights v, the term v|w T x| α is identical for all v, w satisfying vkwk α = C, where C is some constant and kwk denotes the norm of w-see [113,114]. Many variants on this basic FRF model architecture are possible, and fitting multiple models to a given dataset could permit us to investigate questions about FRF model architecture [29,40,41].
A model of the form (M.6) arises naturally where there are two signals representing the two stimulus alternatives whose activities are compared to make a decision, for instance, the outputs of the two subunits in Fig 2. Subtracting the activities s 1 − s 2 and adding a bias v 0 and Gaussian internal noise ε~N(0, ρ 2 ) models the proportion of times behavioral alternative b = 1 is chosen as where F is the cumulative unit normal distribution and μ = s 1 − s 2 + v 0 . Using the approximation F(u) � σ(ku), where σ is the sigmoid function and k is optimized to best approximate F (u), we can re-write (M.8) as internal noise parameter β is simply absorbed into the magnitudes of the second-stage filter weights w i and bias v 0 . Therefore, larger filter norms will correspond to larger values of β and lower internal noise. In the analysis here, the norms of the w i were not solely determined by the internal noise since they were also constrained by the Bayesian prior. However, we found that the model provided an excellent fit to our psychophysical data without explicitly optimizing an additional internal noise parameter, so with the exception of only one analysis for Experiment 1-FIX (S13 Fig) we did not optimize a noise parameter. Model estimation and regularization. Here we simplify notation by combining all model parameters and the lapse rate into the single parameter vector η. Given psychophysical stimulus-response data D ¼ fx i ; b i g n i¼1 , and prior constraints p 0 (η) (possibly uninformative) on the model parameters, we pose the estimation as a machine learning problem of finding the maximum a posteriori (MAP) estimate of η, given by: η ¼ argmax η ln pðηjDÞ ¼ ln pðDjηÞ þ ln p 0 ðηÞ: ðM:10Þ In this framework, the prior constraints p 0 (η) are often used to prevent over-fitting (Bishop, 2006), which can be a serious problem due to the relatively large number of parameters (hundreds or thousands) being estimated from a limited number (generally < 10K) of psychophysical trials. Various techniques have been applied to help reduce the problems of noise and overfitting, including Bayesian priors [51], radial averaging [8] or using simplified low-dimensional stimuli [10]. Here we considered two different Bayesian methods for regularization of the second-stage filter weights. Under a prior assumption that the weights are from a multivariate Gaussian distribution, a "ridge-regression" penalty [44] is applied to each of the secondstage filter weight vectors w i (Fig 2, blue lines)-this encourages small weight magnitudes, to prevent fitting noise. The prior is given by ðM:11Þ where w ij denotes the j-th element of filter w i and the hyperparameter λ controls the strength of the penalty (smaller λ induces a stronger penalty). Optimizing λ using k-fold cross-validation [44], λ = 0.1 − 1.0 provided a good trade-off between fitting to the training data and generalizing to novel data in the test set in Experiment 1-see S1 and S2 Figs. This parameter was re-optimized for each of the two models in Experiment 3 (which differed in architecture from those used in Experiments 1, 2) using the same procedure (S19 Fig). We also considered the effects of an additional prior q 0 (w i ) favoring second-stage weights that form a spatially smooth filter [115]. Such a prior was implemented by constructing a matrix S 2 whose rows are the 2-D discrete Laplace operator where p is the down-sampled image dimensionality. This smoothness prior is specified by with hyper-parameter ω controlling the extent to which roughness is penalized (smaller ω induces a stronger penality). Optimizing ω using k-fold cross-validation with λ = 1, there was a slight improvement in generalization when the smoothness prior (M.13) was also included, with the value of ω = 1 providing a good tradeoff between fitting the data and generalizing to novel test data in Experiment 1 (S3 and S4 Figs). Adding this additional smoothness constraint reduced fitting to the training dataset, but improved generalization of the model to novel data (S5 and S6 Figs). Numerical optimization. Second-stage filter weights w i and the exponent α of the hidden unit nonlinearity were estimated using alternating coordinate ascent in a manner similar to the EM algorithm [116]. Weights were optimized for one filter w i at a time while all other filters w j6 ¼i and the power law exponent α were fixed. The filter weights, starting with randomized initial values, were optimized using a quasi-Newton method (L-BFGS). To estimate the power law exponent α, all of the second-stage filters were fixed and a 1-D line search was performed over the range [0.1,4.0]. The optimizations for w i and α are not log-concave, which means that local maxima may be possible [117]. In practice we usually found robust and consistent estimates of model parameters across optimization runs starting from several starting points. We found that sometimes the search converged on a "local maximum" consisting of only a single non-zero filter. Therefore, as an additional precaution, after the first round of optimization (or at least 4 total rounds depending on the experiment) we always re-initialized the search at a point in parameter space where the filter that was smaller in magnitude was set to be mirror-symmetric to the current value of the larger one. The optimization was then allowed to proceed as usual for several more rounds before termination. This procedure had the effect of eliminating any problems with local minima.
In preliminary investigations the estimated lapse rate γ was always zero, so this parameter was not included in our final model. Previous work has shown the best estimates of lapse rates when a large number of easy trials are included [118]. This was not the case in our study (which focused trials near threshold), nor in the general classification image literature, which tends to use near-threshold stimuli and does not typically estimate lapse rates [5].
Let M 1 , M 2 be two binomial models, each predicting for a fixed stimulus, positive response (r = 1) probabilities of p 1 , p 2 and negative response (r = −1) probabilities 1 − p 1 , 1 − p 2 . Assume that we perform n experimental trials and observe proportion p � positive responses and (1 − p � ) negative responses. Let D = r 1 , . . ., r n denote the observer responses. The Bayes Factor is and taking the natural log we obtain where L k = p(D|M k ) for (k = 1,2). Expanding the log-likelihoods, we obtain Noting that the bracketed term in (A.4) is simply the negative cross-entropy between two binomial models, and using the fact that the cross-entropy C(p, q) relates to the entropy H(p) and the KL divergence D KL (p||q) by the relation C(p, q) = H(p) + D KL (p||q), we can re-write (A.4) as We see from the final result (A.7) that the Bayes factor magnitude is exponential in the number of trials (n), meaning that small differences in model performance (measured by the difference in the KL divergences) over a large number of trials can lead to very large values for the Bayes factor. Similarly, we see from (A.6) that the standard model selection criterion 2ln B 21 [43] grows linearly as n, meaning that a small difference in performance may accumulate over trials to produce large values of this criterion.
Concretely, consider the case where p � = 0.75, p 1 = 0.73, p 2 = 0.74 for n = 5000 trials. Qualitatively, we see that both models are slightly inaccurate, with Model 2 agreeing with the observer more than Model 1, but with a difference of only 0.01. Applying formula (A.6) gives us ln B 21 = 3.8458 and therefore 2ln B 21 = 7.6916, corresponding to a "strong" preference for Model 2 according to standard criteria [43]. Re-working this example with a slightly larger difference in performance (0.02) by setting p 1 = 0.72 yields a "very strong" preference for Model 2 (2ln B 21 = 20.2224), and a 2.46 × 10 4 fold preference for Model 2. This effect is even more pronounced for predictions of very high (or low) success rates: setting p � = 0.95, p 1 = 0.92, p 2 = 0.94 yields 2ln B 21 = 60.4679, greatly exceeding the standard criterion (10) for "very strong" evidence in favor of Model 2.  7. (a) Scatterplots of model vs. observer performance, averaged across observers, for all test folds (N = 18 folds, 6 per observer). We also show the stochastic AVG model with an internal noise parameter, β, optimized to fit observer performance (on the training set, which was not used for testing), denoted AVG-SOP, bottom left panel. Values of the noise parameter β for each observer are shown in S4 Table. (b) Difference between observer and model performance (observermodel) for each individual test fold (3 observers, 6 folds per observer) for all models shown in (a). Lines show 95% confidence intervals of the difference (binomial proportion difference test). (TIF)