Reconciling the statistics of spectral reflectance and colour

The spectral reflectance function of a surface specifies the fraction of the illumination reflected by it at each wavelength. Jointly with the illumination spectral density, this function determines the apparent colour of the surface. Models for the distribution of spectral reflectance functions in the natural environment are considered. The realism of the models is assessed in terms of the individual reflectance functions they generate, and in terms of the overall distribution of colours which they give rise to. Both realism assessments are made in comparison to empirical datasets. Previously described models (PCA- and fourier-based) of reflectance function statistics are evaluated, as are improved versions; and also a novel model, which synthesizes reflectance functions as a sum of sigmoid functions. Key model features for realism are identified. The new sigmoid-sum model is shown to be the most realistic, generating reflectance functions that are hard to distinguish from real ones, and accounting for the majority of colours found in natural images with the exception of an abundance of vegetation green and sky blue.


Introduction
A viewer sees an illuminated surface (the light arriving at their eye being determined by the intensity of the illuminant and the reflectance of the surface, both as a function of wavelength) and declares it 'green'. The causal sequence underlying this threads through Popper's three levels of reality [1]: world I, matter and sensation; world II, perception and cognition; world III, language and culture. Theories that link Popper's worlds are an important goal of science [2][3][4]; and colour presents a 'goldilocks' case study for this: not too simple, like temperature; nor too complex, like shape [5,6].
There is more than one way to link Popper's levels. Most prized are mechanistic models linking adjacent levels, and much has been achieved in colour science: electromagnetic models describe the interaction of the illumination with the scene [7]; optical ones, the formation of the retinal image [8]; electro-physiological, the retinal cone transduction of light into a nerve impulse [9][10][11]; and neuronal, the re-coding of sensory responses into opponent channel representations [12]. When levels are skipped, mechanistic models are not feasible, but accurate predictive models may still be; for example, colour naming atlases [13,14] that link sensory responses to language. Sketchier yet, consistency between distant levels can be shown, for example between the cognitive similarity structure of the Basic Colour Terms [15] and the spectral sensitivities of retinal cones [16].
This study aims for a consistency linkage: to establish a generative model of spectral reflectance functions. We will evaluate the fidelity of the model in two ways:
• its reflectance functions should look like natural ones, and
• the colours of those reflectances when illuminated should be distributed like those in the natural environment,
using machine learning and statistical methods to quantify and to compare rival models. As well as contributing to the scaffold of colour science that links across Popper's levels, such a model may be useful for other studies: for example, in understanding colour constancy, in which illumination and sensation vary while reflectance and perception are stable [17]; or in determining optimal gamut transformations between devices [18]. The relationship between the statistics of natural reflectance and natural colour has previously been studied using Principal Components [19,20] and Fourier Analyses [21]. PCA methods were used to synthesize reflectance functions as the weighted addition of a small (~6) number of components. The realism of these functions was assessed only informally but, ignoring that a few percent of such generated functions will have reflectance values outside the valid range [0,1], was clearly good. The distribution of colours of these reflectances was not assessed for naturalness. While PCA is conceptually simple, the model is not compact, requiring ~200 values to be specified (mean + 6 components × number of wavelengths sampled).
The same study [19] that considered PCA for analyzing spectral reflectance also proposed that natural reflectances were effectively band-limited in the frequency domain. A fuller analysis [19,21] of the frequency domain characteristics of reflectance functions showed that they were better described by a power-law decrease of energy with frequency, rather than a sharp cut-off. This led to a very compact model of spectral reflectance functions with only a few parameters (mean reflectance, spectral power, power-law fall-off exponent). This work only informally assessed the realism of the generated functions and their distribution of colours. As we will show later, with some adjustments both can be very good.
In this paper we will introduce methods to objectively quantify the realism of models of spectral reflectance, both in terms of individual functions and in terms of the distribution of their colours. We will apply these methods to previous PCA- and fourier-based models to understand their weaknesses, and will propose new models that are more realistic. We will arrive at a model that is simple and realistic, which we hope will find use in future research.
The paper is organized as follows. In section 2 we present datasets of spectral reflectance and natural colour. In section 3 we describe the methods we use to assess the realism of a model of spectral reflectance. In section 4 we describe formation models for the colour-from-reflectance process; using these formation models we show that the reflectance and colour datasets are not very consistent. In section 5 we describe previous models (PCA, fourier) of the distribution of spectral reflectance, and show that neither produces both realistic spectra and colours. In section 6 we extend these previous models (PCA+, fourier+), improving their realism, and we present a new sigmoid-sum model which is the most realistic. In section 7 we discuss the key features needed for models to be spectrally and colour realistic, and the realism of the sigmoid-sum model. In section 8 we summarize and conclude.

Spectral reflectance functions
A spectral reflectance function maps from the visible spectral interval Λ ≔ [380,720] nm to the reflectance range [0,1]. It specifies, as a function of the incident illumination wavelength, the energy of the light reflected by a surface as a fraction of that reflected by a white Lambertian surface.
For statistics on spectral reflectance we merge nine collections from two sources [22,23] (Fig 1). These collections vary in their wavelength range and sampling, but we extrapolate and interpolate them into a common scheme. Where extrapolation has been used (for example from 400nm to 380nm for the 'natural' collection) we consider this harmless for further analysis.
The collections vary in the type and number of materials measured. We presume that each collection was assembled with an eye to diversity, rather than as a random sample of a natural population. Overall, we consider the total dataset as informative of the appearance of reflectance functions (smoothness etc.) but not as a reliable model of their natural distribution.
We make a few informal observations on the functions shown in Fig 1.
• Increasing segments are more common than decreasing.
• Most functions have fewer than five extrema. The 'natural' collection has additional high-frequency structure presumed to be measurement noise.
• The functions can be parsed as mostly made of plateaus and slopes, occasionally sigmoid steps, less frequently humps, and rarely dips.
The collections vary in their wavelength sampling. For consistency all functions were resampled (cubic interpolation) at 128 wavelengths equally-spaced across Λ. When appropriate each collection is weighted equally, nullifying the variation in their size.
We have studied the frequency power spectra of the functions, revisiting the analysis performed in [21]. Numerical frequency analysis assumes periodic signals, which is clearly not the case here; if unaddressed, the resulting spectra are strongly influenced by the wraparound discontinuity. We deal with this by removing a linear trend from each reflectance and then apodizing it with a Gaussian window (μ = 550nm, σ = 50nm). The removal step is computed so that the apodized function is free of linear trend. After detrending and apodization we use an FFT to compute frequency power spectra, remove the three lowest frequency terms (because they are corrupted by the detrending), normalize each power spectrum to unit sum, average across a collection, and then take the square root to obtain an overall frequency amplitude spectrum.
In Fig 2 we plot the results as log amplitude vs. log frequency. As the figure shows, on such plots the amplitudes have a decreasing linear form for low to medium frequencies, and then a constant value for higher frequencies. The linear part arises from a power-law dependency of amplitude on frequency (i.e. a·f^(−γ)), while the constant part arises from measurement noise. We have fitted a model of this form to the data (see Fig 2), which recovered a power-law coefficient of γ = 2.3. Our results thus confirm the power-law frequency amplitude model proposed in [21], though the estimate there of γ = 1.8 was lower; larger coefficients correspond to smoother functions.
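As a concrete illustration, the detrend-apodize-FFT pipeline above can be sketched as follows. This is a minimal sketch: the function names and the window-weighted detrending solution are our own choices, with the window parameters taken from the text.

```python
import numpy as np

def amplitude_spectrum(refl, wl):
    """Detrended, apodized power spectrum of one reflectance function.

    refl : reflectance sampled at wavelengths wl (nm).
    Returns a unit-sum power spectrum with the three lowest
    frequency terms removed, as described in the text.
    """
    window = np.exp(-0.5 * ((wl - 550.0) / 50.0) ** 2)   # Gaussian apodizer
    A = np.vstack([np.ones_like(wl), wl]).T              # constant + linear
    # Choose the linear trend so that the *apodized* residual is free
    # of linear trend (window-weighted normal equations).
    coef = np.linalg.solve(A.T @ (window[:, None] * A), A.T @ (window * refl))
    detrended = (refl - A @ coef) * window
    power = np.abs(np.fft.rfft(detrended)) ** 2
    power[:3] = 0.0                   # drop terms corrupted by detrending
    return power / power.sum()

def collection_amplitude(refls, wl):
    """Average the unit-sum power spectra of a collection, then take
    the square root to obtain its amplitude spectrum."""
    return np.sqrt(np.mean([amplitude_spectrum(r, wl) for r in refls], axis=0))
```

On a log-log plot, the returned amplitudes should show the linear-then-flat form described above.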

Colour
For data on the colours present in the natural environment we use the RGB values of a large collection of images. This is a convenient alternative to the use of spectroradiometers [24] or hyper-spectral imaging systems [25]. While those methods do allow accurate prediction of the colour sensations provoked by scenes, their technical complexity makes it difficult to assemble large datasets. In contrast, RGB digital cameras are a simple-to-use, widely available technology from which large datasets already exist. The drawback of using digital cameras is that the accuracy of colour estimation is limited by the agreement between the null spaces of the camera and cone spectral sensitivities, but increasingly the compromise of RGB cameras has been accepted for statistical studies [26,27].
Any image database that we could use is open to criticism as to its 'naturalness', since the term is vague and, to the extent that it is not, is surely variable with time and place. We acknowledge this but choose the THINGS dataset [28], which has been assembled for research in psychology, neuroscience and computer science. This dataset consists of 26,107 images organized into 1,854 common object categories according to the main object present within them. Each image is 800×800 pixels and has been manually cropped from a larger source image. The source images were collected by internet search for each category, and selected from the results according to quality, variation and absence of artefacts such as text or borders. Examples are shown in Fig 3. Although the variation across images in their RGB content is a topic of great interest, here we simply compute the overall histogram for the entire dataset (equivalently the arithmetic mean of the per-image histograms). The image data is all 8-bit depth in each channel, but to ease later computations we compute the histogram at 5-bit depth, hence with 2^5×2^5×2^5 ≈ 33K bins. Since even 2 bits per channel is roughly sufficient to segregate the 11 Basic Colour categories [29], we are confident that this is adequate resolution. This mean histogram is visualized in Fig 4, with observations:
• There is a high-density core around the achromatic axis [27] containing 80% of the colours present in the images (Fig 4, top row).
• The core has roughly the shape of a thick, asymmetric, elliptic leaf with the central vein running along the achromatic axis. The plane of the leaf extends in the orange-blue direction, further into orange than blue. The thickness of the leaf is in the purple-green direction [27].
• Against the plane of the leaf lies an additional 'worm' of higher density extending from 'leaf green' towards black, presumably due to the commonality of vegetation in the environment; and along one edge, a 'fringe' extending from 'sky blue' towards white, presumably due to sky.
• The tails of the histogram (Fig 4, bottom row) are not a simple extension of the core.
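The mean histogram computation described above can be sketched as follows. This is a minimal sketch: the function name is ours, and images are assumed to be 8-bit RGB arrays.

```python
import numpy as np

def mean_histogram(images):
    """Arithmetic mean of per-image 32x32x32 RGB histograms.

    images : iterable of uint8 arrays of shape (H, W, 3).
    Each image contributes its own normalized histogram, so images
    of different sizes are weighted equally.
    """
    total = np.zeros((32, 32, 32))
    n = 0
    for img in images:
        q = img.astype(np.uint16) >> 3            # 8-bit -> 5-bit per channel
        idx = (q[..., 0] << 10) | (q[..., 1] << 5) | q[..., 2]
        h = np.bincount(idx.ravel(), minlength=32 ** 3).astype(float)
        total += (h / h.sum()).reshape(32, 32, 32)
        n += 1
    return total / n
```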

Assessing realism
Our aim is to develop a model which produces individually realistic spectral reflectance functions, with a realistic distribution of colours. We describe how we quantify realism.

Spectral functions
We assess the realism of a model of reflectance functions by how well they can be distinguished from empirical functions. We measure this using machine learning methods. Specifically, we train a classifier to distinguish between empirical and synthetic functions, and then test how well it performs on a test set. If the test classification is correct at chance (50%) levels we consider the synthetic functions to be realistic, but if performance is higher the classifier must have discovered distinguishing features so we consider them as unrealistic.
We use both random forest [30] and elastic-net regularized logistic regression [31] for classification, reporting the better performing of the two. We train a classifier using: as examples of empirical reflectances, 12 randomly selected from each of all but one of the collections (thus 96 = 12×(9-1) examples); and, as examples of synthetic reflectances, 96 functions randomly generated by the model. The size of each training set (only 12 reflectances per collection) is determined by the smallest collection, and the desire to represent the collections equally, to best capture the cross-collection variation.
We test the classifier using 12 random reflectances from the left-out collection and 12 randomly generated synthetic reflectances. We score the classifier by the rate at which it gives a higher confidence for an empirical test reflectance than for a synthetic one (equivalent to the area under the ROC curve). We repeat the assessment hundreds of times, leaving out different collections, and take the overall average score as the final measure. We perform two modes of assessment: one as described, and one colour-balanced, in which, in both train and test sets, each model-generated reflectance is colour-matched to an empirical reflectance. Colour-matching is achieved by creating model reflectances until one of the correct colour is generated. The colour-balanced mode prevents the classifier using the diversity of reflectance functions as a cue, forcing it to attend only to their individual form.
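The scoring procedure can be illustrated with a simple stand-in classifier. The sketch below uses a plain ridge-regularized logistic regression trained by gradient descent in place of the random forest or elastic-net classifiers named above; the AUC computation matches the description in the text (function names are ours).

```python
import numpy as np

def fit_logistic(X, y, l2=1e-2, iters=2000, lr=0.5):
    """Ridge-regularized logistic regression by gradient descent
    (a plain stand-in for the paper's elastic-net classifier).
    X : (n, d) features; y : (n,) labels in {0, 1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        z = np.clip(Xb @ w, -30, 30)
        p = 1.0 / (1.0 + np.exp(-z))
        grad = Xb.T @ (p - y) / len(y)
        grad[:-1] += l2 * w[:-1]                # don't penalize the bias
        w -= lr * grad
    return w

def auc(w, X_emp, X_syn):
    """Rate at which empirical examples get higher classifier
    confidence than synthetic ones (= area under the ROC curve);
    0.5 means the synthetic functions are indistinguishable."""
    score = lambda X: np.hstack([X, np.ones((len(X), 1))]) @ w
    return float(np.mean(score(X_emp)[:, None] > score(X_syn)[None, :]))
```

In the leave-one-collection-out procedure, `fit_logistic` would be trained on the 96+96 training reflectances and `auc` evaluated on the 12+12 test reflectances, with the result averaged over repetitions.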

Colour
We assess the colour realism of a model of reflectance functions by using a formation model (section 4) to compute the image colours of the reflectances; compiling those colours into an RGB histogram; and comparing that histogram to the mean image histogram. We use three well-known measures to compare RGB histograms.
The simplest measure is the Intersection Distance (ID) which focusses on how much mass the two histograms have in common. It measures the percentage of histogram density that is specific to one of the histograms; the lower it is, the more similar the histograms. ID focusses on the size of histogram bin count mismatches, ignoring colour distance.
A measure that does consider colour distance is the Earthmover's Distance (EMD), which measures the mean distance that mass must be transported, optimally, to transform one histogram into the other. We express the EMD relative to the side length of the RGB cube.
The third measure is the Jensen-Shannon Divergence (JSD), which measures how much information (in bits) is gained, on average, about which histogram a single randomly-picked colour originated from. Like ID, JSD ignores colour distance, but unlike ID it is particularly sensitive to bin count mismatches in the tails of the histograms.
To further introduce the measures we compute them for an example pair of histograms, one derived from the top half of the THINGS image dataset, and the other from the bottom half. These 'half histograms' are visualized in Fig 5, which also introduces swatches, an alternative to iso-density visualization. The figure shows that the top histogram (T) has extra pale blue and white colours, and the bottom (B) extra black and brown. This is due, no doubt (Fig 3), to more frequent sky/clouds above and shadows/earth below (see also [27]). The distance measures between these half histograms are: ID = 8%, EMD = 3.5% and JSD = 0.01. We will focus on ID in the remainder because of its easy interpretation.
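Two of the three measures are straightforward to compute on flattened, unit-sum histograms; a minimal sketch follows (EMD is omitted, since it requires an optimal-transport solver over the RGB cube; function names are ours).

```python
import numpy as np

def intersection_distance(p, q):
    """Percentage of histogram mass NOT shared between p and q.
    0 for identical histograms, 100 for disjoint ones."""
    return 100.0 * (1.0 - np.minimum(p, q).sum())

def jensen_shannon(p, q, eps=1e-12):
    """JSD in bits: expected information a single sample gives about
    which of the two histograms it was drawn from (0 to 1 bit)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```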

Colour-from-reflectance formation models
Our aim is to assess the consistency between reflectance and colour natural statistics. To do that we predict the distribution of colours that will arise from a sample or distribution of reflectances, and compare that to the distribution of colours observed in images. To predict colours from reflectances we require formation models. We consider two: a simple uniform-illumination model and a more complex Ostwald shading+specularity one.
The simple formation model assumes uniform average daylight illumination [32], specified as a spectral energy function D65: Λ → ℝ+; Lambertian surfaces; the CIE 2012 2° xyz colour matching functions m(λ) ≔ (x̄(λ), ȳ(λ), z̄(λ))^T (derived from [33]); and the standard XYZ-to-RGB conversion matrix M.

The simple formation model ignores shading (non-uniform illumination intensity) and highlights (specular reflection), which are ubiquitous in natural scenes (see Fig 3). Fig 6 illustrates how these processes affect colour: shading shifts colour towards black, and specularity towards white, neither process affecting hue [35]. The Ostwald formation model incorporates shading & specularity by making use of the Ostwald triangle representation [36], which organizes the possible effective colours of a homogeneous surface (Fig 6, right). The effective colours arise as partitive mixtures of the full colour, black and white; a partitive mixture being a weighted additive mixture, where the weights are positive with unit sum [37].
The Ostwald triangle can also be considered as organizing the effective reflectances of a surface. Let ρ be a surface reflectance function, and let (o_c, o_w, o_b) be a point in the Ostwald triangle, giving the weights of full colour, white and black in the mixture. The effective reflectance associated with that point is ρ_eff ≔ o_c·ρ + o_w.
We model the effects of shading and specularity in the colour-from-reflectance process using a distribution over the Ostwald triangle, to capture how often different degrees of shading and/or specularity occur during the formation of an image. We represent that distribution using a bivariate normal distribution over a latent variable domain l ∈ ℝ², which transforms to Ostwald coordinates. Fig 7 (left) shows Ostwald coordinates as a function of the latent domain, with an example bivariate normal distribution superimposed (green). This distribution over the latent domain transforms to a distribution over the Ostwald triangle (Fig 7, centre). In the example shown most of the mass is along the shadow series from full illumination to black, while the remaining mass is near the white specularity corner of the triangle. Fig 7 (right) shows a swatch for that distribution. It corresponds roughly to the distribution of colours arising from the sphere in Fig 6 (left).
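The effective-reflectance computation, and one plausible latent-to-triangle mapping, can be sketched as follows. Note that the text does not fully specify the latent transform; the softmax mapping below is our assumption, chosen only because it yields positive weights with unit sum, as the Ostwald coordinates require.

```python
import numpy as np

def ostwald_weights(latent):
    """Map a 2-D latent sample to Ostwald triangle coordinates
    (o_c, o_w, o_b): positive weights with unit sum.  The paper's
    exact transform is not reproduced here; a softmax over
    (l1, l2, 0) is one convenient choice with the right range."""
    z = np.array([latent[0], latent[1], 0.0])
    e = np.exp(z - z.max())
    return e / e.sum()

def effective_reflectance(rho, o_c, o_w):
    """rho_eff = o_c * rho + o_w: shading scales the reflectance
    towards black, specularity adds a flat white component."""
    return o_c * rho + o_w
```

A sample from the bivariate normal over the latent domain, passed through `ostwald_weights`, gives one draw of the shading/specularity state.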
Using the simple and Ostwald formation models, we have computed the RGB histograms that result from the dataset of empirical reflectances. As previously noted, we sample equally from the nine collections that form the dataset, even though they have different numbers of members. The process is (i) randomly sample a reflectance from the dataset, (ii) compute its colour according to the formation model, (iii) accumulate a count in the corresponding cell of the RGB histogram, (iv) repeat 1M times.
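Steps (i)-(iv) under the simple formation model can be sketched as follows. This is a minimal sketch: the illuminant, colour matching functions and matrix M are assumed to be supplied on the common 128-wavelength grid, and the function names are ours.

```python
import numpy as np

def reflectance_to_rgb(rho, illum, cmfs, M):
    """Simple formation model: uniform illumination, Lambertian surface.

    rho   : (128,) reflectance samples
    illum : (128,) illuminant spectral energy (e.g. D65) on the same grid
    cmfs  : (128, 3) colour matching functions (x, y, z columns)
    M     : (3, 3) XYZ -> linear-RGB conversion matrix
    Returns linear RGB clipped to [0, 1]; illum is assumed scaled so
    that a perfect white reflector maps to RGB (1, 1, 1).
    """
    xyz = cmfs.T @ (illum * rho)          # integrate over wavelength
    return np.clip(M @ xyz, 0.0, 1.0)

def histogram_from_model(sample_reflectance, n=1_000_000, **kw):
    """Steps (i)-(iv): sample reflectances, form their colours, and
    accumulate counts in a 32^3 RGB histogram."""
    hist = np.zeros((32, 32, 32))
    for _ in range(n):
        rgb = reflectance_to_rgb(sample_reflectance(), **kw)
        i, j, k = np.minimum((rgb * 32).astype(int), 31)
        hist[i, j, k] += 1
    return hist / hist.sum()
```

For the Ostwald model, step (ii) would first draw a shading/specularity state and apply it to the sampled reflectance before calling `reflectance_to_rgb`.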
Step (ii) is deterministic when using the simple model, and stochastic when using the Ostwald model. In the latter case we generate a sample from the bivariate normal Ostwald distribution, and convert it to Ostwald triangle coordinates; from these coordinates and the sampled reflectance we compute an effective reflectance; then compute its colour as per the simple model. Here, and whenever we use the Ostwald formation model, the five parameters of the bivariate normal distribution (Fig 7, left, green) have been tuned to minimize the ID between the resulting histogram and the mean histogram. Though the exact parameters vary, in all cases the optimized distribution is similar to the example shown in Fig 7.

Fig 8 shows the mean image histogram and the empirical histograms from the reflectance dataset when using the two formation models. For each histogram we use a gallery of visualizations that we will use throughout, with notes:
• The hue×saturation marginal is plotted with grey-level indicating log density, greatly exaggerating the higher saturation flanks, but allowing them to be seen.
• Saturation marginals typically peak at low, but non-zero, saturation, even though the hue×saturation marginal density peaks at achromatic. This can happen because there are more low-saturation histogram bins than achromatic ones.
• The Basic Colour Category marginal is computed using a soft assignment of histogram bins to Basic Colours, derived from a large corpus of colour naming data [39]. In detail, ~20K unconstrained colour naming responses to 600 RGB-specified colours were collected in an online experiment from ~1K subjects. A Gaussian-form model of the response rate for each colour name produced was fitted as a function of RGB coordinates. Response functions for variants of an individual basic colour (e.g. 'red', 'strong red', etc.) were summed. Colour terms that involved multiple basic colours were allocated to the involved colours depending on the syntax of the term (e.g. for 'orange-red' 50% of the weight went to 'orange' and 50% to 'red', whereas for 'reddish-orange' 66% went to 'orange' and 33% to 'red'). The combined basic colour response functions allowed a distribution of basic colour responses to be determined for any RGB value. For many RGB triples this was a sharp correspondence (e.g. 100% 'red') whereas for others it was fuzzy (e.g. 70% 'red', 20% 'orange', 10% 'pink'). The fuzzy assignment of RGB to basic colours has been included in the downloadable material for this publication. The area of the square indicates the mass in a bin.
Comparing the histograms in Fig 8 we observe:
• With the simple formation model the empirical histogram is too spiky, as the number of reflectance functions is small compared to the number of histogram cells; this is improved by using the Ostwald formation model.
• With the simple formation model the empirical histogram has far fewer very dark colours (and, to a lesser extent, very light colours) than the mean histogram; this is improved by using the Ostwald formation model.
• The saturation marginal of the empirical histogram is a good match to the mean histogram with the simple formation model, but becomes too low in saturation when using Ostwald shading.
• The hue marginals are all in fairly good agreement, except purples are rarer and blues more common in the mean histogram.
• The Basic Colour marginals are in fairly good agreement, except for achromatic colours with the simple formation model.
In Table 1 we give distances between the mean histogram and the empirical histograms computed using the two formation models. For context, recall that the distances between the top- and bottom-half histograms were JSD = 0.01, EMD = 3.5% and ID = 8%. The figures in the table show that distances are much smaller with the Ostwald than with the simple formation model. These reduced distances, together with the fact that the tuned Ostwald distribution (Fig 7, middle and right) agrees with expectation, validate the use of the Ostwald formation model as a necessary improvement over the uniform-illumination model.

Previous reflectance models
Instead of using a dataset of reflectance functions directly as a model of the distribution of natural reflectances, one can generalize from it in the form of a generative model. Two such models have previously been described. The best-known of these (PCA) uses Principal Components Analysis [40] to construct a high-dimensional normal approximation [19,20]. We note the following regularities of the PCA decomposition:
• The mean reflectance has a positive slope.
• There is a power-law like decay in the magnitude of components.
• Components are roughly sinusoidal with frequency increasing with rank.
• A fraction of generated functions have values outside the legitimate [0,1] range for reflectance, but this can be remedied by clipping.
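The PCA model's fit-and-sample procedure can be sketched as follows (a minimal sketch, with clipping to [0,1] as noted in the last bullet; function names are ours).

```python
import numpy as np

def fit_pca_model(R, k=6):
    """Fit the PCA reflectance model: mean function plus the first k
    principal components with per-component standard deviations.

    R : (n_functions, n_wavelengths) empirical reflectances.
    """
    mean = R.mean(axis=0)
    U, s, Vt = np.linalg.svd(R - mean, full_matrices=False)
    comps = Vt[:k]                        # (k, n_wavelengths) components
    sd = s[:k] / np.sqrt(len(R) - 1)      # stdev of the component scores
    return mean, comps, sd

def sample_pca(mean, comps, sd, rng):
    """Draw one synthetic reflectance: mean + normally-weighted
    components, clipped to the legitimate [0, 1] range."""
    w = rng.standard_normal(len(sd)) * sd
    return np.clip(mean + w @ comps, 0.0, 1.0)
```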
The second previous model (fourier) [21] builds on the regularities observed in PCA: specifically, that components are sinusoids of increasing frequency, and that their amplitudes follow a simple pattern. Like PCA it models the distribution of reflectance functions as a high-dimensional normal distribution but, rather than data-determined components, it uses sinusoids of increasing frequency (integer number of cycles over Λ), amplitudes which decay in power-law fashion with frequency, and random phases. The fourier model is very compact: all that is needed to specify it is (i) the value of the DC component, (ii) the overall amplitude of the fourier components, (iii) the exponent of the power-law decay.
In the original description of the fourier model [21] it is argued that it is advantageous to represent reflectance r ∈ [0,1] using an unbounded latent variable t ∈ ℝ, related by t→r ≔ (1+tanh t)/2 and r→t = tanh⁻¹(2r−1) (see Fig 10). This removes the need to clip out-of-range functions that occurs otherwise (Fig 9, bottom), and more significantly allows better representation of functions with extreme reflectance values. We have implemented the fourier model from [21] using the parameter values there specified: a DC value of −1.0, a stochastic standard deviation of 1.2 per wavelength (both values in latent domain units) and a power-law exponent of 1.8. Fig 11 compares synthetic reflectance functions to empirical ones, either in their overall distribution (top), or just in their detailed form (bottom). The figure shows that, to informal inspection, the PCA reflectances are a reasonable match in both aspects while the fourier ones are not.
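A minimal sketch of the fourier model follows, using the latent-domain representation and the parameter values quoted above. The normalization of the component amplitudes to a per-wavelength standard deviation of 1.2 is our reading of the specification, and the function names are ours.

```python
import numpy as np

def to_reflectance(t):
    """Latent -> reflectance: r = (1 + tanh t) / 2."""
    return 0.5 * (1.0 + np.tanh(t))

def sample_fourier(n_wl=128, dc=-1.0, sd=1.2, gamma=1.8, n_freq=32, rng=None):
    """One reflectance from the fourier model: a latent-domain sum of
    sinusoids (integer cycles over Lambda) with normally-distributed
    weights, power-law amplitude decay and random phases."""
    rng = rng or np.random.default_rng()
    x = np.linspace(0.0, 1.0, n_wl)               # position across Lambda
    f = np.arange(1, n_freq + 1, dtype=float)     # integer frequencies
    amp = f ** -gamma
    # Scale so the latent signal has per-wavelength stdev `sd`
    # (each random-phase sinusoid contributes variance amp^2 / 2).
    amp *= sd / np.sqrt(0.5 * np.sum(amp ** 2))
    phase = rng.uniform(0.0, 2.0 * np.pi, n_freq)
    coef = amp * rng.standard_normal(n_freq)
    t = dc + coef @ np.cos(2.0 * np.pi * np.outer(f, x) + phase[:, None])
    return to_reflectance(t)
```

With the negative DC value, generated functions are excursions from a low baseline reflectance, which is exactly the weakness discussed in section 7.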
In Fig 12 we compare the amplitude spectra of synthetic and empirical reflectance functions. In the low-to-mid-frequency range the amplitude fall-off is slower for synthetic functions than for empirical ones, corresponding to their rougher appearance in Fig 11. In the high-frequency range the empirical spectra have higher amplitude than the synthetic; this could easily be remedied by adding synthetic measurement noise to the model reflectance functions, but we consider it not worth modelling.
In Fig 13 we compare the RGB histograms that result from the PCA and fourier models (when using Ostwald formation) to the mean histogram. The PCA histogram is quite similar to the mean histogram but its saturations are too low; the fourier histogram is very different from the mean histogram, especially in its equal extension in the blue and orange hue directions.
In Table 2 we summarize the realism scores for the PCA and fourier models. We consider the most important scores to be reflectance realism as measured by the colour-balanced score, and colour realism as measured by the intersection distance (ID) score when using the Ostwald formation model. Based on these we consider PCA to be a modest improvement over using the reflectance dataset directly as a model; while fourier is slightly worse.

Improved and new reflectance models
We have developed PCA+, which aims to be an improved version of PCA. The differences are: (i) it operates in the latent (t) rather than the reflectance (r) domain, as innovated by [21]; (ii) after transformation to the latent domain, empirical reflectances are normalized to zero mean before computation of the PCA components; (iii) a tunable normally-distributed DC component is used; (iv) a tunable gain factor is applied to the stochastic components. Thus PCA+ has three tunable parameters, which are optimized to minimize the intersection distance between the resulting colour histogram and the mean histogram.
We have developed fourier+, which aims to be an improved version of fourier. The model has a normally-distributed DC term (two parameters), a normally distributed linear slope term (two parameters), and a tunable stochastic magnitude. As described in section 2.1, we have independently estimated the power-law coefficient from the reflectance dataset as 2.3, different from the original published estimate of 1.8 [21]. The five tunable parameters are optimized as for PCA+.

Table 2. Realism scores comparing the empirical dataset approach with the two existing models of spectral reflectance. For all metrics lower numbers indicate greater realism. The shaded columns contain the most important realism metrics. The left concerns individual reflectance functions, and the right the distribution of colours arising. The '+5' note for Ostwald-shading indicates the number of parameters tuned to minimize the ID scores.
We have developed a new sigmoid-sum model of reflectance functions, inspired by the qualitative appearance of our empirical dataset. The model is explained in Fig 14. It operates in the latent domain. Multiple sigmoid components are summed, each of the form a×erf((λ−c)/σ), where the error function erf() is the version with range [−1,1], and the amplitude (a), centre (c) and width (σ) are chosen randomly. Each sigmoid is constructed by first choosing its width using a log-normal distribution, then choosing its centre uniformly in the interval [380−2σ, 720+2σ] nm; the amplitude is chosen as a normally-distributed random variable, independently of the width and centre, and can be positive (increasing sigmoid) or negative (decreasing). The number of sigmoid components in the sum is a Poisson-distributed random variable. The sum of sigmoids is normalized to zero mean before a normally-distributed DC component is added. Finally, the sum is transformed from the latent to the reflectance domain.
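A minimal sketch of this generative procedure follows; the default parameter values are the tuned values reported in the text, and the function name is ours.

```python
import math
import numpy as np

WL = np.linspace(380.0, 720.0, 128)   # common wavelength grid (nm)

def sample_sigmoid_sum(n_mean=10.5, logw_mu=5.1, logw_sd=1.4,
                       amp_mu=0.11, amp_sd=0.36,
                       dc_mu=-0.51, dc_sd=1.03, rng=None):
    """One reflectance from the sigmoid-sum model: a Poisson number of
    randomly-placed error-function steps summed in the latent domain,
    zero-meaned, offset by a normal DC, then squashed to [0, 1]."""
    rng = rng or np.random.default_rng()
    n = rng.poisson(n_mean)
    t = np.zeros_like(WL)
    for _ in range(n):
        sigma = math.exp(rng.normal(logw_mu, logw_sd))   # log-normal width
        c = rng.uniform(380.0 - 2.0 * sigma, 720.0 + 2.0 * sigma)
        a = rng.normal(amp_mu, amp_sd)   # sign sets rising vs. falling
        t += a * np.array([math.erf((w - c) / sigma) for w in WL])
    t -= t.mean()                        # zero-mean before adding DC
    t += rng.normal(dc_mu, dc_sd)
    return 0.5 * (1.0 + np.tanh(t))      # latent -> reflectance
```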
The sigmoid-sum model has six tunable parameters: mean number of components; mean log-width; mean and sd of the amplitudes; mean and sd of the DC component. These were tuned as for PCA+ and fourier+. The optimal values were: DC component N[−0.51, 1.03], μ = 10.5 for the number of components, widths distributed as e^N[5.1,1.4], and amplitudes distributed as N(0.11, 0.36). The distribution of amplitudes is such that 62% of sigmoid components are increasing. The sd of the width distribution was chosen (given the tuned mean) so that the fourier spectra of the resulting reflectance functions had a power-law fall-off with exponent 2.3, as per the empirical data.

Fig 15 shows that the new models generate reflectance functions that are more realistic than the previous models. In Fig 17 we show the colour histograms that result from the new models, and all can be seen to be roughly equally similar to the empirical colour histogram. These figures show that the new models are difficult to rank by eye according to their spectral and colour realism. The quantitative assessments of realism presented in Table 3 do allow us to objectively compare them. Observations are:
• PCA+ produces less realistic reflectances than PCA, but slightly more realistic colours.
• fourier+ is a considerable improvement over fourier, and has more realistic reflectances and colours than PCA+.
• sigmoid-sum is the most realistic model in both key metrics, and in fact has the best scores in all metrics.

Key features for realism
Looking across the results we draw conclusions about the key model features needed for realism.

Spectral
• Variation in the mean reflectance, as well as variability about that mean, is important. fourier fails to be very realistic (Fig 11) because its functions have the form of excursions from a fixed low reflectance, rather than variations about an intermediate value that itself varies from function to function.
• Reflectance functions need to have the correct frequency content. PCA and fourier have an excess of mid-frequency power, which causes them to be obviously too rough. PCA+, fourier+ and sigmoid-sum have the correct frequency content and look more realistic as a consequence.
• Correct frequency content alone does not lead to the localized transitions that are visible in some empirical reflectance functions. That requires coordination of phases to align components. Only the sigmoid-sum model is able to reproduce these transitions, though possibly to slight excess.
• The explicit modelling of localized step transitions in sigmoid-sum gives rise to occasional localized peaks and troughs, in sufficient number to model their rare appearance in empirical functions.
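The frequency-content criterion above can be checked numerically. The following sketch estimates the power-law exponent of a sampled function's spectrum by least-squares fitting log power against log frequency; it assumes uniform wavelength sampling and is not the analysis pipeline used in the text, merely an illustration of the kind of measurement involved (an exponent near −2.3 would match the empirical fall-off reported earlier).

```python
import cmath
import math

def spectrum_exponent(values):
    """Least-squares slope of log-power vs log-frequency for a uniformly
    sampled function (power ~ frequency**slope, so smooth functions give
    strongly negative slopes and white noise gives a slope near zero)."""
    n = len(values)
    mean = sum(values) / n
    centred = [v - mean for v in values]
    xs, ys = [], []
    for k in range(1, n // 2):  # skip DC, keep positive frequencies
        coef = sum(centred[t] * cmath.exp(-2j * math.pi * k * t / n)
                   for t in range(n))
        xs.append(math.log(k))
        ys.append(math.log(abs(coef) ** 2))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
```

Applied to a single smooth sigmoid the estimate is strongly negative, while for independent noise it is close to zero, which is the "too rough" failure mode described above.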

Colour
• As previously observed in [21], the smoothness of empirical reflectance functions leads to orange and blue peaks in the hue marginal. We further note that the orange peak is much higher than the blue (Fig 12), and explain it as due to positive reflectance slopes being more frequent than negative ones. The PCA and PCA+ models capture this tendency in their mean reflectance functions. fourier fails to capture the tendency (observe the equal peaks in the hue marginal in Fig 13). This is fixed in fourier+, with its explicit modelling of a stochastic linear term which gives a positive slope for the majority of reflectances; and in sigmoid-sum, with its preference for individual sigmoid components to be increasing rather than decreasing.
• Purple is notably rare in natural images (Fig 8,

Realism assessed
The new sigmoid-sum model scored most highly for both aspects of realism, with the fourier+ model a close second. The sigmoid-sum realism scores are not perfect though, so we need to consider how realistic the model is in absolute terms. First consider the realism of the individual reflectance functions. This was assessed by the performance of a machine-learning classifier trained to discriminate between empirical and synthetic functions. The best classifier found was able to do this correctly 53% of the time. This is certainly a low success rate: one would need to observe ~500 trials to become fairly certain that the rate was above chance. However the performance is limited by the ability of the classifier to identify distinguishing features, and while machine-learning classifiers are capable of identifying distinguishing features that humans might miss, the same is true vice versa. Ultimately this quantitative assessment of realism should only be considered a supplement to assessment by eye. In Fig 18 we compare empirical and synthetic functions to allow the reader to make this assessment. The task is certainly possible but not easy, and the most obvious cues are measurement artefacts in the empirical data, such as plateaus at short wavelengths or high-frequency noise; we could easily simulate these and make the task very difficult. (Empirical functions are in the top row in the outer columns).
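The ~500-trial figure can be sanity-checked with a normal approximation to the binomial: at 500 trials a 53% success rate sits about 1.3 standard errors above chance, so "fairly certain" here corresponds to a fairly loose significance threshold. A minimal sketch:

```python
import math

def one_sided_p(success_rate, n_trials):
    """Normal-approximation p-value for observing `success_rate` correct
    over `n_trials` if the classifier were really at chance (0.5)."""
    se = 0.5 / math.sqrt(n_trials)            # std. error of a chance proportion
    z = (success_rate - 0.5) / se
    return 0.5 * math.erfc(z / math.sqrt(2))  # 1 - Phi(z)

# A 53% rate over 500 trials is only weakly distinguishable from chance
# (p is roughly 0.09); thousands of trials would be needed for a strict
# significance level.
p500 = one_sided_p(0.53, 500)
```

This supports the point in the text: the classifier's edge over chance is real but small, and hard to demonstrate from modest numbers of trials.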
Next we consider the realism of the colour histogram for sigmoid-sum (Ostwald). Table 3 quantified it as an intersection distance of 20%, a Jensen-Shannon divergence of 0.06 bits, and an Earth Mover's Distance of 1.0%. In Fig 19 we give a side-by-side comparison of swatches for the sigmoid-sum and mean image histograms. They are visually very similar, but some differences can be seen. The differences become more apparent in the intersection and excess histograms in the same figure.
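Two of the histogram-distance metrics just quoted have simple closed forms, sketched below for normalized histograms with matching bins (the binning used in the text, and the ground distance underlying the Earth Mover's Distance, are not reproduced here):

```python
import math

def intersection_distance(p, q):
    """One minus the histogram intersection: 0 for identical normalized
    histograms, 1 for histograms with disjoint support."""
    return 1.0 - sum(min(a, b) for a, b in zip(p, q))

def jsd_bits(p, q):
    """Jensen-Shannon divergence in bits: symmetric, and bounded by 1
    (attained for disjoint distributions)."""
    def kl(a, b):  # Kullback-Leibler divergence, base-2 logs
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

On this scale, the reported intersection distance of 20% and JSD of 0.06 bits both indicate histograms that mostly overlap.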
The similarities and differences between the model and empirical histograms can be understood from Fig 17 (2nd column). The figure shows that the sigmoid-sum model does a good job of accounting for the main leaf-shaped region of high density in the natural image histogram; colours from this region can thus be understood as arising from a simple statistical model of reflectance. The sigmoid-sum model however fails to model the additional 'worm' of green (hypothesized to arise from vegetation) and the 'fringe' of light blue (hypothesized to arise from sky) that the mean histogram exhibits. As shown by the excess histograms in Fig 19, it undershoots for the worm and overshoots for the fringe, capturing other blues along with sky blue. No doubt the sigmoid-sum model could be made more complex to capture both elements correctly, but we do not attempt that here. The model already has five tuned parameters for shading and specularity, and six for the reflectance model. The total count of eleven tuned parameters is not excessive, given that a tri-variate normal distribution has nine degrees of freedom and the mean histogram has a more complicated form than that.

Summary & conclusions
We have considered models of the distribution of spectral reflectance functions in the natural environment, assessing them by whether (i) they generate individually realistic functions, and (ii) the distribution of colours they give rise to is realistic. For both tests we assessed realism by comparison with datasets that we take as representative of the natural environment. We found that a dataset of actual empirical reflectance functions, while necessarily individually realistic, did not have a distribution of colours that was an excellent match to natural images. A common existing model of reflectance (PCA) produced fairly realistic functions and a better distribution of colours than the empirical reflectance dataset, while another existing model (fourier) was worse. We were able to slightly improve the PCA model (PCA+), and significantly improve the fourier model (fourier+), by incorporating stochastic constant and linear components. While the fourier+ model can be considered to produce realistic reflectances and colours, we were able to improve slightly on it with a novel sigmoid-sum model. We make available for download a large dataset of pre-computed model reflectance functions at doi.org/10.5281/zenodo.3385047, as well as the fuzzy membership functions needed to compute basic colour marginals. Our model accounts simultaneously for the statistics in two realms of the colour phenomenon: Popper's World I of physics, where spectral reflectance functions belong, and Popper's World II of perception, where colours do. Our model raises questions whose answers could extend the theory further down and up. Down into chemistry: why is there a tendency for natural reflectance to increase with wavelength? And up into cognition: are some colours more stable than others under illumination change, and does this correlate with colour categorization, as has been suggested [42]?
Fig 19. Compares the empirical and model image histograms (top) and shows the breakdown into intersection and excess histograms that give rise to the intersection distance of 20% (bottom). The failure of the model to capture the abundance of vegetation green is apparent in the empirical excess histogram (bottom-left). The model's broader range of blues than required to capture sky blue is apparent in its excess histogram (bottom-right).