Beyond ℓ1 sparse coding in V1

Growing evidence indicates that only a sparse subset of a pool of sensory neurons is active in the encoding of visual stimuli at any instant in time. Traditionally, generative models have replicated such biological sparsity by using the ℓ1 norm as a penalty, owing to its convexity, which makes it amenable to fast and simple algorithmic solvers. In this work, we use biological vision as a test-bed and show that the soft thresholding operation associated with the ℓ1 norm performs substantially worse than thresholding functions suited to approximating ℓp with 0 ≤ p < 1 (including recently proposed continuous exact relaxations). We show that ℓ1 sparsity requires a pool with more neurons, i.e. a higher degree of overcompleteness, to maintain the same reconstruction error as the other methods considered. More specifically, at the same sparsity level, the thresholding algorithm using the ℓ1 penalty requires a dictionary with ten times more units than the proposed approach, in which a non-convex continuous relaxation of the ℓ0 pseudo-norm is used, to reconstruct the external stimulus equally well. At a fixed sparsity level, both ℓ0- and ℓ1-based regularization develop units with receptive field (RF) shapes similar to those of biological neurons in V1 (and a subset of neurons in V2), but ℓ0-based regularization reconstructs the stimulus approximately five times better. Together with recent metabolic findings, our results indicate that for V1 to operate efficiently it should follow a coding regime whose regularization is closer to the ℓ0 pseudo-norm than to the ℓ1 norm, and they suggest a similar mode of operation for the sensory cortex in general.


Introduction
Sensory neurons produce a variable range of responses to stimuli, the most frequent one being inactivity [1,2]. To explain this, Horace B. Barlow hypothesized that the task of sensory neurons is not only to encode in their activity an accurate representation of the outside world, but to do so with the least possible number of active neurons at any time [3]. Since then, growing experimental evidence across species and sensory areas has confirmed these claims of sparse activity [4][5][6][7][8].
Using Barlow's hypothesis as an optimization principle, Olshausen and Field showed that a neural network equipped with a learning algorithm that is set to reconstruct natural images with sparse activity constraints develops units (atoms) with properties similar to the ones of receptive fields (RFs) of simple cells in the primary visual cortex (V1), i.e. they are bandpass, oriented and spatially localized [9]. The model proposed by the authors belongs to the family of generative algorithms which represent a stimulus as a linear combination of atoms taken from an overcomplete dictionary, i.e. a set of vectors with more basis vectors than the dimension of the stimuli. In the context of V1, the vectors from the dictionary and their accompanying coefficients correspond to the neurons' RFs and activities, respectively.
Computationally, overcompleteness comes with a number of advantages: the input can be in a compressed form [10], the emerging vectors in the overcomplete dictionary are shiftable, and transformations of the input image such as rotations or translations can be represented by smooth changes in the coefficients [11]. Experimental findings in the macaque show that overcomplete dictionaries reflect the expansion of inputs in layer 4Cα of V1 compared to the lateral geniculate nucleus (LGN) input to it; approximately, 30 lateral LGN neurons send their axons to a V1 hypercolumn containing about 3000 excitatory and 1000 inhibitory neurons [12][13][14].
Sparse approximation aims to find a linear combination of the dictionary vectors that has few nonzero coefficients but still adequately represents the input signal. Ideal sparse approximation requires the minimization of the discontinuous and nonconvex ℓ0 pseudo-norm, which counts the number of nonzero coefficients, combined with some data-fitting term. However, this problem is NP-hard, as its solution requires an intractable combinatorial search [15, p.418]. Greedy pursuit methods are practical alternatives which bear resemblance to neural spiking processes [16], yet their efficiency can be improved. In many applications, the ℓ0 pseudo-norm has been replaced with its convex relaxation, the ℓ1 norm, defined as the sum of the absolute values of the coefficients. The ℓ1 relaxation has become widespread in sparse coding due to its convexity, and because under certain conditions [17,18] solutions of the ℓ1-penalized sparse coding problem coincide with those making use of ℓ0 regularization. In general, however, ℓ1-based models show inferior results in terms of sparsity [19,20].
Over the last decade, advances in optimization theory and in the field of compressed sensing [21] have provided several tools allowing for the replacement of the ℓ 0 pseudo-norm with tractable functions approximating it. The use of such approaches provides solutions for many perceptual and behavioral tasks that are in line with the energy constraints in the brain, unlike exact solutions that need perfect prior knowledge and costly computations [22].
In this work, we examined different sparse coding algorithms relying on tighter thresholding functions associated with ℓq penalties, with 0 ≤ q < 1. We found that their solutions induce sparsity to a greater extent than the ℓ1 method (soft thresholding) while maintaining the same reconstruction of the signal. As a further penalty we used the Continuous Exact ℓ0 relaxation (CEL0) [23], which produced the sparsest codes for small degrees of overcompleteness.
We then analyzed the RFs learned by the resulting sparse coding algorithms and compared them with each other and with the RFs found in the visual cortex of non-human primates. We found that all algorithms yield localized, oriented RFs. The tuning width of the RFs showed variability, a characteristic that is also present in V1 cells [24,25]. In accordance with the oblique effect and its representation in the visual cortex [26,27], we found a preference towards the vertical orientation, both in terms of overrepresentation of RFs tuned to it and of their increased sensitivity.
We show that for a relatively small degree of overcompleteness (approximately 2 times the size of the input stimulus), the sparse approaches considered produced both sharply and broadly tuned orientation selective RFs, similarly to V1 neurons in the cat and the macaque [25,28]. In terms of performance, when keeping sparsity constant for all methods, we found that soft thresholding requires 10 times more units (a 20× degree of overcompleteness) to reconstruct the input image patch as well as CEL0. The other methods considered, relying, e.g., on ℓ 1/2 minimization [29] and on the hard thresholding algorithm [30] are inferior to CEL0 in terms of reconstruction performance but still superior to soft thresholding, and show more robust convergence to a solution for a dictionary with larger degrees of overcompleteness compared to CEL0.
By definition, ℓ 1 regularization employed by soft thresholding limits average neural activity (and subsequently metabolic energy consumption) rather than the number of neurons [9,31]. Here, we show that the regularizers pushing for less units (not smaller activities) develop RFs with variability in their response patterns, a feature which is present in V1 [25,28,32]. In line with our results, a recent study on mice showed indeed that natural images could be decoded from a very small number of highly active V1 neurons, and that diverse RFs ensure both reliable and sparse representations [33].
Neural activation and learning based on ℓ1 minimization could in theory accommodate the variability in V1, since there is a 100-fold size expansion from LGN to V1 [12][13][14]. But this 100× degree of overcompleteness is misleading, since V1 neurons are divided into many different pools with different objectives and belong to different visual streams [34][35][36], even though there may be some overlap in their functionalities [37]. Given, then, that the expansion of V1 relative to LGN is divided among different image reconstruction tasks, the observed optimality of brain processes suggests that neural activation is regularized in such a way that it produces the variability of RFs found in the visual cortex with the smallest degree of overcompleteness. This in turn indicates that V1 uses heuristics that are closer to approximating ℓ0 penalties than ℓ1 ones.

Image dataset and preprocessing
For our tests, we used a selection of 137 natural images from van Hateren's database [38]. These images contained neither artificially created structures nor significant blur, as in [39]. We performed the same preprocessing described in [40]: first, we rescaled each image separately between zero and one; then we normalized it by subtracting the image mean from each pixel value and dividing by the image standard deviation. The resulting zero-mean, unit-variance images were then passed through a whitening filter in order to emulate the response of retinal ganglion cells. The images were finally rescaled to a variance of 0.1. This value serves as a baseline error, i.e. the mean square error (MSE) between a preprocessed image and an image with only zero pixel values (produced when all the coefficients of a neural code are zero).
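For concreteness, the preprocessing steps above can be sketched in code as follows. The exact shape of the whitening filter is our assumption (the classic frequency ramp multiplied by a low-pass exponential, with cutoff f0 = 0.4 cycles/pixel); the text only states that a whitening filter emulating retinal ganglion cells was applied.

```python
import numpy as np

def preprocess(img, eps=1e-8):
    """Preprocessing sketch: rescale, normalize, whiten, set variance to 0.1.
    The whitening-filter profile f * exp(-(f/f0)^4) is an assumption."""
    # 1. rescale pixel values to [0, 1]
    img = (img - img.min()) / (img.max() - img.min() + eps)
    # 2. zero mean, unit variance
    img = (img - img.mean()) / (img.std() + eps)
    # 3. whitening in the frequency domain (assumed ramp * low-pass profile)
    n = img.shape[0]
    fx, fy = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n))
    f = np.sqrt(fx**2 + fy**2)
    filt = f * np.exp(-(f / 0.4) ** 4)   # zero at DC, attenuates high frequencies
    img = np.real(np.fft.ifft2(np.fft.fft2(img) * filt))
    # 4. rescale to variance 0.1 (the baseline MSE mentioned in the text)
    img = img * np.sqrt(0.1) / (img.std() + eps)
    return img
```

After step 4 the image variance equals the baseline error 0.1 by construction, so an all-zero code always yields an MSE of 0.1.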

Model setup and cost function
According to the linear generative model of Olshausen and Field [9], an image I ∈ R^M is described as a linear combination of vectors stored column-wise in a matrix Φ ∈ R^{M×N}. The scalar coefficients of this linear combination are collected in a vector r ∈ R^N, and an additive white Gaussian noise component ν ∈ R^M with ν ∼ N(0, σ²Id) models perturbations and uncertainty:

I = Φr + ν.    (1)

We consider the features (columns) of Φ to form an overcomplete dictionary, i.e. N ≫ M. Consequently, the inverse problem of finding r given I in (1) is ill-posed, since r may have an infinite number of possible solutions. To impose well-posedness, Olshausen and Field [9] considered a sparse regularization approach, defined in terms of the energy function:

E(r, Φ) = (1/2) ||I − Φr||²₂ + λ Σ_{i=1}^N c(r_i).    (2)

While the first term in (2) pushes towards the preservation of stimulus information, the second term acts as a regularization imposing a penalty on activity, with the relative weight of the two competing tasks controlled by a parameter λ > 0. The regularization term is a sparsity-promoting penalty that ideally encourages the number of active units to be as small as possible. For that, one would like to choose as c(·) the so-called ℓ0 pseudo-norm, which costs 1 whenever z ≠ 0 and 0 otherwise: c(z) = ||z||₀, with z ∈ R. However, as shown rigorously in several mathematical works (e.g., [41]), such a choice makes the problem of minimizing E in (2) NP-hard. A standard strategy, used in several sparse coding approaches, relies on the use of the convex and continuous ℓ1 norm as a relaxation, i.e., c(z) = |z| for z ∈ R. Under suitable conditions on the matrix Φ, this choice indeed guarantees that the computed solution coincides with the one corresponding to the ℓ0 pseudo-norm. The use of the ℓ1 norm is in fact well established in the field of compressed sensing and sparse signal/image processing [42].
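A minimal sketch of the energy function (2); the penalty c(·) defaults here to the ℓ1 choice c(z) = |z| and can be swapped for any of the costs discussed in the following sections.

```python
import numpy as np

def energy(I, Phi, r, lam, cost=np.abs):
    """Energy (2): quadratic data-fit term plus a sparsity-promoting
    penalty lam * sum_i c(r_i); cost defaults to c(z) = |z| (the l1 case)."""
    return 0.5 * np.sum((I - Phi @ r) ** 2) + lam * np.sum(cost(r))
```

With r = 0 the energy reduces to half the squared norm of the input, which is the baseline error regime mentioned in the preprocessing section.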
For general choices of c(·), the problem of finding both optimal sparse codes r* (coding step) and feature vectors Φ* for the given input stimulus I (learning step) can be formulated as the joint minimization of the energy function E with respect to both r and Φ, i.e.:

(r*, Φ*) ∈ argmin_{r, Φ} E(r, Φ).    (3)

In the following, we use alternating minimization (see, e.g., [43]) to solve the problem above. Below, we thus make precise the general approach for solving the coding and learning steps. Subsequently, we consider a few cost functions promoting sparsity in different ways.

Coding step
Our objective is to minimize the composite function E(r, Φ) in (2), which is defined as the sum

E(r, Φ) = f(r, Φ) + λ C(r),

with f : R^N × R^{M×N} → R₊ being convex and differentiable with L-Lipschitz gradient w.r.t. both variables, and C : R^N → R₊ being convex, proper and lower semi-continuous, but, generally, non-smooth. As far as the coding step is concerned, we then need an algorithm solving the structured nonsmooth optimization problem (3) w.r.t. r. A standard strategy for this is the proximal gradient algorithm (see [44] for a review). For a given step-size µ ∈ (0, 1/L], x⁰ ∈ R^N, Φ̄ ∈ R^{M×N} and t ≥ 0, the algorithm consists in the alternating application of the two steps:

x^{t+1/2} = x^t − µ ∇_r f(x^t, Φ̄),
x^{t+1} = prox_{µ,λC}(x^{t+1/2}),    (4)

where the proximal operator associated to the function λC(·) and depending on the step-size parameter µ is defined by:

prox_{µ,λC}(z) = argmin_{x ∈ R^N} (1/2) ||x − z||²₂ + µλ C(x).    (5)

It is common to denote by T_{µλ}(z) the thresholding operation corresponding to prox_{µ,λC}(z), which sets to zero the components of z whose magnitude is too small according to a thresholding rule defined by the choice of C and the thresholding parameter µ; see the following sections for some examples. Note that during coding the vectors in Φ̄ are kept fixed, so that the algorithm seeks the optimal activations for the given input image patches.

Learning step
During learning, the matrix Φ ∈ R^{M×N} is updated so that it reconstructs the input I as accurately as possible. The learning step is thus obtained by minimizing (2) w.r.t. Φ via gradient-type iterations. This step is easier, since Φ appears only in the smooth data-fit term and not in the cost term. Learning then proceeds for all t ≥ 0 via the iterative procedure:

Φ^{t+1} = Φ^t + η (I − Φ^t r) r^T,    (6)

where η > 0 is the learning rate, whose size has to be small enough to guarantee convergence. Note that, although this update of Φ does not depend explicitly on the particular choice of the cost function, it depends nevertheless on the current estimate of the coefficients r which, in turn, depends on the particular choice of C and, consequently, of T_{µλ}. After each learning step, we impose the column norms of the current iterate Φ^t to be equal to 1, though other normalization mechanisms could be explored [45].
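A sketch of a single learning update for one patch, assuming the unit-norm constraint is enforced by direct column renormalization after the gradient step:

```python
import numpy as np

def learn_step(Phi, I, r, eta=0.01):
    """One gradient step on the data-fit term w.r.t. Phi (update (6)),
    followed by renormalizing each column to unit l2 norm."""
    residual = I - Phi @ r                    # reconstruction error of the patch
    Phi = Phi + eta * np.outer(residual, r)   # descent on (1/2)||I - Phi r||^2
    Phi /= np.linalg.norm(Phi, axis=0, keepdims=True) + 1e-12
    return Phi
```

In the full algorithm this update is averaged over a batch of patches, with r recomputed by the coding step before each learning step.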

Thresholding operators
For different choices of the component-wise cost function c : R → R₊, different thresholding rules are derived. We consider below some particular choices of c, plotted in Fig 1A for comparison. For each choice, we then report the corresponding explicit thresholding operator, which sets to zero coefficients with small magnitudes; see Fig 1B for an illustration. As a technical note, we remark that in definition (5) we assumed the function C to be convex, so that the minimizer of the functional is uniquely defined thanks to the strong convexity of the composite function. Many of the cost functions considered below, however, are not convex; definition (5) still holds, but with a ∈ sign in place of the equality, since the set of minimizers may not be a singleton.
The iterative soft-thresholding algorithm (ISTA)

For c(r_i) = |r_i| for all i = 1, ..., N, sparsity in (2) is obtained by considering as regularizer the function C(r) = ||r||₁ which, from an algorithmic viewpoint, corresponds to the Iterative Soft-Thresholding Algorithm (ISTA) as a solver [30,46]. Thanks to the separability of the ℓ1 norm, the proximal operator can be computed component-wise [47]. Setting θ := µλ > 0 and S_θ(·) := T_θ(·), it holds for all i = 1, ..., N that:

S_θ(z_i) = sign(z_i) max(|z_i| − θ, 0).    (7)

This operation is known in the literature as the soft-thresholding operator, due to its continuity outside the vanishing thresholding region.

Fig 1. Cost functions (A) and corresponding thresholding operators (B): S_θ denotes the soft-thresholding operator associated with the use of ℓ1 (ISTA) and H_θ the operator of the iterative hard thresholding algorithm.

The iterative half thresholding algorithm
To favour more sparsity than the ℓ1 norm, a natural improvement consists in using the ℓq (0 < q < 1) pseudo-norm, i.e. setting c(r_i) = |r_i|^q. However, such a choice makes the optimization problem (2) nonsmooth and nonconvex and, in general, prevents the use of fast optimization solvers. One exception is ℓ1/2 regularization: Xu and colleagues showed in [29] that an iterative half-thresholding algorithm can minimize the ℓ1/2 pseudo-norm with an ℓ2 data-fit term, with the algorithm converging to a local minimizer in linear time [48]. Using the notation θ = λµ as before and denoting by Ξ_{θ,1/2}(·) the thresholding operator T_θ, thresholding can still be performed component-wise as follows:

Ξ_{θ,1/2}(z_i) = (2/3) z_i (1 + cos((2π/3) − (2/3) ψ_θ(z_i)))  if |z_i| > (54^{1/3}/4) θ^{2/3},  and 0 otherwise,    (8)

where:

ψ_θ(z_i) = arccos((θ/8) (|z_i|/3)^{−3/2}).    (9)

Despite its complex form, the thresholding function (8) is still explicit, hence its computation is very efficient.
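A component-wise sketch of the half-thresholding rule above; the clipping of the arccos argument and the small additive constant are numerical safeguards added here, not part of the original formula.

```python
import numpy as np

def half_threshold(z, theta):
    """Half-thresholding operator (Xu et al.): components with magnitude
    below (54**(1/3)/4) * theta**(2/3) are set to zero; the rest are shrunk
    via the closed-form cosine expression."""
    z = np.asarray(z, dtype=float)
    thr = (54 ** (1.0 / 3.0) / 4.0) * theta ** (2.0 / 3.0)
    arg = (theta / 8.0) * (np.abs(z) / 3.0 + 1e-12) ** (-1.5)
    psi = np.arccos(np.clip(arg, -1.0, 1.0))       # psi_theta(z_i)
    out = (2.0 / 3.0) * z * (1.0 + np.cos(2.0 * np.pi / 3.0 - (2.0 / 3.0) * psi))
    out[np.abs(z) <= thr] = 0.0
    return out
```

For large inputs the operator is close to the identity (the shrinkage vanishes), while everything below the threshold is zeroed, giving sparser codes than soft thresholding at the same θ.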

Iterative hard thresholding
The iterative hard thresholding algorithm was first introduced in [30] to overcome the NP-hardness associated with the minimization of the ideal problem:

E_{ℓ0}(r) = (1/2) ||I − Φ̄r||²₂ + λ ||r||₀,    (11)

for Φ̄ ∈ R^{M×N}. The idea consists in considering the following surrogate function, defined for r, z ∈ R^N as:

E^S_{ℓ0}(r, z) = E_{ℓ0}(r) − (1/2) ||Φ̄r − Φ̄z||²₂ + (1/2) ||r − z||²₂,    (12)

for which there trivially holds E_{ℓ0}(r) = E^S_{ℓ0}(r, r) for all r ∈ R^N. Minimizing (12) with respect to r can thus be seen as a strategy to minimize (11), as desired. The derived thresholding operator, denoted here by H_θ(·), performs element-wise hard thresholding following the rule:

H_θ(z_i) = z_i if |z_i| > √(2θ), and 0 otherwise.    (13)

A continuous exact ℓ0 penalty (CEL0)

As a further sparsity-promoting regularization, we consider the non-convex Continuous Exact relaxation of the ℓ0 norm (CEL0), thoroughly studied, e.g., in [23]. This choice can be thought of (as rigorously proved in [49]) as the inferior limit of the class of all continuous and non-convex relaxations of the ℓ0 pseudo-norm, with the interesting additional property of preserving the global minimizers of the ideal ℓ2-ℓ0 minimization problem one would need to solve, while reducing the number of local ones. For all i = 1, ..., N and parameter λ > 0, this choice corresponds to the cost function:

c(r_i) = λ − (||φ_i||²/2) (|r_i| − √(2λ)/||φ_i||)²₊,    (14)

where φ_i is the i-th column extracted from the matrix Φ, ||φ_i|| denotes its ℓ2 norm and, for all z ∈ R, the notation (z)₊ denotes the positive part of z, i.e. (z)₊ = max(0, z). For µ||φ_i||² < 1, the corresponding CEL0 thresholding operator reads:

T^{CEL0}_{µ,λ}(z_i) = sign(z_i) min(|z_i|, (|z_i| − √(2λ) µ ||φ_i||)₊ / (1 − µ||φ_i||²)),    (15)

where, note, µ and λ are here decoupled: the thresholding parameter is no longer their product, but depends on √(2λ), on µ and, component-wise, on the norm of the column φ_i, i = 1, ..., N of Φ. While computing the quantities ||φ_i|| can in principle be costly from a computational viewpoint, we remark that, by construction, Φ has unit-norm columns in our application, hence this computation is in fact not required.
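Sketches of the hard and CEL0 thresholding rules above; the CEL0 operator assumes µ||φ_i||² < 1 (as in the reconstructed rule), and with unit-norm columns its threshold reduces to √(2λ)µ.

```python
import numpy as np

def hard_threshold(z, theta):
    """Hard thresholding: keep z_i unchanged if |z_i| > sqrt(2*theta),
    set it to zero otherwise."""
    return np.where(np.abs(z) > np.sqrt(2.0 * theta), z, 0.0)

def cel0_threshold(z, lam, mu, col_norms):
    """CEL0 thresholding sketch (assuming mu * ||phi_i||^2 < 1): shrink
    component-wise, but leave large components untouched (exactness region)."""
    a = np.asarray(col_norms, dtype=float)
    thr = np.sqrt(2.0 * lam) * mu * a              # per-component threshold
    shrunk = np.maximum(np.abs(z) - thr, 0.0) / (1.0 - mu * a**2)
    return np.sign(z) * np.minimum(np.abs(z), shrunk)
```

Note the qualitative difference: soft thresholding shrinks every surviving coefficient by θ, whereas both hard and CEL0 thresholding leave sufficiently large coefficients unbiased, which is one reason they reach the same reconstruction error with fewer active units.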
Our objective for the following sections is to compare the sparse coding obtained with the different choices of cost function and thresholding operator above. In particular, as the coding step affects the learning step, our study focuses on how the choice of the sparsity-promoting penalty (related to its capability of approximating the ℓ0 pseudo-norm) affects the estimation of the receptive fields φ_i at convergence. In most of our simulations, the degree of overcompleteness of the dictionary is 2×, i.e. the number of atoms (N) is approximately twice the size (M) of the image patch the algorithm aims to reconstruct (500 atoms for a 16 × 16 patch vectorized into a 256-element vector). Previously, it has been shown for ISTA that a highly overcomplete dictionary, or a very sparse code (for a dictionary with fixed atoms), produces RFs with different functionalities [50,51]. In the following tests, we constrain the number of atoms in the dictionary (2× degree of overcompleteness), and test whether the different algorithms considered produce adequate sparse codes generating receptive fields with different functionalities, while at the same time having an acceptable reconstruction performance.

A measure for orientation selectivity: circular variance
To probe the orientation selectivity of each vector (unit) φ_i, i = 1, ..., N, estimated by the different models above, as well as to compare them with each other and with experimental data in V1, we used the circular variance measure (V ∈ [0, 1]) [24,52,53]. A unit with zero circular variance responds only to a particular orientation; a unit with a circular variance of one responds to all orientations equally. Values in between show some selectivity, with values closer to one indicating broader orientation selectivity than values closer to zero.
The circular variance of a vector φ is defined as V := 1 − |R|, where

R = Σ_k α_k exp(2iθ_k) / Σ_k α_k,    (16)

with α_k being the response of the unit at orientation θ_k (θ_k goes from 0 to π in k = 1, ..., 36 equidistant steps). A plot of α_k as a function of θ_k for a unit corresponds to its orientation tuning curve. We obtain the α_k values for all orientations θ_k of a unit φ_i as follows. We first find the spatial frequency to which the unit responds most strongly by taking the inner product of φ_i with a bank of sinusoidal gratings of different spatial frequencies, orientations and phases; the optimal spatial frequency is that of the grating giving the highest value. We then take the inner product of φ_i with the set of gratings of this spatial frequency, orientation θ_j and all the different phases, and keep the highest value, yielding α_j. We repeat this process for all orientations θ_k, and finally estimate the circular variance of unit φ_i from equation (16).
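Given a tuning curve, the computation of V can be sketched as:

```python
import numpy as np

def circular_variance(alpha, theta):
    """Circular variance V = 1 - |R|, with R the resultant (16) of the
    tuning curve alpha over orientations theta (radians, spanning [0, pi));
    the factor 2 accounts for the pi-periodicity of orientation."""
    R = np.sum(alpha * np.exp(2j * theta)) / np.sum(alpha)
    return 1.0 - np.abs(R)
```

A perfectly tuned unit (response at a single orientation) gives V = 0, while a flat tuning curve over equidistant orientations gives V = 1, matching the two limit cases described above.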

Sparsity of codes
We first compared the sparsity of the codes produced by the different penalty functions above, each employing 500 units (∼2× degree of overcompleteness). To make a fair comparison, we adjusted the methods' parameters λ and µ so that they all produce reconstructed images with the same MSE (the parameter values are shown in S1 Table). We ran the algorithms for 4000 batches, each containing 250 image patches. In all cases, the MSE over the last 500 batches is about 0.021 (see Fig 2A). As expected from preprocessing, the baseline error, i.e. the MSE corresponding to the image where all activations are zero, is 0.1. We found that for the same MSE, the CEL0, ℓ1/2 and hard thresholding algorithms produce sparser codes than ISTA, with ℓ1/2 and hard thresholding having very similar activity distributions and CEL0 corresponding to the sparsest solutions (Fig 2B). Unless otherwise stated, the same λ and µ parameters were employed in subsequent experiments.

The vectors φ_i, i = 1, ..., N, are updated (learning step) at each iteration on a batch of image patches. As discussed above, the different thresholding algorithms produce vectors φ_i showing several differences. We thus asked whether the different penalties considered have similar coding performances -in terms of sparsity and reconstruction error- when the vectors φ_i are not learned during coding. We probed this question by using a fixed dictionary Φ throughout the coding steps (we considered, in particular, all dictionaries output at convergence of the four algorithms). We observed a sort of invariance property with respect to the dictionary used: all thresholding algorithms show similar reconstruction errors and distributions of activity independently of the dictionary used in coding (S2 Fig). In general, the sparse codes from each method provide an estimation of the underlying sources of the original images.
To compare the sources generated by the methods (estimated r* sequences) with the actual ones, we created images from known sources (ground truth r̄ sequences). More specifically, we created ground truth images by multiplying each of the ground truth vectors r̄ with a fixed dictionary Φ_ISTA and adding different noise levels. The vectors r̄ were obtained by permuting a subset of vectors r, pre-computed by running ISTA, all producing input images with a mean square pixel value over 0.15, so that the zero solution is less likely to be optimal for r*. For each method the λ parameters were the same as in the previous experiments. We found that in the absence of noise, ISTA found codes that are sparser than the ground truth ones for most of the input image patches, with ℓ1/2 and hard thresholding producing even sparser solutions, and CEL0 producing the sparsest codes (Fig 3A). As we added noise to the images (while keeping the λ parameter fixed), this ranking of the algorithms in terms of sparsity did not change, but the number of nonzero activations they produced increased: CEL0, ℓ1/2 and hard thresholding produced on average sparser codes than the ground truth for moderate noise levels (50% noise), and only CEL0 produced sparser codes for very noisy inputs (100% noise) (Fig 3B and C).

Learned dictionaries and their connections to V1
Orientation selectivity of RFs. Experimental evidence indicates that neurons in V1 show great variability in their orientation selectivity: some neurons respond to a narrow band around a particular orientation, but most are responsive to a broader spectrum of orientations [24,25]. To examine the orientation selectivity of the vectors φ*_i produced by each thresholding algorithm and draw comparisons with experimental data, we used the circular variance measure [24,52,53] defined above. We found that the ℓ1 and ℓ1/2 penalties and the hard thresholding algorithm all have a similar distribution of circular variances, while CEL0 has a smaller peak at low circular variance values and a more gradual decrease for higher values (Fig 4), resulting overall in a more 'spread' distribution. This is in agreement with experimental data: populations of V1 neurons in the macaque do not show a sharp peak at low circular variance values but rather a more uniform distribution across small and large values (Fig 4; data taken from [24]).
Fig 4. Distributions of circular variance for the φ_i learned by the different thresholding algorithms (the area in all cases was normalized to sum to 1); V1 experimental data from [24] are included.

Perceptually, visual stimuli are better resolved when they are presented in the cardinal orientations-either horizontal or vertical-as opposed to oblique ones [54]. This behavioural oblique effect has been suggested to emerge in part from the over-representation of simple cells in V1 that respond to cardinal orientations, as shown by single unit recordings [26,27], optical imaging [55] and fMRI [56]. Moreover, single unit recordings indicate that cardinal orientations have narrower orientation tuning curves [27]. We found that for all algorithms, the vectors φ_i responding to the vertical orientation (π/2) had the narrowest orientation tuning curves, as indicated by their low circular variance values (Fig 5). The proportion of vectors φ_i responding maximally to the vertical orientation is, moreover, the highest compared to all other orientations (Fig 6). Both results agree with the experimental data.

Sparsity-induced variability of RFs
More striking differences can be observed after a visual inspection of the corresponding RFs (S3 Fig, S4 Fig, S5 Fig, and S6 Fig). In particular, we see that CEL0 not only produces oriented RFs like the other algorithms, but also circular RFs not generally found when using the other algorithms (Fig 7A). For ease of presentation, we focus here on a comparison between CEL0 and ISTA. The degree of sparsity (and overcompleteness) is one of the determining factors in the differentiation of a homogeneous set of RFs into a more variable one, with RFs serving different functions [50,51]. Taking the two RF classes obtained by CEL0 as a reference, we set the λ parameter of ISTA so as to reproduce (approximately) the same distribution of activity as CEL0 (Fig 8B). We refer to this choice as sparse ISTA. We found that, unlike its original instantiation (Fig 7C), sparse ISTA produced both circular and oriented RFs (Fig 7B, and S7 Fig for all the RFs), with an MSE approximately 3 times worse than that of the original instance of ISTA and of CEL0.

Impact of overcompleteness on reconstruction performance
As a final test, we asked how the reconstruction performance of the thresholding algorithms improves as we increase their degree of overcompleteness, i.e. the number of atoms, while keeping the level of sparsity constant (see S8 Fig for a more detailed explanation, and S2 Table for the λ values for the different methods and numbers of units). We found that the ℓ1 model needs approximately 5000 units to reach the reconstruction performance that CEL0 attains with 500 units (Fig 9). We also observed that for dictionary sizes greater than 2000 units (i.e., greater than 4× degree of overcompleteness), ℓ1/2 thresholding has a smaller reconstruction error than hard thresholding, with both performing better than ISTA for any dictionary size. CEL0 provides the best reconstruction up to 1000 units. For more units at this level of sparsity, the algorithm did not converge for certain patches, and those results are not reported.

Fig 9. Mean square error (MSE) of the different thresholding methods for several dictionary sizes. All methods have a better reconstruction error than ℓ1-based reconstruction for all dictionary sizes tested. The parameter λ has been adjusted for each dictionary size and algorithm so that the sparsity level is approximately stable (see S8 Fig). Each time we ran 1600 batches with 250 patches in each batch (400000 patches in total), and took the mean and standard deviation of the reconstruction error over the last 100 batches.

Discussion
We show that continuous, non-convex thresholding-based algorithms produce much sparser activations for the same reconstruction error than the classically used soft thresholding algorithm ISTA, which makes use of the convex ℓ1 regularizer. Among the penalties tested, CEL0 shows robustness to various levels of noise, producing sparse codes even for large noise levels, with ℓ1/2 and hard thresholding following in terms of performance. All iterative algorithms maintain their sparse coding level independently of the dictionary used. Sparsity is an optimization principle that pushes the RFs to diverge from a monolithic functionality into a richer variety [50], a characteristic that is present in V1 [28]. CEL0 produced this diversity at the smallest degree of overcompleteness (∼2×) for a set reconstruction error. When the same sparsity level is enforced for ISTA, its reconstruction error is about 3 times worse than CEL0's. We then maintained the same sparsity for different dictionary sizes and found that soft thresholding performs worse in reconstruction than all three competing methods. While CEL0 provides the best reconstruction performance, ℓ1/2 and hard thresholding converge to a sparse solution for a wider range of dictionary sizes.
Natural image statistics show complex redundancy, and an underlying purpose of sparse neural codes is to remove it from their representations [57][58][59], replacing it with a simpler form of redundancy: a prior on individual neuron activity that is highly peaked at zero [40]. In non-convex regimes, the distributions of neuronal activities move from Laplace-type to Dirac-like [31]. Our results suggest that the closer the units come to this distribution of activity, the better the generative model will match the diversity of RF shapes found in V1. It will be interesting to examine how this level of sparsity fits with neuron models that capture the highly nonlinear operations of dendrites [64,65]. We see that sparsity produces secondary effects in V1, such as orientation selectivity and variability in the RFs of neurons. This appears to be a common strategy in the brain. For example, it has been shown computationally that homeostatic processes, which aim to balance activity in the brain, also generate neural networks that endow RFs with context sensitivity [60], and connectivity patterns with different degrees of specificity, flexibility, and robustness [61].
We finally highlight that Rozell and colleagues showed in [62] that the coding step solved by iterative thresholding algorithms can be reformulated as a recurrent neural network, suggesting that an implementation producing sparse codes is plausible in the brain. In such a network, indeed, a unit that is active inhibits other units with similar response features, a process that effectively minimizes the energy function (2). Compared to the feedforward topologies prevalent in deep neural networks, this kind of recurrent competition induces lateral inhibition between units and has been shown to be more robust against noisy stimuli and adversarial attacks [63]. Our results indicate that an inhibition stronger than the one induced by the ℓ1 penalty could push for the generation of more diverse RFs, which is in agreement with recent experimental findings [33].

Supporting information

S1 Table. Values of the learning rate η, step-size µ and relative weight λ for the different algorithms for 500 units. The values of µ are the same for the different numbers of units, as shown in S1 Table.