When David Field and Bruno Olshausen showed that the primary cortex was getting inputs from our eyes as a set of sparse functions that looked like ridgelets and curvelets, it became obvious that one result was missing: If natural images are sparse and our eye system has sparse receptors, is there a way our brain finds a sparse decomposition of the world in a way that works in a linear fashion? The thinking goes that our brain is really capable of understanding scenes without an iteration process (an iteration process is nonlinear and has a high cost in terms of energy). When Emmanuel Candes and David Donoho showed that in fact, non-adaptive schemes using curvelets could decompose natural images it became obvious that a good parallel could be made between the physiology of the primary cortex and this new type of decomposition. But how do you do this decomposition ? While an m-term curvelet expansion of a scene can be thresholded and can rival with complex adaptive approximation schemes, it does not answer how the primary cortex eventually comes up with that m number.
The state of the art on our thinking about the primary cortex can be found here in this review by Graham and Field on sparse coding in the neocortex. It specifically addresses the bounds on the primary cortex induced by the metabolic constraints:
We conclude our discussion by returning to the issue of metabolic constraints. Could we argue that primary evolutionary pressure driving towards sparse coding is one related to the metabolic costs of neural firing? As noted earlier, both Attwell and Laughlin (2001) and Lennie (2003) argue that there are not enough resources to achieve anything but a low-activity system. Moreover, when we find sparse activity in frontal cortex (Abeles et al., 1990), it is more difficult to argue that the sparse activity must arise because it is mapping the sparse structure of the world. Even at early levels, if sparseness were metabolically desirable, there are a number of ways of achieving sparseness without matching the structure of the world. Any one of a wide variety of positively accelerating nonlinearities would do. Simply giving the neurons a very high threshold would achieve a sparse code, but the system would lose information. We argue that the form of sparse coding found in sensory systems is useful because such codes maintain the information in the environment, but do so more efficiently. We argue that the evolutionary pressure to move the system towards a sparse code comes from the representational power of sparse codes. However, we do accept that metabolic constraints are quite important. It has been demonstrated that at the earliest levels of the visual system, ganglion cells (Berry, Warland and Meister, 1997) and lateral geniculate nucleus cells (Reinagel and Reid, 2000) show sparse (non-Gaussian) responses to temporal noise. A linear code, no matter how efficiently it was designed would not show such sparse activity, so we must assume that the sparseness is at least in part due to the nonlinearities in the system and not due to the match between the receptive fields and the sparse structure of the input. As with the results show sparse responses in non-sensory areas, we must accept that metabolic may also be playing a significant role.
Well it's nice to acknowledge we have physical limitations, but to assume that a linear code cannot exist simply because we currently do not have a model for it, is probably assuming too much. So what do we know ? the primary cortex is a low energy system which basically removes from consideration any complex system that requires resources (like an iteration system). This situation favors a linear system but so far, we have not found a good model for that. There is something deeper still. Even if we knew much about the sparsity of a scene, understanding the brain is really about understanding how large amount of information is reduced when traveling from the eye into the brain. In other words, we need to reduce the amount of data that our megapixel sensor called the eye is bringing in, and we must do this very fast (30 times a second). To put things in perspective, let us take an example: let us imagine we are seeing a scene where somebody waves a hand. If we were to take a video of this scene, we would probably get a 40 MB avi file (uncompressed). That file could then be compressed to 1 MB using MPEG for instance. While the compression is impressive, it is not impressive enough. In effect, when our brain sees this video, it can remember the hand and how it moved. The movement is probably two dimensional and so the brain really remembers the two parameters needed to produce a hand that waves in the manner that is shown in the video. In other words, the brain is probably not storing 1 MB of information when it stores this hand waving activity, it is most probably storing how two parameters changed over time, which in many occasion is much less than 1 MB of data. The real question becomes: Is there a way to reduce that the 1MB information further ? We are not asking ourselves if the input or the receptor are sparse (it is a necessary condition), we are interested in answering the question on how sparser we can make this information by using the connection between these sparse elements. Can we reduce the dimensionality of the signal further and exploit it ?
Enters Dimensionality reduction: Ever since the discovery of dimensionality reduction schemes (LLE, Isomap, Laplacian-diffusion..) that are taking high dimensionality data and and are able to map them into low dimensionality manifolds, researchers have been trying to extend these techniques to wider sets of problems. In the cognition world for instance, Jonathan Pillow and Eero Simoncelli perform dimensionality reduction applied to neural models but it is not obvious how these techniques directly translate into a specific functionality of the primary cortex even if they take an example of a V1 cell. It is also not obvious how some of these techniques are robust to noise. But as stated earlier, there are different ways to go about dimensionality reduction. One of the most intriguing which has robustness built into it is Compressed Sensing. Compressed Sensing has the ability to produce a robust decomposition of a manifold. Mike Wakin looked into that during his dissertation and found that smooth manifolds can be readily compressed using Compressed Sensing thereby making it a very simple solution to dimensionality reduction (see R. G. Baraniuk and M. B. Wakin in Random Projections of Smooth Manifolds ) but as Donoho and Grimes had shown earlier, sharp objects such as arms, legs have edges and that makes the manifold non-differentiable. This is a problem because it means that one cannot easily extract parameters from a video if we have these sharp edges. In order take care of that problem, one can be inspired by Biology i.e. to smooth these images using Gabor wavelets as in the human vision system (Object Recognition with Features Inspired by Visual Cortex by Thomas Serre, Lior Wolf and Tomaso Poggio) and then use the random projection of smooth manifolds to eventually figure out the parameters of the movements ( for more information on how to do that see High-resolution navigation on non-differentiable image manifolds or the Multiscale Structure of Non-Differentiable Image Manifolds)
[in Object Recognition with Features Inspired by Visual Cortex by Thomas Serre, Lior Wolf, Tomaso Poggio one may note that one can only be struck by the pain the algorithm goes through into in order to be robust.
- S1: Apply a battery of Gabor filters to the input image. The filters come in 4 orientations θ and 16 scales s (see Table 1). Obtain 16×4 = 64 maps (S1)sθ that are arranged in 8 bands (e.g., band 1 contains filter outputs of size 7 and 9, in all four orientations, band 2 contains filter outputs of size 11 and 13, etc).
- C1: For each band, take the max over scales and positions: each band member is sub-sampled by taking the max over a grid with cells of size NΣ first and the max between the two scale members second, e.g., for band 1, a spatial max is taken over an 8 ×8 grid first and then across the two scales (size 7 and 9). Note that we do not take a max over different orientations, hence, each band (C1)Σcontains 4 maps.
- During training only: Extract K patches Pi=1,...K of various sizes ni × ni and all four orientations (thus containing ni × ni × 4 elements) at random from the (C1)Σ maps from all training images.
- S2: For each C1 image (C1)Σ, compute: Y = exp(−γ||X − Pi||2) for all image patches X (at all positions) and each patch P learned during training for each band independently. Obtain S2 maps (S2)Σi .
- C2: Compute the max over all positions and scales for each S2 map type (S2)i (i.e., corresponding to a particular patch Pi) and obtain shift- and scale-invariant C2 features (C2)i , for i = 1 . . .K.
]
Besides Wakin, Donoho, Baraniuk and others collaborating with them, few have made that connection. Yves Meyer makes a passing reference to the use of compressed sensing in physiology (in Perception et compression des images fixes.) Gabriel Peyre however is a little more specific here.
Analogies in physiology:I am mentioning Gabriel Peyre's work because he works on bandelets. I had mentioned bandelets before in the context of an announcement made by the company Let it wave (headed by Stephane Mallat ) where they showed that faces could be compressed down to 500 bytes or the size of a bar code.
This compressed sampling strategy could potentially lead to interesting models for various sensing operations performed biologically. Skarda and Freeman have proposed a non-linear chaotic dynamic to explain the analysis of sensory inputs. This chaotic state of the brain ensures robustness toward unknown events and unreliable measurements, without using too many computing resources. While the theory of compressed sensing is presented here as a random acquisition process, its extension to deterministic or dynamic settings is a fascinating area for future research in signal processing.
If bandelets provide a recognizable Faces nonlinearly with 500 bytes, one only needs 5 x 500 bytes = 2.5 KB random samples (within the meaning of Compressed sensing) of that Face to be able to reconstruct it. 2.5KB is better than 10 MP or 1 MB. The number 5 is the bound for compressed sensing (more specific asymptotic laws/results can be found in the summary of Terry Tao)
However, the strongest result so far is the one found by Mike Wakin on random projection on a manifold where he uses neighborhood criteria in the compressed sensing space to permit an even better reconstruction than just assuming sparsity. In the figure below, one can compare the 5 random projection results using the Manifold based recovery from the traditional algorithms used like Orthogonal Matching Pursuit and Basis Pursuit.
Naturally, an extension of this is target detection as featured in the Smashed Filter article from Rice. With this type of result/framework, I am betting we can go lower than 2.5 KB of universal samples to characterize a face. The connection between cognition and compressed sensing and autism is simple: Face processing seem deficient for people affected with autism and we don't know why. A model based on the elements I just mentioned might give an insight into this issue.
I guess there's nothing stopping one adding a "cost of firing" term (that favors sparsity) into just about any neural network model.
ReplyDeleteOn a bit of a tangent, have you looked at dual-control theory at all? This addresses control systems that balance the cost of obtaining information against the future value of having that information. Here the cost might be neural activity.
This is a good idea.
ReplyDeleteNoi have not looked at dual control theory but I will now that youmention it. Thanks.
Igor.