Criteria for sensory coding

JEAN-PIERRE NADAL
Laboratoire de Physique Statistique de l'E.N.S.
Laboratoire associé au C.N.R.S. (U.R.A. 1306), à l'ENS, et aux Universités Paris VI et Paris VII
Ecole Normale Supérieure
24, rue Lhomond, F-75231 Paris Cedex 05, France

and

NESTOR PARGA
Departamento de Física Teórica
Universidad Autónoma de Madrid
Cantoblanco, 28049 Madrid, Spain


This text is based on a talk given at the meeting "Towards a Theoretical Brain", Fondation des Treilles, April 1995.


Building on the original ideas of H. Barlow, a systematic approach to the modeling of sensory systems has been developed over the last ten years. The general scheme is as follows:

(1) bet on the task fulfilled by the particular sensory system under consideration;

(2) define a criterion (an objective function) that characterizes the performance of a system performing (or trying to perform) this task;

(3a) compute the optimal performance that could be obtained, from a mathematical point of view, given the signal-to-noise ratio (in particular taking into account the noise at the level of the receptors); (3b) compute the optimal performance also taking into account the architecture and other constraints specific to the system under study;

(4) compare with experimental data.

From step (3), one may hope to:

Step (1) may be relatively easy when dealing with simple animals. For instance, one finds motion detectors in the fly visual system, which provide velocity estimates readily used by the motor system. In such a case (see the contribution by W. Bialek at this meeting), the Bayesian inference framework (step 2) allows one to define the optimal estimator, given the statistics of the signal and the noise level in the receptors (step 3).

The situation is different when dealing with, say, the human visual system. Even though independent channels exist, it is clear that many different tasks have to be solved from the same incoming optical flow. One may thus assume that the first layers of the sensory pathway build a nonspecific neural representation, or "code", a priori efficient for further processing. Yet this hypothesis is not enough to specify an objective cost function (step 2). Indeed, various criteria have been proposed in the literature, several of them information theoretic. The simplest one, studied by various authors, is what R. Linsker has called the "infomax principle": one asks for a neural network that maximizes the mutual information between the output (the neural representation) and the input (say, the visual stimuli). The receptor and neural noises, together with the finite amount of available resources (number of neurons, synaptic resources), limit the amount of information the network can convey about the input; this limitation is what makes the maximization conceptually interesting, and in general a difficult practical problem.
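
To fix ideas, the infomax criterion can be stated compactly (standard Shannon definitions; the notation is ours and not tied to any of the works cited). With $S$ the stimulus, $V$ the neural representation and $W$ the network parameters (synaptic couplings and transfer functions), one asks for

    $$ I(V;S) \;=\; H(V) - H(V \mid S) \;=\; \int dv\,ds\; P(v,s)\,\log\frac{P(v,s)}{P(v)\,P(s)} , \qquad W^{*} \;=\; \arg\max_{W} \; I(V;S) , $$

the maximization being carried out at fixed noise levels and fixed resources.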

Barlow's proposal was qualitatively different. According to him, it is not only the preservation of information that matters, but, more importantly, the way the information is presented: the neural code should be easily readable by the stages downstream. This implies a compression of information (one should take advantage of the regularities in the stimuli, coding only what makes each stimulus unique), and the search for a code in which each neuron codes for features statistically independent of those coded by the other neurons. These aspects are subsumed in the notion of "redundancy reduction", and the optimal code achieving redundancy reduction is a factorial code.
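
A convenient way of quantifying Barlow's criterion (a standard measure, written here in our notation) is the difference between the sum of the individual output entropies and the joint output entropy,

    $$ R \;=\; \sum_{i=1}^{N} H(V_i) \;-\; H(V_1,\dots,V_N) \;\ge\; 0 , $$

which vanishes if and only if the code is factorial, that is

    $$ P(v_1,\dots,v_N) \;=\; \prod_{i=1}^{N} P(v_i) : $$

each neuron then conveys information statistically independent of that conveyed by the others.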

The most detailed analytical studies have been performed for simple feedforward linear networks, with Gaussian input distributions, both for the infomax principle (Linsker 1988, van Hateren 1992) and for various implementations of the redundancy reduction principle (see e.g. Barlow et al. 1989, Atick 1992, Redlich 1993, Li and Atick 1994). Some of these works include a model of the neural code in V1 which takes into account contrast, color and motion sensitivities, as well as stereo vision, in a multiscale representation (Li and Atick 1994). The predictions of these calculations are in qualitative agreement with known facts about the receptive fields (RFs) of ganglion and V1 cells, and in some cases in quantitative agreement with contrast sensitivity curves obtained in psychophysical experiments. From the theoretical point of view, what is striking is the extreme similarity between the predictions derived from these various criteria: clearly one cannot claim to have identified a basic organizing principle if other criteria lead to almost identical results! One may think that this similarity is due to the use of linear processing on a Gaussian distribution. In fact, one can see that any "reasonable" criterion will lead to a principal component analysis, with details depending on the particular constraints under which the optimization is performed. Still, the linear-Gaussian system remains quite interesting. One can show (Del Giudice et al. 1995) that the maximization of mutual information, with given additive input and output noises, leads to the following features:

  1. there exists a large family of equivalent solutions (each solution being characterized by a particular choice of synaptic couplings, i.e. of RFs); this freedom makes it possible to take various constraints into account if needed;
  2. with any of these solutions, the optimal processing performed by the network on the input signal amounts to two steps: first, a redundancy reduction, extracting the m largest principal components (m depending on the noise levels and on the constraints); second, a redundancy increase, allocating a specific amount of resources to each component (e.g. several neurons), again according to the noise levels and to the constraints.
It is interesting to note that one may choose a compact solution with exactly m output neurons, or distributed solutions with any number (at least m) of output neurons. This may be relevant for understanding the huge increase in the number of cells from the LGN to V1.
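
For the linear-Gaussian setting just described, the quantity being maximized can be written down explicitly (a standard Gaussian-channel formula, given here in a generic notation that need not coincide with that of the papers cited). If the output is $V = W(S+\nu) + \xi$, with $S$ Gaussian of covariance $C$, white input noise $\nu$ of variance $b^2$ and white output noise $\xi$ of variance $b_0^2$, then

    $$ I(V;S) \;=\; \frac{1}{2}\,\log\, \frac{\det\!\left( W (C + b^2 \mathbf{1}) W^{T} + b_0^2 \mathbf{1} \right)} {\det\!\left( b^2\, W W^{T} + b_0^2 \mathbf{1} \right)} . $$

Diagonalizing $C$ makes the two-step structure listed above apparent: the optimum retains the leading principal components of the input and then distributes the available output resources among them.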

However, as soon as one takes into account any nonlinear aspect of the processing (in particular the saturation of the transfer functions), one readily finds (Linsker 1988) that the large freedom in the choice of the solution disappears. It is thus quite important to understand the role of the nonlinearities. Recently we have considered simple feedforward networks with arbitrary nonlinear transfer functions and arbitrary input (signal) distributions. We have shown (Nadal and Parga 1994) that, in the low-noise limit, the maximization of mutual information, if performed over both the synaptic couplings and the choice of the transfer functions, leads to a factorial code - hence to redundancy reduction à la Barlow! Interestingly, this result can be related to work in signal processing on blind source separation (which is decorrelation in the time domain) (see e.g. Comon 1994). In particular, it implies that the mutual information can be used as a cost function for performing blind source separation (Nadal and Parga 1994). This has been turned into algorithms showing promising performance on particular applications (Bell and Sejnowski 1995).
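
The mechanism behind this low-noise result can be sketched in a few lines (a schematic argument in our notation; the complete derivation is given in Nadal and Parga 1994). With bounded outputs, say $V_i \in [0,1]$, and additive output noise, one has $I(V;S) = H(V) - H(V \mid S)$, where in the low-noise limit $H(V \mid S)$ reduces to the entropy of the noise and no longer depends on the couplings or transfer functions; maximizing the mutual information then amounts to maximizing the output entropy $H(V)$. Since

    $$ H(V) \;\le\; \sum_{i=1}^{N} H(V_i) \;\le\; 0 , $$

with equality on the left if and only if the $V_i$ are statistically independent, and on the right if and only if each $V_i$ is uniform on $[0,1]$, the optimum is a factorial code with uniform marginals; for a single unit this is the familiar statement that the optimal transfer function is the cumulative distribution function of its input.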

One should emphasize that our result - factorization as part of the optimal solution when information maximization is performed - also holds when arbitrary, possibly nonlinear, processing takes place before the output neurons. This means that, for any input distribution that is a possibly nonlinear mixture of independent spatio-temporal signals, there exists a network, possibly linear, possibly nonlinear with one or more layers, whose output conveys as much information as possible by providing a factorial representation. This representation, although not unique, is much less degenerate than in the linear-Gaussian case, and it reflects the statistical structure of the input data much more directly than with a Gaussian input. Moreover, this decorrelation may be obtained at any "low" or "high" level of processing (one may ask for factorization at, say, the fourth layer, which thus tells us nothing about the neural codes in the first layers). To conclude, we have shown that, at least in the low-noise limit, maximization of information implies factorization. However, understanding the possible consequences of the infomax principle requires a much better understanding of the statistical structure of the signal (in the case of the visual system, one may ask about the structure at the "pixel" level, or maybe at the "object" level, as suggested by J.-M. Morel at this meeting). It also remains to understand what subsists when one considers stochastic, say spiking, neurons.
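
As an illustration of how the infomax criterion can be used for blind source separation, here is a minimal numerical sketch in the spirit of the algorithms mentioned above (Bell and Sejnowski 1995). It is only an illustration under simplifying assumptions of ours: a linear mixture of two super-Gaussian sources, sigmoidal output units, and a stochastic gradient ascent on the output entropy written in its "natural gradient" form; all parameter values are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two independent super-Gaussian (Laplacian) sources, linearly mixed.
    n, T = 2, 20000
    S = rng.laplace(size=(n, T))
    A = np.array([[1.0, 0.6],
                  [0.4, 1.0]])            # mixing matrix, unknown to the network
    X = A @ S                             # observed signals

    W = np.eye(n)                         # unmixing (synaptic) matrix, to be learned
    lr, batch = 0.01, 100
    for epoch in range(200):
        for t in range(0, T, batch):
            x = X[:, t:t + batch]
            u = W @ x                     # net inputs
            y = 1.0 / (1.0 + np.exp(-u))  # sigmoidal outputs
            # ascent on the output entropy, "natural gradient" form:
            # dW ~ (I + (1 - 2 y) u^T / batch) W
            W += lr * (np.eye(n) + (1.0 - 2.0 * y) @ u.T / batch) @ W

    # If the outputs have become independent, W @ A should be close to a
    # permutation of a diagonal matrix (each source recovered up to scale and sign).
    print(np.round(W @ A, 2))

When the procedure succeeds, each output codes for one of the original sources, up to a permutation, a rescaling and possibly a sign flip: the representation obtained is factorial.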

References

Atick J. J. Could information theory provide an ecological theory of sensory processing? Network: Computation in Neural Systems, 3:213-251, 1992.

Barlow H. B. The coding of sensory messages. In W. H. Thorpe and O. L. Zangwill, editors, Current Problems in Animal Behaviour, pages 331-360. Cambridge University Press, 1960.

Barlow H. B., Kaushal T. P., and Mitchison G. J. Finding minimum entropy codes. Neural Computation, 1:412-423, 1989.

Bell A. and Sejnowski T. An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.

Comon P. Independent component analysis, a new concept? Signal Processing, 36:287-314, 1994.

Del Giudice P., Campa A., Parga N., and Nadal J.-P. Maximization of mutual information in a linear network: a detailed study. Network: Computation in Neural Systems, 6:449-468, 1995.

van Hateren J. H. Theoretical predictions of spatio-temporal receptive fields of fly LMCs, and experimental validation. J. Comp. Physiology A, 171:157-170, 1992.

Li Z. and Atick J. J. Efficient stereo coding in the multiscale representation. Network: Computation in Neural Systems, 5:1-18, 1994.

Linsker R. Self-organization in a perceptual network. Computer, 21:105-117, 1988.

Nadal J.-P. and Parga N. Nonlinear neurons in the low noise limit: a factorial code maximizes information transfer. Network: Computation in Neural Systems, 5:565-581, 1994.

Redlich A. N. Redundancy reduction as a strategy for unsupervised learning. Neural Computation, 5:289-304, 1993.



(c) Jean-Pierre Nadal and Nestor Parga