Maximally informative “stimulus energies” in the analysis of neural responses to natural signals

Kanaka Rajan and William Bialek
Joseph Henry Laboratories of Physics and Lewis–Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544
(Dated: January 4, 2012)
arXiv:1201.0321v1 [q-bio.NC] 1 Jan 2012

The concept of feature selectivity in sensory signal processing can be formalized as dimensionality reduction: in a stimulus space of very high dimensions, neurons respond only to variations within some smaller, relevant subspace. But if neural responses exhibit invariances, then the relevant subspace typically cannot be reached by a Euclidean projection of the original stimulus. We argue that, in several cases, we can make progress by appealing to the simplest nonlinear construction, identifying the relevant variables as quadratic forms, or “stimulus energies.” Natural examples include non–phase–locked cells in the auditory system, complex cells in visual cortex, and motion–sensitive neurons in the visual system. Generalizing the idea of maximally informative dimensions, we show that one can search for the kernels of the relevant quadratic forms by maximizing the mutual information between the stimulus energy and the arrival times of action potentials. Simple implementations of this idea successfully recover the underlying properties of model neurons even when the number of parameters in the kernel is comparable to the number of action potentials, and stimuli are completely natural. We explore several generalizations that allow us to incorporate plausible structure into the kernel and thereby restrict the number of parameters. We hope that this approach will add significantly to the set of tools available for the analysis of neural responses to complex, naturalistic stimuli.

I. INTRODUCTION

A central concept in neuroscience is feature selectivity: as our senses are bombarded by complex, dynamic inputs, individual neurons respond to specific, identifiable components of these data [1, 2]. Neurons early in a processing pathway are thought to be sensitive to simpler features [3, 4], and one can think of subsequent stages of processing as computing conjunctions of these features, so that neurons later in the pathway respond to more complex structures in the sensory world [5]. A major challenge for theory is to make this intuition mathematically precise, and to use such a precise formulation to build tools that allow us to analyze real neurons as they respond to naturalistic inputs. There is a long history of such work, but much of it rests on the identification of “features” with filters or templates. Filtering is a linear operation, and matching to a template can be thought of as a cascade of linear and nonlinear steps. As we will see, however, there are many examples of neural feature selectivity, well known from experiments on visual and auditory systems in many organisms, for which such a description in linear terms does not lead to much simplification.

In this paper we use examples to motivate the simplest nonlinear definition of a feature, in which the relevant variable is a quadratic form in stimulus space. Because the resulting variable is connected to the “energy in frequency bands” for auditory signals, we refer to these quadratic forms as “stimulus energies.” To be useful, we have to be able to identify these structures in experiments where neurons are driven by complex, naturalistic inputs. We show that, generalizing the idea of maximally informative dimensions [6], we can find the maximally informative stimulus energies using methods that don’t require special assumptions about the structure of the input stimulus ensemble. We illustrate these ideas on model neurons, and explore the amount of data that will be needed to use these methods in the analysis of real neurons.

II. MOTIVATION

To motivate the problems that we address, let us start by thinking about an example from the auditory system. This starting point is faithful to the history of our subject, since modern approaches for estimating receptive fields and filters have their origins in the classic work of de Boer and coworkers on the “reverse correlation” method [7], which was aimed at separating the filtering of acoustic signals by the inner ear from the nonlinearities of spike generation in primary auditory neurons.
We will see that mathematically identical problems arise in thinking about complex cells in visual cortex, motion–sensitive neurons throughout the visual pathway, and presumably in other problems as well.

We begin with the simplest model of an auditory neuron. If the sound pressure as a function of time is s(t), it is plausible that the activity of a neuron is controlled by some filtered version of this stimulus, so that the probability per unit time of generating a spike is

r(t) = r_0 \, g\!\left[ \int d\tau \, f(\tau) \, s(t-\tau) \right],   (1)

where f(τ) is the relevant temporal filter and g[·] is a nonlinearity; the spikes occur at times t_i. The statement that neurons are tuned is that if we look at the filter in Fourier space,

\tilde{f}(\omega) = \int dt \, f(t) \, e^{i\omega t},   (2)

then the magnitude of the filter response, |f̃(ω)|, has a relatively sharp peak near some characteristic frequency ω_c. If we choose the stimulus waveforms from a Gaussian white noise ensemble, then the key result of reverse correlation is that if we compute the average stimulus in the neighborhood of a spike, we will recover the underlying filter, independent of the nonlinearity,

\langle s(t_i - t) \rangle \propto f(-t).   (3)

We emphasize that this is a theorem, not a heuristic data analysis method. If the conditions of the theorem are met, then this analysis is guaranteed to give the right answer in the limit of large amounts of data. If the conditions of the theorem are not met, then the spike–triggered average stimulus need not correspond to any particular characteristic of the neuron.
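The reverse–correlation recipe of Eq. (3) is easy to state in code. The sketch below is our own illustration, not code from the paper; the filter shape, the logistic form of g[·], and all parameter values are arbitrary choices. It drives the linear–nonlinear model of Eq. (1) with Gaussian white noise and recovers the filter as the spike–triggered average.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200                                   # filter length in samples (illustrative)
tax = np.arange(T) / 20000.0              # time axis, dt = 1/20000 s

# An illustrative bandpass filter f(tau), similar in shape to Fig. 1b below.
f = tax * np.sin(2 * np.pi * 1e3 * tax) * np.exp(-tax / 3e-3)
f /= np.linalg.norm(f)

# Gaussian white noise stimulus and the filtered drive of Eq. (1).
s = rng.standard_normal(500_000)
drive = np.convolve(s, f)[: len(s)]

# A saturating nonlinearity g[.], with Bernoulli spiking in each time bin.
p_spike = 0.1 / (1.0 + np.exp(-4.0 * (drive - 1.0)))
spikes = rng.random(len(s)) < p_spike

# Spike-triggered average: by Eq. (3) it is proportional to the filter.
idx = np.flatnonzero(spikes)
idx = idx[idx >= T]
sta = np.mean([s[i - T + 1: i + 1][::-1] for i in idx], axis=0)

overlap = sta @ f / np.linalg.norm(sta)
print(f"{len(idx)} spikes, STA/filter overlap = {overlap:.3f}")
```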
Eq. (1) is an example of dimensionality reduction. In principle, the neuron’s response at time t can be determined by the entire history of the stimulus for times t′ ≤ t. Let us suppose that we sample (and generate) the stimulus in discrete time steps spaced by dt. Then the stimulus history is a list of numbers

\mathbf{s}_t \equiv \{ s(t), s(t-dt), s(t-2\,dt), \cdots, s(t-D\,dt) \},   (4)

where D is the effective stimulus dimensionality, set by D = T/dt, and T the longest plausible estimate of the integration time for the neural response. We can think of s_t as a D–dimensional vector. If we know that the neural response is controlled by a linearly filtered version of the sound pressure stimulus, even followed by an arbitrary nonlinearity, then only one direction in this D–dimensional space matters for the neuron. Further, this really is a “direction,” since we can write the response as the Euclidean projection of s onto one axis, or equivalently the dot product between s and a vector W,

r(t) = r_0 \, g(\mathbf{W} \cdot \mathbf{s}_t),   (5)

where

\mathbf{W} = dt \times \{ f(0), f(dt), f(2\,dt), \cdots, f(T) \}.   (6)

This explicit formulation in terms of dimensionality reduction suggests a natural generalization in which several dimensions, rather than just one, are relevant,

r(t) = r_0 \, g(\mathbf{W}_1 \cdot \mathbf{s}_t, \mathbf{W}_2 \cdot \mathbf{s}_t, \cdots, \mathbf{W}_K \cdot \mathbf{s}_t).   (7)

As long as we have K ≪ D, it still holds true that the neuron responds only to some limited set of stimulus dimensions, but this number is not as small as in the simplest model of a single filter.

Notice that if an auditory neuron responds according to Eq. (1), then it will exhibit “phase locking” to periodic stimuli. Specifically, if s(t) = A cos(ωt) and f̃(ω) = |f̃(ω)| e^{+iφ}, then r(t) = r_0 g[A |f̃(ω)| cos(ωt − φ)]. So long as there is a nonzero response to the stimulus, this response will be modulated at the stimulus frequency ω, and more generally if we plot the spiking probability versus time measured by the phase ψ = ωt of the stimulus oscillation, then the probability will vary with, or “lock” to, this phase.

While almost all auditory neurons are tuned, not all exhibit phase locking. We often summarize the behavior of tuned, non–phase–locked neurons by saying that they respond to the power in a given bandwidth or to the envelope of the signal at the output of a filter. The simplest model for such behavior, which has its roots in our understanding of hair cell responses [8–11], is to imagine that the output of a linear filter passes through a weak nonlinearity, then another filter. The second stage of filtering is low–pass, and will strongly attenuate any signals at or near the characteristic frequency ω_c. Then, to lowest order, the neuron’s response depends on

p(t) = \int d\tau \, f_2(\tau) \left[ \int dt' \, f_1(t - \tau - t') \, s(t') \right]^2,   (8)

where f_1 is the bandpass filter that determines the tuning of the neuron and f_2 is a smoothing filter which ensures that the cell responds to the power in its preferred frequency band while rejecting any temporal variation at or above the “carrier” frequency. The probability of spiking depends on this power p(t) through a nonlinearity, as before,

r(t) = r_0 \, g[p(t)].   (9)

Intuitively, this simple model for a non–phase–locked neuron also represents a substantial reduction in dimensionality – all that matters is the power passing through a given frequency band, defined by the filter f_1. On the other hand, we cannot collapse this model into the one dimensional form of Eq. (5). To be concrete, suppose that the filter f_1 has a relatively narrow bandwidth around its characteristic frequency ω_c. Then we can write

f_1(\tau) = A(\tau) \, \sin(\omega_c \tau + \phi),   (10)

where the amplitude A(τ) varies slowly compared to the period of the oscillation. Let us denote by τ_1 the temporal width of A(τ), since this corresponds to the time over which the filter f_1 integrates the stimulus, and similarly τ_2 will denote the temporal width of f_2(τ). To make sure that the power p(t) does not oscillate at (twice) the frequency ω_c, we must force τ_2 ≫ 2π/ω_c. But we still have two possibilities, (a) τ_1 ≫ τ_2 ≫ 2π/ω_c and (b) τ_2 ≫ τ_1 ≫ 2π/ω_c.
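The two–stage model of Eqs. (8–9) is compact enough to simulate in a few lines. The sketch below is our own, but it uses the filter shapes and constants quoted in Fig. 1 below (ω = 2π × 10³ Hz, τ_1 = 3 ms, τ_2 = 1 ms, dt = 1/20000 s), with g[·] taken to be a hard threshold:

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 1.0 / 20000.0
t = np.arange(0.0, 0.015, dt)             # 15 ms of filter support

# Bandpass filter f1 and low-pass smoothing filter f2, forms from Fig. 1.
f1 = t * np.sin(2 * np.pi * 1e3 * t) * np.exp(-t / 3e-3)
f2 = t * np.exp(-t / 1e-3)

s = rng.standard_normal(200_000)          # white noise stimulus

# Eq. (8): bandpass filter, square, then smooth away the carrier frequency.
band = np.convolve(s, f1)[: len(s)]
p = np.convolve(band ** 2, f2)[: len(s)]

# Eq. (9) with a threshold nonlinearity, set for 0.1 spikes per time step.
spikes = p > np.quantile(p, 0.9)
print(f"spike probability per bin: {spikes.mean():.2f}")
```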
If (a) is true, we can show that p(t) is the Pythagorean sum of the outputs of two filters that form a quadrature pair,

p(t) \approx \left[ \int dt' \, f_\alpha(t - t') \, s(t') \right]^2 + \left[ \int dt' \, f_\beta(t - t') \, s(t') \right]^2,   (11)

where

f_\alpha(\tau) \approx A(\tau) \, \sin(\omega_c \tau + \phi),   (12)
f_\beta(\tau) \approx A(\tau) \, \cos(\omega_c \tau + \phi).   (13)

On the other hand, if (b) is true, there is no simple decomposition, and the minimum number of dimensions that we need to describe this model is K ∼ τ_2/dt, which can be quite large.

FIG. 1. A non–phase–locked auditory neuron. (a) In this implementation a model neuron responds to a white noise stimulus. (b) The stimulus s is filtered through a temporal filter f_1, which has the form t sin(ωt) exp(−t/τ_1), where ω = 2π × 10³ Hz and τ_1 = 3 ms. (c) The output of f_1 is shown here. The filter f_1 is narrow band, therefore the output oscillates at the characteristic frequency even when the input is white. (d) The output of f_1 is first squared, and in (e), convolved with a second filter f_2 of the form t exp(−t/τ_2), with a smoothing time constant τ_2 = 1 ms. (f) The normalized signal is finally thresholded to generate spikes at a mean rate that allows the spiking probability to be 0.1 per time step. We assume that time runs discretely in steps of dt = 1/20000 s.

We can measure the number of relevant dimensions using the spike–triggered covariance matrix. Specifically, if the stimulus vector s_t has components s_i(t), then we can form the matrix

\Delta C_{ij} = \langle s_i(t) \, s_j(t) \rangle_{t = t_{\rm spike}} - \langle s_i(t) \, s_j(t) \rangle,   (14)

where in the first term we average over the arrival times of the spikes and in the second term we average over all time. If the spiking probability behaves as in Eq. (7), and we choose the stimulus from a Gaussian ensemble, then ∆C has exactly K nonzero eigenvalues [12, 13].
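The covariance test of Eq. (14) is a few lines of linear algebra. Here is a self–contained sketch (our own construction, with an arbitrary quadrature pair standing in for the f_α and f_β of Eqs. (12–13)); for a Gaussian stimulus ensemble it yields two dominant eigenvalues, as in Fig. 2b:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 100                                   # samples in the stimulus window
t = np.arange(D)

# An illustrative quadrature pair under a Gaussian envelope, cf. Eqs. (12-13).
env = np.exp(-((t - 50.0) ** 2) / (2.0 * 12.0 ** 2))
fa, fb = env * np.sin(0.5 * t), env * np.cos(0.5 * t)

# Gaussian stimuli; the cell spikes when the quadrature energy is large.
S = rng.standard_normal((100_000, D))
p = (S @ fa) ** 2 + (S @ fb) ** 2         # Eq. (11)
spk = p > np.quantile(p, 0.9)

# Eq. (14): spike-triggered minus prior covariance, then diagonalize.
dC = np.cov(S[spk].T) - np.cov(S.T)
eigs = np.sort(np.abs(np.linalg.eigvalsh(dC)))[::-1]
print("leading |eigenvalues|:", np.round(eigs[:4], 2))
```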
In Fig. 1 we schematize the model auditory neuron we have been describing, and in Fig. 2 we show the spike–triggered covariance analysis for models in the two limits, τ_2 ≪ τ_1 and τ_2 ≫ τ_1. Indeed we find that in the first case there are just two relevant dimensions, a quadrature pair of filters, whereas in the second case there are many relevant dimensions; these dimensions appear as temporally shifted and orthogonalized copies of the filter f_1(t).

We can think of a neuron that does not phase lock as having an invariance: it responds to acoustic waveforms that have energy in a relevant bandwidth near ω_c, but it doesn’t discriminate among signals that are shifted by small times. This invariance means that the cell is not just sensitive to one dimension of the stimulus, but to many, although these different dimensions correspond, in effect, to the same stimulus feature occurring at different times relative to the spike. Thus, we have a conflict between the notion of a “single feature” and the mathematical description of a “single dimension” via linear projection. The challenge is to provide a mathematical formulation that better captures our intuition. Before presenting a possible solution, let’s see how the same problem arises in other cases.

FIG. 2. Covariance analysis of the non–phase–locked auditory neuron. (a) If the time constant of the smoothing filter f_2 is much shorter than that of the filter f_1, the spike–triggered covariance matrix has a relatively simple structure. (b) Diagonalizing this covariance matrix yields 2 leading eigenvalues (the rest remain close to 0). (c) The eigenvectors corresponding to the 2 nonzero eigenvalues are the reconstructed filters plotted here. These form a quadrature pair, as shown in the inset. (d) If the smoothing time of the filter f_2 is larger than that of the first, the covariance matrix has a much richer structure. (e) The spike–triggered covariance matrix decomposes into multiple nonzero eigenvalues. (f) The eigenvectors corresponding to the nonzero eigenvalues give multiple time–shifted copies of the same filter.

Since the classical work of Hubel and Wiesel [14, 15], we know that cells in the primary visual cortex can be classified as “simple” and “complex.” Although Hubel and Wiesel did not give a mathematical description of their data, in subsequent work simple cells often have been described in the same way that we described the simplest auditory neuron in Eq. (1) [16]. If the light intensity falling on the retina varies in space (x⃗) and time (t) as I(x⃗, t), we can define a spatiotemporal receptive field F(x⃗, τ) and approximate the probability that a simple cell generates a spike per unit time as

r(t) = r_0 \, g\!\left[ \int d^2x \int d\tau \, F(\vec{x}, \tau) \, I(\vec{x}, t - \tau) \right].   (15)

If, as before, we assume that the stimulus is generated in discrete time steps (movie frames) with spacing dt, and that the stimulus influences spikes only within some time window of duration T, then we can think of the stimulus at any moment in time as being the T/dt frames of the movie preceding that moment,

\mathbf{s}_t \equiv \{ I(\vec{x}, t), I(\vec{x}, t - dt), I(\vec{x}, t - 2\,dt), \cdots, I(\vec{x}, t - T) \}.   (16)

If the relevant region of space is within d × d pixels, then this stimulus vector lives in a space of dimension D = d²T/(dt), which can be enormous. As in the discussion above, Eq. (15) is a restatement of the hypothesis that only one direction in this space is relevant for determining the probability that the simple cell generates a spike, and “direction” is once again a Euclidean or linear projection.

For a complex cell, on the other hand, this single projection is inadequate. Complex cells respond primarily to oriented edges and gratings, as do simple cells, but they have a degree of spatial invariance which means that their receptive fields cannot be mapped onto fixed zones of excitation and inhibition. Instead, they respond to patterns of light in a certain orientation within a large receptive field, regardless of precise location, or to movement in a certain direction.
Corresponding to this intuition, analysis of complex cells using the spike–triggered covariance method shows that there is more than one relevant dimension [17]. As with non–phase–locked auditory neurons, what defeats the simplest version of dimensionality reduction in complex cells is the invariance of the response, in this case, invariance to small spatial displacements of the relevant, oriented stimulus feature.

The simplest model of a complex cell is precisely analogous to the quadrature pair of filters that emerge in the analysis of non–phase–locked auditory neurons. To be concrete, let us imagine that receptive fields are described by Gabor patches. The position x⃗ includes two orthogonal coordinates in visual space, which we call x_1 and x_2. Gabor patches have an oscillatory dependence on one of these coordinates, but simply integrate along the other; both the integration and the envelope of the oscillations are described by Gaussians, so that

F_1(\vec{x}, \tau) = \cos(k x_1) \exp\!\left[ -\left( \frac{x_1^2}{2\sigma_1^2} + \frac{x_2^2}{2\sigma_2^2} \right) \right] f(\tau),   (17)

and the quadrature filter is then

F_2(\vec{x}, \tau) = \sin(k x_1) \exp\!\left[ -\left( \frac{x_1^2}{2\sigma_1^2} + \frac{x_2^2}{2\sigma_2^2} \right) \right] f(\tau).   (18)

Each of these filters is maximally sensitive to extended features oriented along the x_2 direction, but the optimal patterns of light and dark are shifted for the two filters; for simplicity we have assumed that spatial and temporal filtering is separable. If we form the energy–like quantity

p(t) = \left[ \int d^2x \int d\tau \, F_1(\vec{x}, \tau) \, I(\vec{x}, t - \tau) \right]^2 + \left[ \int d^2x \int d\tau \, F_2(\vec{x}, \tau) \, I(\vec{x}, t - \tau) \right]^2,   (19)

we have a measure of response to oriented features independent of their precise position, and this provides a starting point for the analysis of a complex cell.

One more example of the problem we face is provided by the computation of motion in the visual system. There is a classical model for this computation, the correlator model, that grew out of the experiments by Hassenstein and Reichardt on behavioral responses to visual motion in beetles and flies [19]. Briefly, the idea behind this model is that if something is moving at velocity v, then the image intensity I(x, t) must be correlated with the intensity I(x + ∆, t + τ), where τ = ∆/v. Then we can detect motion by computing this correlation and averaging over some window in space and time,

C_{\Delta,\tau}(t) = \int dx \, W(x) \int dt' \, f(t - t') \, I(x, t') \, I(x + \Delta, t' + \tau),   (20)

where, for simplicity, we think just about a single spatial dimension. In principle, with just one value of the delay τ and one value of the displacement ∆, this correlation “detects” only motion at one velocity, v = ∆/τ, but we can easily generalize this computation to a weighted sum over different values of these spatiotemporal shifts,

C(t) = \int dx \, W(x) \int dt' \, f(t - t') \sum_\Delta \int d\tau \, F(\Delta, \tau) \, I(x, t') \, I(x + \Delta, t' + \tau).   (21)

Depending on the precise form of this weighting we can arrange for the correlation C to have a relatively smooth, graded dependence on the velocity.
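A toy version of the correlator of Eq. (20) makes the velocity selectivity explicit. The sketch below is our own (the drifting pattern, the uniform window W(x), and all parameter values are arbitrary choices); it shows that a single (∆, τ) pair responds only when ∆/τ matches the stimulus velocity:

```python
import numpy as np

rng = np.random.default_rng(3)
nx, nt, v = 64, 2000, 2                   # v pixels per frame (assumed drift)

# A rigidly drifting 1-D random pattern: I(x, t) = pattern(x - v t).
pattern = rng.standard_normal(nx)
I = np.stack([np.roll(pattern, v * k) for k in range(nt)], axis=1)

def correlator(I, delta, tau):
    """Eq. (20): <I(x,t') I(x+delta,t'+tau)> with uniform W(x), boxcar f."""
    shifted = np.roll(I, -delta, axis=0)  # I(x + delta, t), periodic in space
    return (I[:, :-tau] * shifted[:, tau:]).mean()

# The correlator is large only when delta/tau equals the true velocity v = 2.
for tau in (1, 2, 4):
    print(f"tau = {tau}: C = {correlator(I, delta=4, tau=tau):+.3f}")
```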
In the insect visual system it seems natural to think of the correlations in Eq. (21) as being computed from the outputs of individual photoreceptors, which are typically spaced ∼1° apart. In mammals, we can imagine computing a similar quantity, but we would use the outputs of larger retinal or cortical receptive fields [18]. We can also think about this computation in Fourier space. If we transform

I(x, t) = \int \frac{dk}{2\pi} \int \frac{d\omega}{2\pi} \, \tilde{I}(k, \omega) \, e^{ikx - i\omega t},   (22)

then we can think of C as integrating power or energy P(k, ω) = |Ĩ(k, ω)|² over some region in the k–ω plane; motion corresponds to having this power concentrated along the line ω = vk. Once again there is an invariance to this computation, since as with the non–phase–locked auditory neuron we are looking for power regardless of phase. More directly, in this case the brain is trying to compute a velocity, more or less independently of the absolute position of the moving objects. Even in an insect visual system, where computing C corresponds to correlating the filtered outputs of neighboring photoreceptors in the compound eye, this computation is repeated across an area containing many photoreceptors, and hence there is no way to collapse C down to a function of just one or two Euclidean projections in the stimulus space.

What do these examples – non–phase–locked auditory neurons, complex cells in the visual cortex, and visual motion detectors – have in common? In all three cases, the natural, simplest starting point is a model in which the brain computes not a linear projection of the stimulus onto a receptive field, but rather a quadratic form. More precisely, if stimuli are the vectors s ≡ {s_1, s_2, ···, s_D}, then Eqs. (8), (19), and (21) all correspond to computing a quantity

x = \mathbf{s}^{\rm T} \cdot Q \cdot \mathbf{s} \equiv \sum_{i,j=1}^{D} s_i \, Q_{ij} \, s_j.   (23)

Generalizing the use of terms such as “energy in a frequency band” and “motion energy,” we refer to x as a “stimulus energy.”

If the matrix Q is of low rank, this means that we can build x out of projections of the stimulus onto a correspondingly low dimensional Euclidean subspace, and we can try to recover the relevant structure using methods such as the spike–triggered covariance or maximally informative dimensions. Still, if Q is of rank 5 (for example), once we identify the five relevant dimensions we have to explore the nonlinear input/output relation of the neuron thoroughly enough to recognize whether these dimensions combine in a simple way, to recover our intuition that there is a single stimulus feature x to which the cell responds. This can be prohibitively difficult.
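In code, the stimulus energy of Eq. (23) is a single contraction, and the distinction between low–rank and full–rank kernels is just a matter of how Q is built. A minimal sketch (our own, with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(4)
D = 40
s = rng.standard_normal(D)

# A full-rank symmetric kernel, as in wide-field correlator models.
A = rng.standard_normal((D, D))
Q_full = (A + A.T) / 2.0

# A rank-5 kernel built from orthonormal vectors, Q = sum_i v_i v_i^T.
V = np.linalg.qr(rng.standard_normal((D, 5)))[0]
Q_low = V @ V.T

for name, Q in (("full rank", Q_full), ("rank 5", Q_low)):
    x = s @ Q @ s                         # stimulus energy, Eq. (23)
    print(f"{name:9s}: x = {x:+.3f}, rank(Q) = {np.linalg.matrix_rank(Q)}")
```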
In many cases, it is plausible that neurons could be responding just to one energy x, but the underlying matrix Q could be of full rank; a clear example is provided by correlator models for wide–field motion sensitive neurons, as in [12, 19]. But in this case there is no real “dimensionality reduction” associated with the mapping s → x, if all we know how to do is to search for linear projections or Euclidean subspaces. On the other hand, the mapping s → x really is a tremendous reduction in complexity, because the full stimulus s is described by D parameters, while x is just one number.

Suppose that the response of a neuron to complex stimuli can be described by saying that the probability of spiking depends solely on a single stimulus energy x, as in Eq. (23), so that

r(t) = r_0 \, g(\mathbf{s}_t^{\rm T} \cdot Q \cdot \mathbf{s}_t).   (24)

Our task becomes one of showing that we can recover the underlying matrix Q by analyzing the spike train in relation to the stimuli, without making any assumptions about the statistics of the stimuli.

As we were finishing this work, we became aware of two other recent efforts that point to the importance of quadratic forms in stimulus space. It is shown in [27] that a logistic dependence of the spike probability on a quadratic form is the maximum entropy, and hence least structured, model consistent with a measurement of the spike–triggered covariance matrix, and that this identification is correct without any assumptions about the distribution of inputs. As a practical matter, Ref. [27] emphasizes the case where the matrix kernel (corresponding to Q above) is of low rank, and explores examples in which the quadratic form provides an incremental improvement on linear dimensionality reduction. In a different direction, the work in [28] considers models in which the spike probability depends exponentially on a quadratic form, and spiking is explicitly a Poisson process. They show that this model, and some generalizations, lends itself to a Bayesian formulation, in which various simplifying structures (see Section IV) can be imposed through prior distributions on the possible Q, although it is not clear whether computationally tractable priors correspond to realistic models. In some limits, these approaches are equivalent to one another, and to the search for maximally informative stimulus energies that we propose here. As far as we can see, the maximally informative stimulus energy method always involves fewer assumptions, and can capture any of the structures postulated in the other methods.

III. CORE OF THE METHOD

If the probability of generating an action potential depends on the stimulus s, then observing the arrival of even a single spike provides information about the stimulus. Importantly, the data processing inequality [25] tells us that if we look not at the full stimulus but only at some limited or compressed description of the stimulus – a single feature, for example – we can only lose information. If the neuron really is sensitive to only one stimulus feature, however, it is possible to lose none of the available mutual information between spikes and stimuli by focusing on this one feature. This suggests a natural strategy for identifying relevant stimulus features, by searching for those which preserve the maximum amount of information [6]. Further, we can put the success of such a search on an absolute scale, by estimating more directly the information that spikes provide about the stimulus [20, 21], and asking what fraction of this information is captured by the best feature we could find.

A. Setting up the problem

To make this precise, we recall that the information about the stimulus s that is conveyed by the observation of a single spike can be written, following [20], as

I_{\rm spike} = \int d^D s \, P(\mathbf{s}|{\rm spike}) \log_2 \left[ \frac{P(\mathbf{s}|{\rm spike})}{P(\mathbf{s})} \right] \ {\rm bits},   (25)

where P(s) is the stimulus distribution and P(s|spike) is the distribution of stimuli given that a spike occurred at a particular time (the response conditional ensemble [22]). If we consider a mapping M : s → x, then we can also compute

I_{\rm spike}(M) = \int dx \, P(x|{\rm spike}) \log_2 \left[ \frac{P(x|{\rm spike})}{P(x)} \right],   (26)

knowing that for any mapping M, I_spike(M) ≤ I_spike. If we restrict ourselves to some class of mappings, then we can search for an M* which maximizes I_spike(M*), and see how close this is to the real value of I_spike.
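Because x is one number, I_spike(M) in Eq. (26) can be estimated by straightforward histogramming. The sketch below is our own (the bin count and the toy model are arbitrary choices); it also illustrates the inequality I_spike(M) ≤ I_spike: a quadratic feature matched to the model captures far more information than a random linear projection.

```python
import numpy as np

def info_per_spike(x, spikes, n_bins=25):
    """Binned estimate of Eq. (26), in bits, with equal-occupancy bins."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    p_x = np.histogram(x, bins=edges)[0] / len(x)
    p_xs = np.histogram(x[spikes], bins=edges)[0] / spikes.sum()
    ok = p_xs > 0
    return np.sum(p_xs[ok] * np.log2(p_xs[ok] / p_x[ok]))

rng = np.random.default_rng(5)
D, N = 10, 100_000
S = rng.standard_normal((N, D))
A = rng.standard_normal((D, D))
Q = (A + A.T) / 2.0
x = np.einsum("nd,de,ne->n", S, Q, S)     # stimulus energies, Eq. (23)
spikes = x > np.quantile(x, 0.9)          # threshold model neuron, Eq. (24)

print("I(x)   =", round(info_per_spike(x, spikes), 3), "bits")
print("I(w.s) =", round(info_per_spike(S @ rng.standard_normal(D), spikes), 3), "bits")
```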
If the inputs are Gaussian, then we can find M by standard spike–triggered or reverse correlation methods. Working with arbitrary (i.e., natural) stimulus ensembles complicates things. Further, if there is just one stimulus feature x, then computing I_spike(M) involves only probability distributions over a single variable, so there is no curse of dimensionality. Finally, in order for this to work, we need to make no assumptions about the form of the distribution of stimuli P(s), or the inherited distribution of stimulus features P(x). Thus, as first emphasized in [6], searching for maximally informative features provides a practical and principled way to analyze neural responses to high dimensional, naturalistic stimuli.

The work in [6] considered the case where the stimulus features are one or more linear projections of the stimulus, x_α = W_α · s, so that the mapping M is parameterized by the vectors {W_α}; in the simplest case there is just one vector W. Here we are interested in quadratic mappings, corresponding to the stimulus energies in Eq. (23). Now the mapping M is parameterized by a symmetric matrix Q. In principle, all the arguments of [6] for the linear case should generalize to the quadratic case, and we follow this path. Before starting, we note an obvious problem related to the number of parameters we are looking for. If we are searching for a vector W in a D–dimensional stimulus space, we are looking for an object described by D numbers. In fact, the length of the vector is irrelevant, so that maximizing I_spike(M) corresponds to optimization in a (D − 1)–dimensional space. But if we are searching for a symmetric matrix that acts on stimulus vectors, there are D(D+1)/2 − 1 free parameters. This is a problem both because we have to optimize in a space of much higher dimensionality, and because determining more parameters reliably must require larger data sets. We will address these problems shortly, but let’s start by following the path laid out in [6], which involves searching for the maximum of I_spike(M) by gradient ascent.

If our stimulus feature is the energy in Eq. (23), then the distribution of x is

P_Q(x) = \int d^D s \, \delta(x - \mathbf{s}^{\rm T} \cdot Q \cdot \mathbf{s}) \, P(\mathbf{s}),   (27)

where the subscript explicitly denotes that P(x) depends on Q. We take the derivative of this with respect to an element Q_ij of the matrix Q,

\frac{\partial P_Q(x)}{\partial Q_{ij}} = \int d^D s \, \frac{\partial}{\partial Q_{ij}} \delta(x - \mathbf{s}^{\rm T} \cdot Q \cdot \mathbf{s}) \, P(\mathbf{s})   (28)
= - \int d^D s \, s_i s_j \, \delta'(x - \mathbf{s}^{\rm T} \cdot Q \cdot \mathbf{s}) \, P(\mathbf{s}).   (29)

Similarly, we can differentiate the distribution of x conditional on a spike,

\frac{\partial P_Q(x|{\rm spike})}{\partial Q_{ij}} = - \int d^D s \, s_i s_j \, \delta'(x - \mathbf{s}^{\rm T} \cdot Q \cdot \mathbf{s}) \, P(\mathbf{s}|{\rm spike}).   (30)

Putting these terms together, we can differentiate the information:

\frac{\partial I_{\rm spike}(M)}{\partial Q_{ij}} = \int dx \left[ \frac{\delta I_{\rm spike}(M)}{\delta P_Q(x)} \frac{\partial P_Q(x)}{\partial Q_{ij}} + \frac{\delta I_{\rm spike}(M)}{\delta P_Q(x|{\rm spike})} \frac{\partial P_Q(x|{\rm spike})}{\partial Q_{ij}} \right]   (31)

= \frac{1}{\ln 2} \int dx \, \frac{P_Q(x|{\rm spike})}{P_Q(x)} \int d^D s \, s_i s_j \, \delta'(x - \mathbf{s}^{\rm T} \cdot Q \cdot \mathbf{s}) \, P(\mathbf{s})
\quad - \frac{1}{\ln 2} \int dx \left( 1 + \ln \left[ \frac{P_Q(x|{\rm spike})}{P_Q(x)} \right] \right) \int d^D s \, s_i s_j \, \delta'(x - \mathbf{s}^{\rm T} \cdot Q \cdot \mathbf{s}) \, P(\mathbf{s}|{\rm spike})   (32)

= - \frac{1}{\ln 2} \int dx \int d^D s \, s_i s_j \, \delta(x - \mathbf{s}^{\rm T} \cdot Q \cdot \mathbf{s}) \, P(\mathbf{s}) \, \frac{d}{dx} \left[ \frac{P_Q(x|{\rm spike})}{P_Q(x)} \right]
\quad + \frac{1}{\ln 2} \int dx \, \frac{P_Q(x)}{P_Q(x|{\rm spike})} \int d^D s \, s_i s_j \, \delta(x - \mathbf{s}^{\rm T} \cdot Q \cdot \mathbf{s}) \, P(\mathbf{s}|{\rm spike}) \, \frac{d}{dx} \left[ \frac{P_Q(x|{\rm spike})}{P_Q(x)} \right].
   (33)

But we notice that

\int d^D s \, s_i s_j \, \delta(x - \mathbf{s}^{\rm T} \cdot Q \cdot \mathbf{s}) \, P(\mathbf{s}) = \langle s_i s_j | x \rangle \, P_Q(x),   (34)

where ⟨s_i s_j | x⟩ is the expectation value of s_i s_j conditional on the value of the stimulus energy x, and similarly

\int d^D s \, s_i s_j \, \delta(x - \mathbf{s}^{\rm T} \cdot Q \cdot \mathbf{s}) \, P(\mathbf{s}|{\rm spike}) = \langle s_i s_j | x, {\rm spike} \rangle \, P_Q(x|{\rm spike}),   (35)

where ⟨s_i s_j | x, spike⟩ is the expectation value conditional on the energy x and the occurrence of a spike. We can combine these terms to give

\frac{\partial I_{\rm spike}(M)}{\partial Q_{ij}} = \int dx \, P_Q(x) \left[ \langle s_i s_j | x, {\rm spike} \rangle - \langle s_i s_j | x \rangle \right] \frac{d}{dx} \left[ \frac{P_Q(x|{\rm spike})}{P_Q(x)} \right],   (36)

or, more compactly,

\nabla_Q I = \int dx \, P_Q(x) \left[ \langle \mathbf{s}\mathbf{s}^{\rm T} | x, {\rm spike} \rangle - \langle \mathbf{s}\mathbf{s}^{\rm T} | x \rangle \right] \frac{d}{dx} \left[ \frac{P_Q(x|{\rm spike})}{P_Q(x)} \right].   (37)

To learn the maximally informative energy, or the best choice of the matrix Q, we can ascend the gradient in successive learning steps,

Q \to Q + \gamma \, \nabla_Q I, \quad {\rm where\ } \gamma {\rm \ is\ small}.   (38)

B. Multiple matrices

In the same way that the idea of linear projection can be generalized to have the probability of spiking depend on multiple linear projections, we can generalize to the case where there are multiple relevant stimulus energies. Perhaps the simplest example, in which Eq. (24) is generalized to two matrices, is the computation of a (regularized) ratio between two stimulus energies, so that the probability of spiking varies as

r = r_0 \, g\!\left( \frac{\mathbf{s}^{\rm T} \cdot Q_1 \cdot \mathbf{s}}{1 + \mathbf{s}^{\rm T} \cdot Q_2 \cdot \mathbf{s}} \right).   (39)

Some biological examples of this formulation include gain control or normalization in V1 [17], optimal estimation theory of motion detection in visual neurons of insects [12], and complex spectrotemporal receptive fields of neurons responsible for song processing in songbirds [23, 24]. The inference task becomes one of estimating both matrices Q_1 and Q_2 by information maximization. As before, we can compute the gradient, noting that this time there are two different gradients of I_spike(M),

\nabla_{Q_1} I = \int dx \, P_{Q_1}(x) \left[ \left\langle \frac{\mathbf{s}\mathbf{s}^{\rm T}}{1 + \mathbf{s}^{\rm T} \cdot Q_2 \cdot \mathbf{s}} \Big| x, {\rm spike} \right\rangle - \left\langle \frac{\mathbf{s}\mathbf{s}^{\rm T}}{1 + \mathbf{s}^{\rm T} \cdot Q_2 \cdot \mathbf{s}} \Big| x \right\rangle \right] \frac{d}{dx} \left[ \frac{P_{Q_1}(x|{\rm spike})}{P_{Q_1}(x)} \right]

and

\nabla_{Q_2} I = - \int dx \, P_{Q_2}(x) \left[ \left\langle \frac{(\mathbf{s}^{\rm T} \cdot Q_1 \cdot \mathbf{s}) \, \mathbf{s}\mathbf{s}^{\rm T}}{(1 + \mathbf{s}^{\rm T} \cdot Q_2 \cdot \mathbf{s})^2} \Big| x, {\rm spike} \right\rangle - \left\langle \frac{(\mathbf{s}^{\rm T} \cdot Q_1 \cdot \mathbf{s}) \, \mathbf{s}\mathbf{s}^{\rm T}}{(1 + \mathbf{s}^{\rm T} \cdot Q_2 \cdot \mathbf{s})^2} \Big| x \right\rangle \right] \frac{d}{dx} \left[ \frac{P_{Q_2}(x|{\rm spike})}{P_{Q_2}(x)} \right].   (40)

Analogous to Eq. (38), at every learning step we update each matrix Q_i by the appropriate gradient,

Q_i \to Q_i + \gamma \, \nabla_{Q_i} I,   (41)

where i = 1, 2 for the x in Eq. (39). In principle the formalism in Eq. (39) could yield a more complete description of a neuron’s nonlinear response properties compared to a single quadratic kernel convolving the stimulus, but there are some data–requirement challenges which we will address later.

C. Technical aspects of optimization

In order to implement Eq. (38) as an algorithm, we have to evaluate all the relevant probability distributions and integrals. In practice, this means computing x for all stimuli, choosing an appropriate binning along the x–axis, and sampling the binned versions of the spike–triggered and prior distributions. We compute the expectation values ⟨ss^T⟩ separately for each bin, approximate the integrals as sums over the bins, and derivatives as differences between neighboring bins. To deal with local extrema in the objective function, we use a large starting value of γ and gradually decrease γ during learning. This basic prescription can be made more sophisticated, but we do not report these technical improvements here.
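The following sketch is our own rendering of this recipe for Eqs. (37–38): compute x for all stimuli, bin, form ⟨ss^T⟩ in each bin, difference the ratio of distributions between neighboring bins, and take an annealed step along the gradient. Gaussian stimuli stand in for the natural patches of Fig. 3, and the normalized step size and bin count are illustrative choices rather than the authors’ exact settings.

```python
import numpy as np

def energies(S, Q):
    return np.einsum("nd,de,ne->n", S, Q, S)          # Eq. (23) for every stimulus

def grad_info(S, spikes, Q, n_bins=20):
    """Binned estimate of the information gradient, Eq. (37)."""
    x = energies(S, Q)
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)

    p_x = np.bincount(idx, minlength=n_bins) / len(x)             # P_Q(x)
    p_xs = np.bincount(idx[spikes], minlength=n_bins) / spikes.sum()
    ratio = p_xs / np.maximum(p_x, 1e-12)             # P_Q(x|spike) / P_Q(x)
    centers = (edges[:-1] + edges[1:]) / 2.0
    dratio = np.gradient(ratio, centers)              # d/dx of the ratio

    grad = np.zeros((S.shape[1], S.shape[1]))
    for b in range(n_bins):
        Sb, Sbs = S[idx == b], S[(idx == b) & spikes]
        if len(Sb) == 0 or len(Sbs) == 0:
            continue                                  # crude choice: skip empty bins
        diff = Sbs.T @ Sbs / len(Sbs) - Sb.T @ Sb / len(Sb)  # <ssT|x,spike> - <ssT|x>
        grad += p_x[b] * dratio[b] * diff
    return grad

# Model neuron as in Fig. 3: random symmetric kernel K, threshold on s.K.s.
rng = np.random.default_rng(6)
D, N = 10, 100_000
S = rng.standard_normal((N, D))
A = rng.standard_normal((D, D))
K = (A + A.T) / 2.0
xK = energies(S, K)
spikes = xK > np.quantile(xK, 0.9)

B = rng.standard_normal((D, D))
Q = (B + B.T) / 2.0                                   # random initial guess
for step in range(100):
    gamma = 0.5 * 0.1 ** (step / 99.0)                # anneal 0.5 down to 0.05
    G = grad_info(S, spikes, Q)
    Q += gamma * G / max(np.abs(G).max(), 1e-12)      # Eq. (38), normalized step

align = abs(np.sum(K * Q)) / np.sqrt(np.sum(K * K) * np.sum(Q * Q))
print(f"kernel alignment |<K,Q>| after ascent: {align:.3f}")
```

Since the information in Eq. (26) is unchanged by an overall rescaling of Q, the recovered kernel can only be compared to K up to a multiplicative constant, which is why this sketch reports a normalized overlap rather than a raw distance.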
An example of these ideas is shown in Fig. 3. We used very small patches from natural images as inputs, reshaping the intensities in nearby pixels into a D–component stimulus vector s, where D = 10. To describe the neuron we chose a random symmetric matrix K to use as the kernel, and generated spikes when the stimulus energy, s^T·K·s, crossed a threshold, as illustrated in Fig. 3a. We fixed the mean rate of the spiking response such that the probability of a spike occurring in one bin of duration dt is 0.1, and we generated ∼1000 spikes. We then tried to extract the neuron’s receptive field by starting with a random initial matrix Q, and following the gradient of mutual information, as in Eq. (38).

We let the one parameter of the algorithm, γ, gradually decrease from a starting value of 0.5 to 0.05, in order to minimize the fluctuations around the true maximum of the information. Mutual information, the red trace in Fig. 3b, peaks at the 40th learning step and remains unchanged after that. The 3 black dots in Fig. 3b correspond to the steps during the optimization when we extract and plot the initial guess, the intermediate and the optimal/maximally informative matrix Q.

FIG. 3. Core of the method. (a) A general implementation is shown here. The stimuli s are natural image clips: small pixel patches taken from a natural image database, as described in [26], and reshaped into D–component vectors. 1000 spikes are generated with a probability per time bin of 0.1 from the model neuron by thresholding the term s^T·K·s, where the 10×10 matrix K is the receptive field of the neuron. (b) Mutual information between the spiking response of the model neuron and the quadratic stimulus projection x is plotted as a function of the number of learning steps. Information, normalized by its value when K = Q, peaks at the 40th learning step and then plateaus. The 3 black dots on the trace denote the points at which we extract the initial, the intermediate and the optimal matrices. The maximally informative matrix Q reconstructed at the 40th step agrees well with K, indicating convergence. For this implementation the step size γ = 0.5 at the start and 0.05 at the end of the algorithm. (c) Root–mean–square (RMS) reconstruction error, calculated as ⟨(K − Q)²⟩^{1/2}, is plotted as a function of the number of learning steps. This error decreases steadily until either the randomly initialized matrix (solid line) or the matrix initialized to the spike–triggered covariance matrix (dashed line) matches K. If Q is initialized to the covariance matrix, the initial RMS error is smaller and the convergence is faster (30th learning step) compared to that for a randomly initialized Q. For this example, both K and Q are 10×10 matrices and the black dot on the solid trace is at the same learning step as in panel (b).
It is interesting to note that the intermediate matrix appears completely different from the optimal Q even though the corresponding mutual information is relatively close to its maximum (a similar observation was made in the context of maximally informative dimensions [6]).

In Fig. 3c the root–mean–square (RMS) reconstruction error ⟨(K − Q)²⟩^{1/2} is plotted as a function of the number of learning steps for a randomly initialized Q (solid line) and when Q is initialized to the spike–triggered covariance (STC) matrix (dashed line). RMS error at the start of the algorithm is ≈ 1 when the “true” matrix K and the initial guess for Q are symmetric, random matrices, uncorrelated with each other, but is slightly lower when Q is initialized to the STC. This difference becomes smaller as the stimulus dimensionality D increases or as the stimulus departs more strongly from Gaussianity. Both traces decrease, and stop changing once our estimate of the optimal Q matches K. This occurs at the 40th step for the randomly initialized Q and slightly sooner (30th step) when initialized to the STC. If we had fewer spikes, our estimate for the optimal Q could still match K adequately, but the actual RMS error in reconstruction might be higher. We explore such performance measures and data requirement issues next.

D. Performance measures and data requirements

We assume that spiking responses of the system are less frequent than periods of silence, and therefore the accuracy of our reconstruction should depend on the number of spikes N_spikes produced by the model (or recorded in the experimental context), independent of the actual threshold. Further, we expect that when the stimulus dimensionality D increases, the fact that more parameters are needed to describe the kernel K means that performance will deteriorate unless we have access to more data. To address these issues we explore model neurons as in Fig. 3, but systematically vary D and N_spikes. Again we consider the most difficult case, with naturalistic stimuli and kernels that are random symmetric matrices. The results are shown in Fig. 4.

Our intuition is that the number of spikes should scale with the number of free parameters in the model, D(D+1)/2, so we always start with N_spikes = D(D+1)/2. If we normalize the reconstruction errors by this initial error, and scale N_spikes by the number of free parameters, the data collapse, as shown in Fig. 4b. Evidently, accurate reconstruction of very large matrices requires very many spikes. This suggests that it will be important to have some more constrained models, which we now explore.

FIG. 4. Data requirement and performance issues. (a) Reconstruction error ⟨(K − Q)²⟩^{1/2} is plotted as a function of the number of spikes (N_spikes) for matrices corresponding to stimuli of increasing dimensionality (D = 10, 20, 40 & 50). (b) The traces for different values of D collapse when the error is normalized by the (maximum) value at N_spikes = D(D+1)/2 for each D, and plotted as a function of 2N_spikes/(D(D+1)).
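The quantities plotted in these figures are simple to compute; the sketch below is our own reading of the captions of Figs. 3c and 4, with hypothetical array arguments, collecting the error metric and the collapse coordinate in one place.

```python
import numpy as np

def rms_error(K, Q):
    """Reconstruction error <(K - Q)^2>^(1/2), as in Figs. 3c and 4a."""
    return np.sqrt(np.mean((K - Q) ** 2))

def collapse(n_spikes, errors, D):
    """Fig. 4b: normalize errors by the value at N = D(D+1)/2 spikes and
    rescale the spike count by the number of free parameters."""
    n_free = D * (D + 1) // 2
    err0 = errors[np.argmin(np.abs(np.asarray(n_spikes) - n_free))]
    return np.asarray(n_spikes) / n_free, np.asarray(errors) / err0
```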
IV. CONSTRAINED FRAMEWORKS

The most general stimulus energy is described by D(D+1)/2 parameters, and this quickly becomes large for high dimensional stimuli. In many cases it is plausible that the matrix kernel of the stimulus energy has some simpler structure, which can be used to reduce the number of parameters.

One way to simplify the description is to use a matrix that has low rank. If, for example, the rank of the matrix Q in Eq. (23) is p ≤ D, then we can find a set of orthogonal vectors {v_i} such that

Q = \sum_{i=1}^{p} \mathbf{v}_i \mathbf{v}_i^{\rm T}.   (42)

In terms of these vectors, the stimulus energy is just x = Σ_{i=1}^{p} (v_i^T · s)².

The low rank approximation reminds us of the simpler, Euclidean notion of dimensionality reduction discussed above. Thus, we could introduce variables x_i = v_i^T · s for i = 1, 2, ..., p. The response would then be approximated as depending on all of these variables, r(x_1, x_2, ..., x_p), as in Eq. (7). In the stimulus energy approach, all of these multiple Euclidean projections are combined into x = Σ_i x_i², so that we have a more constrained but potentially more tractable description. When Q is written as Q = Σ_{i=1}^{p} v_i v_i^T, the relevant gradient of information, analogous to Eq. (37), is

\nabla_{\mathbf{v}_i} I = 2 \int dx \, P_Q(x) \left[ \langle \mathbf{s} (\mathbf{v}_i^{\rm T} \mathbf{s}) | x, {\rm spike} \rangle - \langle \mathbf{s} (\mathbf{v}_i^{\rm T} \mathbf{s}) | x \rangle \right] \frac{d}{dx} \left[ \frac{P_Q(x|{\rm spike})}{P_Q(x)} \right],   (43)

and we can turn this into an algorithm for updating our estimates of the v_i,

\mathbf{v}_i \to \mathbf{v}_i + \gamma \, \nabla_{\mathbf{v}_i} I.   (44)

There is a free direction corresponding to the overall normalization of the matrix Q [6, 17], which makes the mutual information invariant to this reparameterization; we orthogonalize the vectors v_i during the learning procedure to make sure that they do not grow without bound.

Another way of constraining the kernel of the stimulus energy is to assume that it is smooth as we move from one stimulus dimension to the next. Smooth matrices can be expanded into weighted sums of basis functions,

Q = \sum_{\mu=1}^{M} \alpha_\mu B^{(\mu)},   (45)

and finding the optimal matrix then is equivalent to calculating the optimal coefficients α_µ.
