Artificial Neural Networks
Ronan Collobert  [email protected]

Introduction: Neural Networks in 1980

[Figure]

Introduction: Neural Networks in 2011

x → W1 → tanh(·) → W2 → score,   i.e.   score(x) = W2 tanh(W1 x)

Stack matrix-vector multiplications interleaved with non-linearities. This raises several questions:
- Where does this come from?
- How to train them?
- Why does it generalize?
- What about real-life inputs (other than vectors x)?
- Any applications?
(Minimal code sketches for the models in this section are collected at the end of the section.)

Biological Neuron

Dendrites are connected to other neurons through synapses. Excitatory and inhibitory signals are integrated, and if the stimulus reaches a threshold, the neuron fires along the axon.

McCulloch and Pitts (1943)

The neuron as a linear threshold unit: binary inputs x ∈ {0, 1}^d, binary output, vector of weights w ∈ R^d:

    f(x) = 1 if w·x > T, 0 otherwise.

A unit can perform OR and AND operations, and combining these units can represent any boolean function. But how to train them?

Perceptron: Rosenblatt (1957)

Input: retina x ∈ R^n. Associative area: any kind of (fixed) function ϕ(x) ∈ R^d. Decision function:

    f(x) = 1 if w·ϕ(x) > 0, −1 otherwise.

Training: given examples (x^t, y^t) ∈ R^n × {−1, 1}, minimize Σ_t max(0, −y^t w·ϕ(x^t)) with the update rule

    w^{t+1} = w^t + y^t ϕ(x^t)   if y^t w^t·ϕ(x^t) ≤ 0,
    w^{t+1} = w^t                otherwise.

Perceptron: Convergence (Novikoff, 1962)

Assume the classes are separable, and let u define the maximum-margin separating hyperplane, normalized so that y^t u·x^t ≥ 1 for all t. The maximum margin is then ρ_max = 2/||u||; let R = max_t ||x^t||.

[Figure: separating hyperplane u·x = 0 with margin boundaries u·x = 1 and u·x = −1.]

Each time we make a "mistake" (and update),

    u·w^t = u·w^{t−1} + y^t u·x^t ≥ u·w^{t−1} + 1 ≥ t

and, since y^t w^{t−1}·x^t ≤ 0 at a mistake,

    ||w^t||² = ||w^{t−1}||² + 2 y^t w^{t−1}·x^t + ||x^t||² ≤ ||w^{t−1}||² + R² ≤ t R².

By the Cauchy–Schwarz inequality, t ≤ u·w^t ≤ ||u|| ||w^t|| ≤ (2/ρ_max) R √t. We get:

    t ≤ 4R²/ρ_max².

Adaline: Widrow & Hoff (1960)

Problems of the Perceptron:
- Separable case: it does not find a hyperplane equidistant from the two classes.
- Non-separable case: it does not converge.

Adaline (Widrow & Hoff, 1960) instead minimizes

    (1/2) Σ_t (y^t − w·ϕ(x^t))²

with the delta rule:

    w^{t+1} = w^t + λ (y^t − w^t·ϕ(x^t)) ϕ(x^t).

Perceptron: Margin

See (Duda & Hart, 1973), (Krauth & Mézard, 1987), (Collobert, 2004).

The Perceptron has poor generalization capabilities in practice: there is no control on the margin, only

    ρ = 2/||w^T|| ≥ ρ_max/R².

The Margin Perceptron instead minimizes Σ_t max(0, 1 − y^t w·ϕ(x^t)), with the update rule

    w^{t+1} = w^t + λ y^t ϕ(x^t)   if y^t w^t·ϕ(x^t) ≤ 1,
    w^{t+1} = w^t                  otherwise.

It performs a finite number of updates,

    t ≤ (4/ρ_max²) (2/λ + R²),

and it controls the margin:

    ρ ≥ ρ_max/(2 + λR²),

so that, as λ → 0, the achieved margin approaches at least half of the maximum margin.

Perceptron: In Practice

[Figure: decision boundaries on a 2D toy problem. Top row: original Perceptron after 10/40/60 iterations. Bottom row: Margin Perceptron after 10/120/2000 iterations.]
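Code sketches.

To make the 2011 picture concrete, here is a minimal sketch in Python/NumPy of the stacked matrix-vector multiplications interleaved with a tanh non-linearity; the layer sizes are made up, and the names W1, W2 and score follow the diagram above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: input dimension 4, hidden dimension 3, scalar score.
W1 = rng.standard_normal((3, 4))   # first linear layer
W2 = rng.standard_normal((1, 3))   # second linear layer

def score(x):
    """x -> W1 -> tanh(.) -> W2 -> score, as in the diagram."""
    return W2 @ np.tanh(W1 @ x)

x = rng.standard_normal(4)
print(score(x))   # a single real-valued score
```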
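A McCulloch–Pitts unit is easy to simulate. The sketch below uses hypothetical weights and thresholds (not from the slides) to show how one setting of (w, T) computes OR and another computes AND on binary inputs.

```python
import numpy as np

def mp_unit(x, w, T):
    """McCulloch-Pitts linear threshold unit: 1 if w.x > T, else 0."""
    return int(np.dot(w, x) > T)

# OR:  fires if at least one input is on (w = [1, 1], T just below 1).
# AND: fires only if both inputs are on  (w = [1, 1], T just below 2).
w = np.array([1, 1])
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array(x)
    print(x, "OR:", mp_unit(x, w, 0.5), "AND:", mp_unit(x, w, 1.5))
```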
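The Rosenblatt update is a few lines of code. This sketch uses made-up toy Gaussian data and takes ϕ to be the identity; it applies w ← w + y ϕ(x) whenever y w·ϕ(x) ≤ 0, cycling over the data until no mistakes remain.

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Rosenblatt perceptron with phi = identity; labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xt, yt in zip(X, y):
            if yt * np.dot(w, xt) <= 0:   # mistake (or on the boundary)
                w += yt * xt              # w^{t+1} = w^t + y^t x^t
                mistakes += 1
        if mistakes == 0:                 # converged: all points correct
            break
    return w

# Toy separable data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+3, 1, (50, 2)), rng.normal(-3, 1, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)
w = perceptron(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```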
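Novikoff's bound can also be checked numerically. The sketch below (all constants made up) builds a dataset whose margin is known by construction, by shifting points away from a fixed hyperplane u until y u·x ≥ 1, then counts perceptron mistakes and compares against 4R²/ρ_max². Since this u need not be the maximum-margin hyperplane, 2/||u|| only lower-bounds the true ρ_max, so the printed bound is conservative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Known separating direction; enforce a functional margin y (u.x) >= 1.
u = np.array([1.0, -1.0])
X = rng.uniform(-5, 5, (200, 2))
y = np.sign(X @ u)
keep = y != 0                            # drop (measure-zero) boundary points
X, y = X[keep], y[keep]
X += (y / np.dot(u, u))[:, None] * u     # shift so that y (u.x) >= 1
assert np.all(y * (X @ u) >= 1)

R = np.max(np.linalg.norm(X, axis=1))    # radius of the data
rho = 2 / np.linalg.norm(u)              # margin of u: a lower bound on rho_max

# Run the perceptron, counting mistakes across passes until convergence.
w, mistakes, converged = np.zeros(2), 0, False
while not converged:
    converged = True
    for xt, yt in zip(X, y):
        if yt * np.dot(w, xt) <= 0:
            w += yt * xt
            mistakes += 1
            converged = False

print(mistakes, "<=", 4 * R**2 / rho**2)   # Novikoff: t <= 4 R^2 / rho_max^2
```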
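The Adaline delta rule is stochastic gradient descent on the squared loss above. A minimal sketch, again with ϕ = identity and a made-up learning rate λ:

```python
import numpy as np

def adaline(X, y, lam=0.01, epochs=50):
    """Widrow-Hoff delta rule: w <- w + lam * (y - w.x) * x."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xt, yt in zip(X, y):
            w += lam * (yt - np.dot(w, xt)) * xt
    return w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+3, 1, (50, 2)), rng.normal(-3, 1, (50, 2))])
y = np.array([+1.0] * 50 + [-1.0] * 50)
w = adaline(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```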
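Finally, the Margin Perceptron changes only the update condition (update while y w·ϕ(x) ≤ 1 rather than ≤ 0) and adds the rate λ; per the bound above, a small λ buys a margin close to ρ_max/2 at the cost of more updates. A sketch with a made-up λ; at convergence every point has functional margin above 1, so 2/||w|| lower-bounds the achieved geometric margin.

```python
import numpy as np

def margin_perceptron(X, y, lam=0.1, epochs=1000):
    """Update w <- w + lam * y * x whenever y w.x <= 1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        updates = 0
        for xt, yt in zip(X, y):
            if yt * np.dot(w, xt) <= 1:
                w += lam * yt * xt
                updates += 1
        if updates == 0:          # every point now has functional margin > 1
            break
    return w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+3, 1, (50, 2)), rng.normal(-3, 1, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)
w = margin_perceptron(X, y)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
```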