CSC 411 Lecture 10: Neural Networks I
Ethan Fetaya, James Lucas and Emad Andrews
University of Toronto
Today
Multi-layer Perceptron
Forward propagation
Backward propagation




Motivating Examples





Are You Excited about Deep Learning?
Linear classifiers (e.g., logistic regression) classify inputs based on linear combinations of the input features x_i
Many decisions involve non-linear functions of the input
Canonical example: do 2 input elements have the same value?


The positive and negative cases cannot be separated by a plane. What can we do?
How to Construct Nonlinear Classifiers?
We would like to construct non-linear discriminative classifiers that utilize functions of input variables
Use a large number of simpler functions

  • If these functions are fixed (Gaussian, sigmoid, polynomial basis functions), then optimization still involves linear combinations of (fixed functions of) the inputs

  • Or we can make these functions depend on additional parameters; then we need an efficient method for training the extra parameters

Inspiration: The Brain
Many machine learning methods inspired by biology, e.g., the (human) brain
Our brain has ~ 10^11 neurons, each of which communicates with (is connected to) ~ 10^4 other neurons











Figure: The basic computational unit of the brain: Neuron



Mathematical Model of a Neuron


Neural networks define functions of the inputs (hidden features), computed by neurons
Artificial neurons are called units


Figure: A mathematical model of the neuron in a neural network
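To make the model concrete, here is a minimal NumPy sketch of a single unit (not from the slides); the particular weights, bias, and the choice of a sigmoid activation are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation: squashes the net input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def unit_output(x, w, b, activation=sigmoid):
    # A single artificial unit: weighted sum of inputs plus a bias,
    # passed through a nonlinear activation function
    z = np.dot(w, x) + b      # net input
    return activation(z)      # activity (output) of the unit

# Example with made-up inputs and weights
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])
print(unit_output(x, w, b=0.1))
```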
Activation Functions
Most commonly used activation functions:

Sigmoid:

σ(z) = 1 / (1 + exp(-z))

Tanh:

tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z))


ReLU (Rectified Linear Unit): ReLU(z) = max(0, z)
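A small NumPy sketch of these three activations, for reference (the sample inputs are made up):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)); output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)); output in (-1, 1)
    return np.tanh(z)

def relu(z):
    # ReLU(z) = max(0, z); zero for negative inputs, identity otherwise
    return np.maximum(0.0, z)

z = np.linspace(-3.0, 3.0, 7)   # a few sample inputs
print(sigmoid(z))
print(tanh(z))
print(relu(z))
```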


Network with one layer of four hidden units:






Figure: Two different visualizations of a 2-layer neural network. In this example: 3 input units, 4 hidden units and 2 output units
Each unit computes its value by applying an activation function to a linear combination of the values of the units that point into it
Naming conventions: a 2-layer neural network has

  • One layer of hidden units

  • One output layer

(we do not count the inputs as a layer)
[http://cs231n.github.io/neural-networks-1/]

Going deeper: a 3-layer neural network with two layers of hidden units

Figure: A 3-layer neural net with 3 input units, 4 hidden units in the first and second hidden layer and 1 output unit

Naming conventions: an N-layer neural network has

  • N - 1 layers of hidden units

  • One output layer

Representational Power
A neural network with at least one hidden layer is a universal approximator (it can approximate any continuous function arbitrarily well).
Proof in: Approximation by Superpositions of a Sigmoidal Function, Cybenko, 1989.


The capacity of the network increases with more hidden units and more hidden layers
Why go deeper? (This is still somewhat of an open theoretical question.) One hidden layer might need an exponential number of neurons; a deep network can be more compact.
Demo
A great tool to visualize networks: http://playground.tensorflow.org/. Highly recommended to play with it!
Neural Networks
Two main phases:

  • Forward pass: Making predictions

  • Backward pass: Computing gradients

Forward Pass: What does the Network Compute?
Output of the network can be written as:
h_j(x) = f( v_j0 + Σ_{i=1}^{D} x_i v_ji )

o_k(x) = g( w_k0 + Σ_{j=1}^{J} h_j(x) w_kj )

(j indexes the hidden units, k indexes the output units, D is the number of inputs)
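One possible NumPy implementation of this forward pass is sketched below; the matrix/vector shapes, the separate bias vectors v0 and w0, and the tanh/sigmoid choices for f and g are assumptions made for illustration.

```python
import numpy as np

def forward(x, V, v0, W, w0, f=np.tanh, g=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Forward pass of a network with one hidden layer.

    x  : (D,)   input vector
    V  : (J, D) hidden weights, V[j, i] = v_ji
    v0 : (J,)   hidden biases v_j0
    W  : (K, J) output weights, W[k, j] = w_kj
    w0 : (K,)   output biases w_k0
    """
    h = f(v0 + V @ x)      # h_j(x) = f(v_j0 + sum_i x_i v_ji)
    o = g(w0 + W @ h)      # o_k(x) = g(w_k0 + sum_j h_j(x) w_kj)
    return h, o

# Tiny example: 3 inputs, 4 hidden units, 2 outputs (matching the earlier figure)
rng = np.random.default_rng(0)
x = rng.normal(size=3)
h, o = forward(x, rng.normal(size=(4, 3)), np.zeros(4),
               rng.normal(size=(2, 4)), np.zeros(2))
print(h.shape, o.shape)   # (4,) (2,)
```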





σ(z) = 1 / (1 + exp(-z)),    tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)),    ReLU(z) = max(0, z)
Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)


What if we don't use any activation function?

Special Case
What is a single layer network (no hidden units) with a sigmoid activation function?


o_k(x) = 1 / (1 + exp(-z_k)),    z_k = w_k0 + Σ_{j=1}^{J} x_j w_kj
Logistic regression!
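As a quick sanity check (not from the slides), a network with no hidden layer and a single sigmoid output computes exactly the logistic regression predictor; the weights and input below are made up.

```python
import numpy as np

# A "network" with no hidden layer and a sigmoid output unit computes
# p(y = 1 | x) = sigmoid(w0 + w . x), i.e. logistic regression.
def predict(x, w, w0):
    z = w0 + np.dot(w, x)              # z = w_0 + sum_j x_j w_j
    return 1.0 / (1.0 + np.exp(-z))    # o(x) = sigmoid(z)

print(predict(np.array([1.0, 2.0]), np.array([0.5, -0.25]), w0=0.1))
```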
Feedforward network
Feedforward network: the connections form a directed acyclic graph (DAG). The layout can be more complicated than just k hidden layers.

How do we train?
We've seen how to compute predictions.
How do we train the network to make sensible predictions?
Training Neural Networks
How do we find weights?
w* = argmin_w Σ_{n=1}^{N} loss(o^(n), t^(n))
where o = f(x; w) is the output of the neural network.
We can use any (smooth) loss function we want.
Problem: with hidden units the objective is no longer convex!
There is no guarantee that gradient methods won't end up in a (bad) local minimum or saddle point.
There is some theoretical and experimental evidence that most local minima are good, i.e., almost as good as the global minimum.
SGD with some (critical) tweaks works well, although this is not really well understood.
Training Neural Networks: Back-propagation
Back-propagation: an efficient method for computing gradients needed to perform gradient-based optimization of the weights in a multi-layer network
Training neural nets: Loop until convergence:
for each example n

  1. Given input x^(n), propagate activity forward (x^(n) → h^(n) → o^(n)) (forward pass)

  2. Propagate gradients backward (backward pass)

  3. Update each weight (via gradient descent)

Given any error function E and activation functions g(·) and f(·), we just need to derive the gradients.
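A toy, self-contained sketch of this loop is shown below; to keep it runnable it uses a single linear output unit with squared error (so the backward pass is one line), and the synthetic data, learning rate eta, and epoch count are all made up.

```python
import numpy as np

# Toy version of the loop: a single linear output unit, squared-error loss,
# per-example gradient descent. Data, eta, and epoch count are made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 examples, 3 inputs each
w_true = np.array([1.0, -2.0, 0.5])
T = X @ w_true + 0.01 * rng.normal(size=100)   # targets

w, eta = np.zeros(3), 0.05
for epoch in range(20):                        # "loop until convergence"
    for x, t in zip(X, T):                     # for each example n
        o = w @ x                              # 1. forward pass: prediction o
        grad = (o - t) * x                     # 2. backward pass: dE/dw for E = 0.5 (o - t)^2
        w -= eta * grad                        # 3. update each weight
print(w)                                       # should be close to w_true
```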
Key Idea behind Backpropagation
We don't have targets for a hidden unit, but we can compute how fast the error changes as we change its activity

  • Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities

  • Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined

  • We can compute error derivatives for all the hidden units efficiently

  • Once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit

This is just the chain rule!
Useful Derivatives

Sigmoid:  σ(z) = 1 / (1 + exp(-z));  derivative: σ(z) (1 - σ(z))

Tanh:  tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z));  derivative: 1 / cosh^2(z)

ReLU:  ReLU(z) = max(0, z);  derivative: 1 if z > 0, 0 if z < 0
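These derivatives can be written directly in NumPy; the sketch below is illustrative (and, since the table leaves z = 0 undefined for ReLU, the code simply returns 0 there).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # sigma(z) (1 - sigma(z))

def dtanh(z):
    return 1.0 / np.cosh(z) ** 2      # 1 / cosh^2(z)

def drelu(z):
    return (z > 0).astype(float)      # 1 if z > 0, else 0 (0 chosen at z = 0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(dsigmoid(z), dtanh(z), drelu(z))
```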

Let's take a single layer network and draw it a bit differently:


Figure: a single layer network. Notation: x_i is input unit i, w_ki is the weight from input i to output unit k, z_k is the net input to output unit k, g is the output layer activation function, and o_k = g(z_k) is the output of unit k.



Error gradients for the single layer network:

∂E/∂w_ki = (∂E/∂o_k) (∂o_k/∂z_k) (∂z_k/∂w_ki)

The error gradient is computable for any smooth activation function g(·) and any smooth error function.




Computing Gradients: Single Layer Network







Error gradients for the single layer network:

∂E/∂w_ki = (∂E/∂o_k) (∂o_k/∂z_k) (∂z_k/∂w_ki) = δ_k (∂z_k/∂w_ki) = δ_k x_i

where δ_k := (∂E/∂o_k) (∂o_k/∂z_k) = ∂E/∂z_k is the error derivative w.r.t. the net input of unit k.





Gradient Descent for Single Layer Network
Assuming the error function is mean-squared error (MSE), on a single training example n, we have







∂E/∂o_k^(n) = o_k^(n) - t_k^(n)

Using logistic activation functions:

o_k^(n) = g(z_k^(n)) = (1 + exp(-z_k^(n)))^(-1),    so    ∂o_k^(n)/∂z_k^(n) = o_k^(n) (1 - o_k^(n))



The error gradient is then:



∂E/∂w_ki = Σ_{n=1}^{N} (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂w_ki) = Σ_{n=1}^{N} (o_k^(n) - t_k^(n)) o_k^(n) (1 - o_k^(n)) x_i^(n)


The gradient descent update rule is given by:
w_ki ← w_ki - η ∂E/∂w_ki = w_ki - η Σ_{n=1}^{N} (o_k^(n) - t_k^(n)) o_k^(n) (1 - o_k^(n)) x_i^(n)
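A hedged NumPy version of this batch update is sketched below; the toy data, targets, learning rate, and number of steps are invented purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_logistic_gd_step(W, b, X, T, eta):
    """One batch gradient descent step for a single layer sigmoid network with MSE.
    W: (K, D) weights w_ki, b: (K,) biases w_k0, X: (N, D) inputs, T: (N, K) targets."""
    O = sigmoid(X @ W.T + b)            # o_k^(n)
    delta = (O - T) * O * (1.0 - O)     # (o - t) o (1 - o) = dE/dz per example
    grad_W = delta.T @ X                # sum_n delta_k^(n) x_i^(n)
    grad_b = delta.sum(axis=0)
    return W - eta * grad_W, b - eta * grad_b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
T = (X[:, :1] > 0).astype(float)        # toy targets, K = 1
W, b = np.zeros((1, 3)), np.zeros(1)
for _ in range(200):
    W, b = mse_logistic_gd_step(W, b, X, T, eta=0.1)
print(W, b)
```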
Figure: a 2-layer network. Notation: x_i is input unit i, v_ji is the weight from input i to hidden unit j, u_j is the net input to hidden unit j, f is the hidden layer activation function, h_j = f(u_j) is the output of hidden unit j, w_kj is the weight from hidden unit j to output unit k, z_k is the net input to output unit k, g is the output layer activation function, and o_k = g(z_k) is the output of unit k.

Back-propagation: Sketch on One Training Case


Convert the discrepancy between each output and its target value into an error derivative:

E = (1/2) Σ_k (o_k - t_k)^2,    ∂E/∂o_k = o_k - t_k
Compute error derivatives in each hidden layer from error derivatives in layer above. [assign blame for error at k to each unit j according to its influence on k (depends on wkj )]


Use error derivatives w.r.t. activities to get error derivatives w.r.t. the weights.
∂E/∂w_kj = Σ_{n=1}^{N} (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂w_kj) = Σ_{n=1}^{N} δ_k^(n) h_j^(n)

where δ_k^(n) is the error w.r.t. the net input for unit k.

Hidden weight gradients are then computed via back-prop:

∂E/∂h_j^(n) = Σ_k (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂h_j^(n)) = Σ_k δ_k^(n) w_kj

∂E/∂v_ji = Σ_{n=1}^{N} (∂E/∂h_j^(n)) (∂h_j^(n)/∂u_j^(n)) (∂u_j^(n)/∂v_ji) = Σ_{n=1}^{N} (∂E/∂h_j^(n)) f'(u_j^(n)) x_i^(n)

where u_j^(n) is the net input to hidden unit j on example n.
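Putting the pieces together, here is one possible NumPy sketch of these backprop equations for a single training case, with tanh hidden units (f), sigmoid outputs (g), and squared error; the shapes and sample data are illustrative assumptions rather than anything fixed by the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single_example(x, t, V, v0, W, w0):
    """Gradients for one example of a 1-hidden-layer net:
    h = tanh(v0 + V x), o = sigmoid(w0 + W h), E = 0.5 * ||o - t||^2."""
    # Forward pass
    u = v0 + V @ x                    # net input to hidden units, u_j
    h = np.tanh(u)                    # hidden activities h_j = f(u_j)
    z = w0 + W @ h                    # net input to output units, z_k
    o = sigmoid(z)                    # outputs o_k = g(z_k)
    # Backward pass
    dE_do = o - t                     # dE/do_k
    delta_k = dE_do * o * (1 - o)     # delta_k = dE/dz_k
    dE_dW = np.outer(delta_k, h)      # dE/dw_kj = delta_k h_j
    dE_dw0 = delta_k
    dE_dh = W.T @ delta_k             # dE/dh_j = sum_k delta_k w_kj
    delta_j = dE_dh / np.cosh(u)**2   # dE/du_j = dE/dh_j * f'(u_j)
    dE_dV = np.outer(delta_j, x)      # dE/dv_ji = dE/du_j * x_i
    dE_dv0 = delta_j
    return dE_dV, dE_dv0, dE_dW, dE_dw0

rng = np.random.default_rng(0)
x, t = rng.normal(size=3), np.array([1.0])
V, v0 = rng.normal(size=(4, 3)), np.zeros(4)
W, w0 = rng.normal(size=(1, 4)), np.zeros(1)
print([g.shape for g in backprop_single_example(x, t, V, v0, W, w0)])
```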

