
Image and Video Analysis (Computer Vision)

Giuseppe Manco

Outline

• Neural networks
• CNNs

Credits

• Slides adapted from various courses and books
• Deep Learning (Ettore Ritacco)
• Deep Learning (Bengio, Courville, Goodfellow, 2017)
• Andrej Karpathy
• Computer Vision (I. Gkioulekas), CS CMU Edu
• Computational Visual Recognition (V. Ordonez), CS Virginia Edu

Beyond linear models

• A biological neuron is a cell connected to other neurons and acts as a hub for electrical impulses. A neuron has a roughly spherical cell body called the soma, which processes the incoming signals and converts them into output signals. Input signals are collected from extensions on the cell body called dendrites. Output signals are transmitted to other neurons through another extension called the axon, which prolongs from the cell body and terminates into several branches. The branches end in junctions, called synapses, that transmit signals from one neuron to another.

• The behavior of a neuron is essentially electro-chemical. An electrical potential difference is maintained between the inside and the outside of the soma, due to different concentrations of sodium (Na) and potassium (K) ions. When a neuron receives inputs from a large number of neurons via its synaptic connections, there is a change in the soma potential. If this change is above a given threshold, it results in an electric current flowing through the axon to other cells. The potential then drops below the resting potential and the neuron cannot fire again until the resting potential is restored.

Deep Learning

• Part of machine learning
• Learns representations of the data
• Uses a hierarchy of layers that mimic the behavior of neurons in the brain

Without feature engineering

[Diagram] Traditional pipeline: Input Data → Feature engineering → Traditional learning algorithm. Deep pipeline: Input Data → Deep learning algorithm.

Representation learning

[Diagram] Hand-crafted pipeline: Input Pixels → Features (concatenation) → SVM / Linear Classifier → Answer.

Representation learning

[Diagram] Learned pipeline: Input Pixels → network layers → Answer.

The layers of the network learn the features automatically

(GoogLeNet)

Y. LeCun, M.A. Ranzato

The Mammalian Visual Cortex is Hierarchical

[picture from Simon Thorpe]

[Gallant & Van Essen]

• The ventral (recognition) pathway in the visual cortex has multiple stages
• Retina - LGN - V1 - V2 - V4 - PIT - AIT ....
• Lots of intermediate representations

Perceptron Learning

• Base function:

$y = \sigma(a), \qquad a = \sum_{j=1}^{d} w_j x_j + b, \qquad \sigma(a) = \frac{1}{1 + e^{-a}}$

• Binary classification
• Linear separability

• Two extensions:
  • K classes
  • Nonlinear relations

Extensions

• K classes:

$y_k = f(a_k), \qquad a_k = \sum_{j=1}^{d} w_{kj} x_j + b_k$

In matrix form: $\mathbf{a} = \mathbf{W}\mathbf{x} + \mathbf{b}, \qquad \mathbf{y} = f(\mathbf{a})$

• Nonlinear relations:

$a_k = \sum_{j=1}^{d} w_{kj}\,\phi(x_j) + b_k, \qquad \mathbf{a} = \mathbf{W}\phi(\mathbf{x}) + \mathbf{b}$

General format

• A tuple:

$net = \langle g, l, o, i, fpp \rangle$

𝑔: the graph… network topology and operators

𝑔: the graph

• $g = (N, E)$ is a weighted directed graph
• Each node $i \in N$ is a perceptron, characterized by two elements:
  • A value $a_i$ (the activation)
  • An activation function $f_i$, which, applied to the activation, produces the output $z_i$
• An edge $e = (j \to i) \in E$, with $j, i \in N$, is associated with a weight $w_{ji}$
• Each node $i$ is also connected by a special edge to a phantom node; the weight $b_i$ of this edge is called the bias

𝑔: the graph

• Each neuron is a computing unit:

$z_i = f_i(a_i), \qquad a_i = b_i + \sum_{j:\, j \to i \in E} w_{ji}\, z_j$

𝑔: the graph

• Three categories of nodes:
  • Input: their values are "overwritten" from the outside (e.g. $x_1$, $x_2$)
  • Hidden: computing units
  • Output: provide values to the outside

𝑔: the graph

• A combination of connected neurons → complex computation → Operation
• Nodes that share the same input are structured in layers (a runnable sketch of the example follows below)

Example (2 inputs $x_1, x_2$, 3 hidden nodes, 1 output):

$z_1 = x_1, \qquad z_2 = x_2$

$z_3 = f_3(b_3 + w_{13} x_1 + w_{23} x_2), \quad z_4 = f_4(b_4 + w_{14} x_1 + w_{24} x_2), \quad z_5 = f_5(b_5 + w_{15} x_1 + w_{25} x_2)$

$y = z_6 = f_6\big(b_6 + w_{36} z_3 + w_{46} z_4 + w_{56} z_5\big)$

Expanding:

$y = f_6\Big(b_6 + w_{36} f_3(b_3 + w_{13}x_1 + w_{23}x_2) + w_{46} f_4(b_4 + w_{14}x_1 + w_{24}x_2) + w_{56} f_5(b_5 + w_{15}x_1 + w_{25}x_2)\Big)$
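A minimal NumPy sketch of the forward pass of this 2-3-1 example (my illustration, not from the slides; the weight and bias values are arbitrary, and I assume a sigmoid activation for all nodes):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical parameter values, just to make the example runnable.
W1 = np.array([[0.1, -0.3],      # row j holds the weights w_{1j}, w_{2j} of hidden node j
               [0.8,  0.2],
               [-0.5, 0.7]])
b1 = np.array([0.0, 0.1, -0.1])  # biases b_3, b_4, b_5
W2 = np.array([0.4, -0.6, 0.9])  # weights w_{36}, w_{46}, w_{56}
b2 = 0.05                        # bias b_6

x = np.array([1.0, 2.0])         # inputs x_1, x_2

z_hidden = sigmoid(W1 @ x + b1)  # z_3, z_4, z_5
y = sigmoid(W2 @ z_hidden + b2)  # y = z_6
print(y)
```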

𝑔: the graph

• Compact notation: given two consecutive layers $h$ and $k$:

$\mathbf{z}_k = f\big(\mathbf{b}_k + \mathbf{W}\,\mathbf{z}_h\big)$

• Note: all nodes of the layer share the same activation function $f$
• $\mathbf{W}$ is the matrix of the weights associated with the edges between layers $h$ and $k$

Feed-Forward Networks: Components

[Diagram] Input $x$ → first layer → hidden variables $z_1, z_2, \ldots, z_k$ → output layer → $y$

Input

• Represented as a vector
• Sometimes requires some preprocessing, e.g.,
  • Subtract the mean
  • Normalize to [-1, 1]
  • Expand

Output layers

• Regression: $y = \mathbf{w}^{\top}\mathbf{z} + b$
• Linear units: no nonlinearity

Output layers

• Multi-dimensional regression: $\mathbf{y} = \mathbf{W}^{\top}\mathbf{z} + \mathbf{b}$
• Linear units: no nonlinearity

Output layers

• Binary classification: $y = \sigma(\mathbf{w}^{\top}\mathbf{z} + b)$
• Corresponds to using logistic regression on $\mathbf{z}$

Output layers

• Multi-class classification: $\mathbf{y} = \mathrm{softmax}(\mathbf{a})$ where $\mathbf{a} = \mathbf{W}^{\top}\mathbf{z} + \mathbf{b}$
• Corresponds to using multi-class logistic regression on $\mathbf{z}$

$\mathrm{softmax}_k(\mathbf{a}) = \dfrac{e^{a_k}}{\sum_{j=1}^{K} e^{a_j}} = \dfrac{e^{a_k - a_{\max}}}{\sum_{j=1}^{K} e^{a_j - a_{\max}}}$
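The second form above is the numerically stable one: subtracting $a_{\max}$ leaves the result unchanged but avoids overflow of $e^{a_k}$. A minimal NumPy sketch (the function is mine, not from the slides):

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shifted = a - np.max(a)   # same result as the plain softmax, but no overflow
    exp = np.exp(shifted)
    return exp / exp.sum()

print(softmax(np.array([1000.0, 1001.0, 1002.0])))  # works even with large logits
```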

Hidden layers

• Each neuron is a weighted combination of the previous layer
• So we can think of it as outputting one value for the next layer

$\mathbf{z}_{i+1} = f_i\big(\mathbf{W}_i^{\top}\mathbf{z}_i + \mathbf{b}_i\big)$

Activation functions

• The form of $f_i$ has a strong influence on the results
• Historically, $f(a)$ has taken two forms:

$\sigma(a) = \dfrac{1}{1 + e^{-a}} \qquad\qquad \tanh(a) = \dfrac{1 - e^{-2a}}{1 + e^{-2a}}$
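As a quick reference (my own sketch, not part of the slides), both activations and their derivatives fit in a few lines; the derivatives are what saturate, which matters for the vanishing-gradient discussion below:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)            # at most 0.25, close to 0 for large |a| (saturation)

def tanh_prime(a):
    return 1.0 - np.tanh(a) ** 2    # at most 1, close to 0 for large |a| (saturation)

for a in (0.0, 2.0, 5.0, 10.0):
    print(a, sigmoid_prime(a), tanh_prime(a))
```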

Gradient Computing

Consider a small computational graph with inputs $a$ and $b$ and nodes

$c = a + b/2, \qquad d = 1/b, \qquad e = c \cdot d$

Forward pass (with $a = 1$, $b = 2$):

$c = 2, \qquad d = 0.5, \qquad e = 1$

Backward pass. Local derivatives:

$\dfrac{\partial e}{\partial c} = d = 0.5, \quad \dfrac{\partial e}{\partial d} = c = 2, \quad \dfrac{\partial c}{\partial a} = 1, \quad \dfrac{\partial c}{\partial b} = 0.5, \quad \dfrac{\partial d}{\partial b} = -\dfrac{1}{b^2} = -0.25$

Chain rule:

$\dfrac{\partial e}{\partial a} = \dfrac{\partial e}{\partial c} \cdot \dfrac{\partial c}{\partial a} = 0.5 \cdot 1 = 0.5$

Distribution rule: $b$ influences $e$ along two paths (through $c$ and through $d$), so the two contributions are summed:

$\dfrac{\partial e}{\partial b} = \dfrac{\partial e}{\partial c} \cdot \dfrac{\partial c}{\partial b} + \dfrac{\partial e}{\partial d} \cdot \dfrac{\partial d}{\partial b} = 0.5 \cdot 0.5 - 2 \cdot 0.25 = -0.75$
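The same forward and backward pass in code (my illustration, variable names mirror the graph above):

```python
# Forward pass
a, b = 1.0, 2.0
c = a + b / 2          # c = 2.0
d = 1.0 / b            # d = 0.5
e = c * d              # e = 1.0

# Backward pass: local derivatives
de_dc = d              # e = c*d  ->  de/dc = d
de_dd = c              # e = c*d  ->  de/dd = c
dc_da = 1.0            # c = a + b/2  ->  dc/da = 1
dc_db = 0.5            # c = a + b/2  ->  dc/db = 1/2
dd_db = -1.0 / b**2    # d = 1/b  ->  dd/db = -1/b^2

de_da = de_dc * dc_da                      # chain rule: 0.5
de_db = de_dc * dc_db + de_dd * dd_db      # sum over the two paths: -0.75
print(de_da, de_db)
```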

Patterns in gradient flow

• add gate: gradient distributor
• mul gate: "swap multiplier"
• max gate: gradient router
• copy gate: gradient adder

(slides: Fei-Fei Li, Justin Johnson, Serena Yeung)
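A toy illustration of these local gradient patterns (mine, not from the slides; `g` is an arbitrary upstream gradient):

```python
x, y = 3.0, -1.0
g = 2.0                      # upstream gradient arriving at the gate output

# add gate: distributes the upstream gradient unchanged to both inputs
dx_add, dy_add = g, g

# mul gate: "swaps" the inputs as local multipliers
dx_mul, dy_mul = g * y, g * x

# max gate: routes the full gradient to the larger input, 0 to the other
dx_max, dy_max = (g, 0.0) if x > y else (0.0, g)

# copy gate (one value used in two places): gradients from the branches add up
g1, g2 = 0.7, 1.3
dx_copy = g1 + g2

print(dx_add, dy_add, dx_mul, dy_mul, dx_max, dy_max, dx_copy)
```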

Vector derivatives on nodes

• Scalar input, scalar output: derivative

$x \in \mathbb{R},\; y \in \mathbb{R} \qquad \dfrac{\partial y}{\partial x} \in \mathbb{R}$

Vector derivatives on nodes

• Vector input, scalar output: gradient

$x \in \mathbb{R}^{n},\; y \in \mathbb{R} \qquad \dfrac{\partial y}{\partial x} \in \mathbb{R}^{n}, \qquad \left(\dfrac{\partial y}{\partial x}\right)_i = \dfrac{\partial y}{\partial x_i}$

Vector derivatives on nodes

• Vector input, vector output: Jacobian

$x \in \mathbb{R}^{n},\; y \in \mathbb{R}^{m} \qquad \dfrac{\partial y}{\partial x} \in \mathbb{R}^{n \times m}, \qquad \left(\dfrac{\partial y}{\partial x}\right)_{i,j} = \dfrac{\partial y_j}{\partial x_i}$

Vector derivatives on nodes

$x \in \mathbb{R}^{n}, \qquad y \in \mathbb{R}^{m}, \qquad z \in \mathbb{R}^{p}$

For a node that computes $z$ from $x$ and $y$, with a scalar output $o$ downstream:

$\dfrac{\partial o}{\partial z} \in \mathbb{R}^{p \times 1}, \qquad \dfrac{\partial z}{\partial x} \in \mathbb{R}^{n \times p}, \qquad \dfrac{\partial z}{\partial y} \in \mathbb{R}^{m \times p}$

Chain rule (a shape-checking sketch follows below):

$\dfrac{\partial o}{\partial x} = \dfrac{\partial z}{\partial x} \cdot \dfrac{\partial o}{\partial z} \in \mathbb{R}^{n \times 1}, \qquad \dfrac{\partial o}{\partial y} = \dfrac{\partial z}{\partial y} \cdot \dfrac{\partial o}{\partial z} \in \mathbb{R}^{m \times 1}$
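A small NumPy sketch (mine, with arbitrary dimensions) checking the shapes in the chain rule above:

```python
import numpy as np

n, m, p = 4, 3, 5                      # dims of x, y, z (arbitrary)
dz_dx = np.random.randn(n, p)          # Jacobian of z w.r.t. x
dz_dy = np.random.randn(m, p)          # Jacobian of z w.r.t. y
do_dz = np.random.randn(p, 1)          # gradient of the scalar output o w.r.t. z

do_dx = dz_dx @ do_dz                  # (n, p) @ (p, 1) -> (n, 1)
do_dy = dz_dy @ do_dz                  # (m, p) @ (p, 1) -> (m, 1)
print(do_dx.shape, do_dy.shape)
```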

Exercise

• Draw the graph
• Compute the gradients

Activation functions and gradients

• Problem: saturation
• In the flat regions of $\sigma$ the gradient is (almost) zero

Vanishing gradient

Consider a chain of sigmoid units, each with a single weight:

$z_1 = \sigma(x \cdot w_1), \quad z_2 = \sigma(z_1 \cdot w_2), \quad z_3 = \sigma(z_2 \cdot w_3), \quad z_4 = \sigma(z_3 \cdot w_4), \quad y = \sigma(z_4 \cdot w_5)$

With $x = 2$ and weights $w_1 = 1,\; w_2 = 0.1,\; w_3 = 1.2,\; w_4 = -0.5,\; w_5 = 1$, the forward pass gives

$z_1 = 0.88, \quad z_2 = 0.52, \quad z_3 = 0.65, \quad z_4 = 0.41, \quad y = 0.60$

What is $\dfrac{\partial y}{\partial w_1}$? By the chain rule:

$\dfrac{\partial y}{\partial w_1} = \sigma'(z_4 w_5)\, w_5 \;\cdot\; \sigma'(z_3 w_4)\, w_4 \;\cdot\; \sigma'(z_2 w_3)\, w_3 \;\cdot\; \sigma'(z_1 w_2)\, w_2 \;\cdot\; \sigma'(x w_1)\, x$

Each factor $\sigma'(\cdot)$ is at most $0.25$, so the product shrinks rapidly with depth.

Vanishing Gradient

• Forward pass:

$a^{(h+1)} = W^{(h)} z^{(h)}, \qquad z^{(h+1)} = \sigma\big(a^{(h+1)}\big), \qquad z^{(0)} = x$

• Backward pass:

$\dfrac{\partial \ell}{\partial z^{(h)}} = \big(W^{(h)}\big)^{\top} \dfrac{\partial \ell}{\partial a^{(h+1)}}, \qquad \dfrac{\partial \ell}{\partial a^{(h)}} = \dfrac{\partial \ell}{\partial z^{(h)}} \odot \sigma'\big(a^{(h)}\big)$

Vanishing gradient

• Consequence:
• The gradient "vanishes" exponentially with the depth of the network if the weights are ill-conditioned or the activations lie in the saturation regime of $\sigma$:

$\dfrac{\partial \ell}{\partial z^{(h)}} = \big(W^{(h)}\big)^{\top} \left( \sigma'\big(a^{(h+1)}\big) \odot \dfrac{\partial \ell}{\partial z^{(h+1)}} \right)$
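A small numeric sketch (mine, with arbitrary sizes and random weights) of how this recursion shrinks gradients through a stack of sigmoid layers:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
dim, depth = 10, 20
x = rng.standard_normal(dim)

# Forward pass, storing pre-activations a^{(h)}
weights = [0.5 * rng.standard_normal((dim, dim)) for _ in range(depth)]
z, pre_acts = x, []
for W in weights:
    a = W @ z
    pre_acts.append(a)
    z = sigmoid(a)

# Backward pass: start from an arbitrary dL/dz at the top (all ones)
grad = np.ones(dim)
for W, a in zip(reversed(weights), reversed(pre_acts)):
    s = sigmoid(a)
    grad = W.T @ (s * (1 - s) * grad)   # (W^T)(sigma'(a) ⊙ grad)
print(np.linalg.norm(grad))             # tiny: the gradient has vanished
```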

Hidden layers

• ReLU: $\mathrm{ReLU}(a) = \max(0, a)$, which does not saturate for $a > 0$ and mitigates the vanishing-gradient problem

𝑙: the loss function… defining the network goal

𝑙: the loss function

• $g$ is a nonlinear algebraic operator
• The operator is parametric with respect to the weights:
  • The matrix $W$ and the bias $b$
• The learning phase aims at finding the best values of $W$ and $b$

𝑙: the loss function

• Optimization problem
• What is the desired output?
• When does it differ from the produced output?

𝑙: the loss function

• The loss measures the discrepancy between the predicted output and the desired one
• The objective function:

$\arg\min_{W,B} \; \frac{1}{n} \sum_{i=1}^{n} loss\big(y_i,\, g(x_i \mid W, B)\big)$

𝑙: the loss function

• If the output is a class:
• Binary Cross Entropy (BCE), with $y_i \in \{0, 1\}$, $g(\vec{x}_i \mid W, B) \in [0, 1]$:

$\mathrm{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \ln g(x_i \mid W, B) + (1 - y_i) \ln\big(1 - g(x_i \mid W, B)\big) \Big]$

• Categorical Cross Entropy (CCE), with $K$ classes, $y_{i,k} \in \{0, 1\}$, $g(x_i \mid W, B)_k \in [0, 1]$:

$\mathrm{CCE} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \ln g(x_i \mid W, B)_k$

• Hinge, with $y_i \in \{-1, 1\}$:

$\mathrm{Hinge} = \frac{1}{n} \sum_{i=1}^{n} \max\big(0,\; 1 - y_i \cdot g(x_i \mid W, B)\big)$
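A compact NumPy sketch (mine, not from the slides) of the three losses above, computed on already-obtained predictions:

```python
import numpy as np

def bce(y, p, eps=1e-12):
    """Binary cross entropy; y in {0,1}, p a predicted probability in (0,1)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cce(Y, P, eps=1e-12):
    """Categorical cross entropy; Y one-hot of shape (n, K), P rows sum to 1."""
    return -np.mean(np.sum(Y * np.log(np.clip(P, eps, 1.0)), axis=1))

def hinge(y, s):
    """Hinge loss; y in {-1,1}, s a raw score."""
    return np.mean(np.maximum(0.0, 1.0 - y * s))

print(bce(np.array([1, 0]), np.array([0.9, 0.2])))
print(hinge(np.array([1, -1]), np.array([0.8, -2.0])))
```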

𝑙: the loss function

• BCE:
  • Binary classes
  • Weighs all errors the same way
• CCE:
  • Multiple classes
  • Weighs all errors the same way
• Hinge:
  • Binary classes
  • Weighs all errors the same way
  • Not differentiable (at the hinge point)
  • Penalizes low-confidence predictions
  • Close to 0 when the signs agree and the prediction is close to 1

𝑜: the optimizer… finding optimal solutions

𝑜: the optimizer

• Optimization problem:

$\arg\min_{W,B} \; \frac{1}{n} \sum_{i=1}^{n} loss\big(y_i,\, g(x_i \mid W, B)\big)$

• Stochastic Gradient Descent (SGD)
• And its variants

𝑜: the optimizer

• More control over the updates

• Momentum update (where $W^{*}$ is the velocity term):

$W^{*}_{t+1} = m \cdot W^{*}_t - \eta \nabla l_t, \qquad W_{t+1} = W_t + W^{*}_{t+1}$

• Annealing:
  • Adaptive learning rate
  • E.g. exponential decay: $\lambda_t = \lambda_0 \cdot e^{-kt}$ (see the sketch below)
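A minimal sketch (mine) of one SGD step with momentum and exponential learning-rate decay; `grad_loss` is an assumed gradient function, not defined in the slides:

```python
import numpy as np

def sgd_momentum(W, v, grad_loss, t, lr0=0.1, m=0.9, k=0.01):
    """One SGD step with momentum and exponential learning-rate decay."""
    lr = lr0 * np.exp(-k * t)        # annealing: lambda_t = lambda_0 * e^{-kt}
    v = m * v - lr * grad_loss(W)    # momentum update of the velocity
    W = W + v                        # move the weights along the velocity
    return W, v

# Toy usage: minimize ||W||^2, whose gradient is 2W
W, v = np.array([1.0, -2.0]), np.zeros(2)
for t in range(100):
    W, v = sgd_momentum(W, v, lambda w: 2 * w, t)
print(W)   # close to the minimum at the origin
```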

𝑜: the optimizer

• Variants of SGD
• They take into account:
  • Reachability
  • Convergence speed
  • Overfitting

𝑜: the optimizer

• Variants of SGD: AdaGrad

$W_{t+1} = W_t - \eta \left(\epsilon \cdot I + \mathrm{diag}\Big(\textstyle\sum_{\tau \le t} \nabla l_\tau \nabla l_\tau^{\top}\Big)\right)^{-1/2} \nabla l_t$

• Weights with a large (accumulated) gradient get a reduced learning rate
• Weights with a small gradient get an amplified learning rate
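An element-wise sketch of the AdaGrad update (mine; `grad` is the current gradient, `G` the running accumulator):

```python
import numpy as np

def adagrad_step(W, G, grad, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients, scale the step per weight."""
    G = G + grad ** 2                       # running sum of squared gradients
    W = W - lr * grad / (eps + np.sqrt(G))  # large accumulated gradient -> smaller step
    return W, G
```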

𝑜: the optimizer

• Variants of SGD: RMSprop

$\zeta_{t+1} = \alpha \cdot \zeta_t + (1 - \alpha) \cdot \big(\nabla l_t\big)^2$

$W_{t+1} = W_t - \dfrac{\eta}{\epsilon \cdot I + \sqrt{\zeta_{t+1}}} \, \nabla l_t$

• Softens AdaGrad's aggressive policy of reducing the learning rate
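The corresponding element-wise sketch (mine), using an exponential moving average instead of AdaGrad's full sum:

```python
import numpy as np

def rmsprop_step(W, zeta, grad, lr=0.001, alpha=0.9, eps=1e-8):
    """RMSprop: exponential moving average of squared gradients."""
    zeta = alpha * zeta + (1 - alpha) * grad ** 2
    W = W - lr * grad / (eps + np.sqrt(zeta))
    return W, zeta
```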

𝑜: the optimizer

• Variants of SGD: Adam

$\zeta_{t+1} = \alpha \cdot \zeta_t + (1 - \alpha) \cdot \big(\nabla l_t\big)^2 \;\rightarrow\; \zeta^{*}_{t+1} = \dfrac{\zeta_{t+1}}{1 - \alpha^{t+1}}$

$m_{t+1} = \beta \cdot m_t + (1 - \beta) \cdot \nabla l_t \;\rightarrow\; m^{*}_{t+1} = \dfrac{m_{t+1}}{1 - \beta^{t+1}}$

$W_{t+1} = W_t - \dfrac{\eta \cdot m^{*}_{t+1}}{\epsilon \cdot I + \sqrt{\zeta^{*}_{t+1}}}$

• RMSprop with smoothing (a bias-corrected moving average of the gradient)
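An element-wise sketch of the Adam step (mine), keeping the slide's symbols $m$ and $\zeta$ for the two moving averages:

```python
import numpy as np

def adam_step(W, m, zeta, grad, t, lr=0.001, beta=0.9, alpha=0.999, eps=1e-8):
    """Adam: bias-corrected moving averages of the gradient and its square."""
    m = beta * m + (1 - beta) * grad              # first moment (smoothed gradient)
    zeta = alpha * zeta + (1 - alpha) * grad**2   # second moment (squared gradients)
    m_hat = m / (1 - beta ** (t + 1))             # bias correction
    zeta_hat = zeta / (1 - alpha ** (t + 1))
    W = W - lr * m_hat / (eps + np.sqrt(zeta_hat))
    return W, m, zeta
```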

𝑜: the optimizer

• Comparison

(Source: Stanford class CS231n, MIT License, Image credit: Alec Radford)

𝑖: the initialization… well begun is half done

𝑖: the initialization

• The weights need an initial value
• The initialization has a significant effect on the final result

𝑖: the initialization

• Zero initialization
• Bad:
  • All nodes get the same gradient
  • There is no diversification
  • Symmetry (is never broken)

𝑖: the initialization

• Random initialization (a small sketch contrasting it with zero initialization follows below)
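A minimal sketch (mine, not from the slides) contrasting zero and small random initialization for one hidden layer: with zero weights every hidden unit computes the same value and would receive the same gradient, while random weights break the symmetry:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

W_zero = np.zeros((3, 4))                    # zero init: all hidden units identical
W_rand = 0.01 * rng.standard_normal((3, 4))  # small random init: symmetry broken

print(sigmoid(W_zero @ x))   # [0.5 0.5 0.5] -> identical units, identical gradients
print(sigmoid(W_rand @ x))   # slightly different values per unit
```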