Binary Classification: “Let Z = X × {0,1}, let H be a set of functions h: X → {0,1}, and let f(h; (x, y)) = 1[h(x) ≠ y]. Here, f(·) is simply the 0−1 loss function, measuring whether the binary hypothesis h(·) misclassified the example (x, y).”
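As a concrete illustration, here is a minimal Python sketch of this 0−1 loss; the threshold hypothesis below is a hypothetical stand-in for some h ∈ H, not something given in the text:

```python
def zero_one_loss(h, x, y):
    """0-1 loss f(h; (x, y)): 1 if h misclassifies (x, y), else 0."""
    return 0 if h(x) == y else 1

# Hypothetical hypothesis: label x as 1 whenever x >= 0.5.
h = lambda x: 1 if x >= 0.5 else 0
print(zero_one_loss(h, 0.7, 1))  # 0: correctly classified
print(zero_one_loss(h, 0.7, 0))  # 1: misclassified
```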
This setting’s ultimate goal is to select a hypothesis “h ∈ H”, based on a finite number of samples, with the least possible risk. In general, we expect the approximation of the risk to improve as the sample size grows. It is assumed that “learning rules that enable us to choose such hypotheses are consistent”. In formal terms, we say that “rule A” is consistent with rate “εcons(m)” under distribution “D” if for all “m”,

“E_{S∼D^m}[F(A(S)) − F*] ≤ εcons(m)”,

where “F* = inf_{h∈H} F(h)” and the rate “εcons(m)” is required to be monotone decreasing with “εcons(m) → 0”.
We cannot choose a “D-based” learning rule because “D” is unknown. Rather, we need a “stronger requirement that the rule is consistent with rate εcons(m) under all distributions D over Z”. The key definition is as follows: “A learning problem is learnable, if there exist a learning rule A and a monotonically decreasing sequence εcons(m), such that εcons(m) → 0 as m → ∞, and

∀D, E_{S∼D^m}[F(A(S)) − F*] ≤ εcons(m).

A learning rule A for which this holds is denoted as a universally consistent learning rule.”
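To see what this definition asks for operationally, the sketch below estimates the expected excess risk E_{S∼D^m}[F(A(S)) − F*] by Monte Carlo for one toy distribution. The distribution, the hypothesis class of thresholds, and the choice of empirical risk minimization (introduced below) as the rule A are all illustrative assumptions, not taken from the text:

```python
import random

random.seed(0)

# Toy distribution D over Z = X x {0,1}: x uniform on [0,1], true label
# 1 when x >= 0.3, flipped with probability 0.1 (label noise). Hypothetical.
def draw_example():
    x = random.random()
    y = 1 if x >= 0.3 else 0
    if random.random() < 0.1:
        y = 1 - y
    return (x, y)

# Small finite hypothesis class H of threshold classifiers (hypothetical).
H = [lambda x, t=t: 1 if x >= t else 0 for t in [i / 20 for i in range(21)]]

def empirical_risk(h, sample):          # F_S(h): average 0-1 loss on S
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def rule_A(sample):                     # A(S): an empirical risk minimizer over H
    return min(H, key=lambda h: empirical_risk(h, sample))

# Approximate the true risk F(h) = E_{z~D}[f(h; z)] with a large held-out sample.
holdout = [draw_example() for _ in range(50_000)]
def risk(h):
    return empirical_risk(h, holdout)

F_star = min(risk(h) for h in H)        # F* = inf_{h in H} F(h)

for m in (10, 100, 1000):
    reps = 200                          # Monte Carlo estimate of E_{S~D^m}[F(A(S)) - F*]
    excess = sum(risk(rule_A([draw_example() for _ in range(m)])) - F_star
                 for _ in range(reps)) / reps
    print(f"m = {m:4d}   estimated expected excess risk = {excess:.4f}")
```

For this particular D the estimate should shrink as m grows, which is what a consistency rate εcons(m) → 0 promises; universal consistency requires the same behaviour under every distribution D over Z.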
The above definition of learnability, which requires a uniform rate over all distributions, is the most appropriate concept for studying the learnability of a hypothesis class. It is a direct generalization of “agnostic PAC-learnability” to “Vapnik’s General Setting of Learning”, as studied by Haussler in 1992. A potential path to learning is minimization of the empirical risk “F_S(h)” over a sample “S”, defined as

“F_S(h) = (1/m) ∑_{z∈S} f(h; z)”.
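A minimal sketch of this computation, reusing the hypothetical 0−1 loss and a hypothetical sample S for illustration:

```python
def zero_one_loss(h, x, y):
    return 0 if h(x) == y else 1

def empirical_risk(h, sample):
    """F_S(h): average loss of h over the sample S = [z_1, ..., z_m]."""
    return sum(zero_one_loss(h, x, y) for x, y in sample) / len(sample)

# Hypothetical sample S of (x, y) pairs and threshold hypothesis h.
S = [(0.2, 0), (0.4, 1), (0.6, 1), (0.9, 0)]
h = lambda x: 1 if x >= 0.5 else 0
print(empirical_risk(h, S))  # 0.5: h errs on (0.4, 1) and (0.9, 0)
```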
The notation used in this setting is summarized below:

Z, z          Instance domain and a specific instance.
H, h          Hypothesis class and a specific hypothesis.
f(h; z)       Loss of hypothesis h on instance z.
B             sup_{h,z} |f(h; z)|.
D             Underlying distribution on instance domain Z.
S             Empirical sample z_1, ..., z_m, sampled i.i.d. from D.
m             Size of empirical sample S.
A(S)          Learning rule A applied to empirical sample S.
εcons(m)      Rate of consistency for a learning rule.
F(h)          Risk of hypothesis h, E_{z∼D}[f(h; z)].
F*            inf_{h∈H} F(h).
F_S(h)        Empirical risk of hypothesis h on sample S, (1/m) ∑_{z∈S} f(h; z).
ĥ_S           An ERM hypothesis, F_S(ĥ_S) = inf_{h∈H} F_S(h).
εerm(m)       Rate of AERM for a learning rule.
εstable(m)    Rate of stability for a learning rule.
εgen(m)       Rate of generalization for a learning rule.
The “rule A” is an “Empirical Risk Minimizer (ERM)” if it minimizes the empirical risk:

“F_S(A(S)) = F_S(ĥ_S) = inf_{h∈H} F_S(h)”.