keywords: Machine learning, PAC, PAC learnable
1. Definitions: the formulation of the learning problems
Input space and the set of labels : Let $\mathcal{X}$ denote the set of all possible examples or instances (sometimes also called input space). The set of all possible labels or target values is denoted by $\mathcal{Y}$. By default, the set of all possible values will be ${0, 1}$, i.e. the so-called binary classification.
The definition of concept: Let $c: \mathcal{X} \to \mathcal{Y}$ denote a mapping from $\mathcal{X}$ to $\mathcal{Y}$. For example, take the input space to be the plane and the label set to be $\{0, 1\}$: the concept of a triangle assigns label 1 to every point inside a given triangular region and label 0 to every other point, so the concept of a triangle is well defined. A set of concepts is called a concept class and is denoted by $C$.
The formulation of the learning problem: We assume the samples are drawn i.i.d. according to some fixed but unknown distribution $D$. The learner considers a fixed set of possible concepts $H$, called a hypothesis set, which need not coincide with $C$ (but shares the same input space and output space).
The learner receives a sample $S = (x_1, x_2, \dots, x_m)$ drawn i.i.d. according to $D$, together with the labels $(c(x_1), \dots, c(x_m))$. The learner's task is to use the labeled sample $S$ to select a hypothesis $h_S \in H$ that has a small generalization error with respect to the concept $c$. A learning algorithm is thus a mapping from the labeled sample $S$ to the selected hypothesis $h_S$.
In other words, the learning task is to design an algorithm that finds a mapping from the input space to the output space with as small a generalization error as possible with respect to the distribution $D$.
The definition of generalization error and empirical error: Given a hypothesis $h \in H$, a target concept $c \in C$, and an underlying distribution $D$, the generalization error or risk of $h$ is defined as
$R(h) = \Pr_{x \sim D}[h(x) \neq c(x)] = \mathbb{E}_{x \sim D}[1_{h(x) \neq c(x)}]$
Note that the generalization error cannot be evaluated directly, since we know neither the distribution $D$ nor the target concept $c$. Therefore, we introduce the empirical error, which is based on the observed samples.
Empirical error:
Given a hypothesis $h \in H$, a target concept $c \in C$, and a sample $S = (x_1, x_2, \dots, x_m)$, the empirical error is defined as
$\hat{R}_S(h) = \frac{1}{m}\sum_{i=1}^{m} 1_{h(x_i) \neq c(x_i)}$
Note that we do not know $c$ itself; however, we do observe its values $c(x_i)$ on the sample points, i.e. the labels.
Comments: For a fixed hypothesis $h$, the expectation of the empirical error over an i.i.d. sample is exactly the generalization error, $\mathbb{E}_{S \sim D^m}[\hat{R}_S(h)] = R(h)$. (This part is no different from statistical decision theory; it is mainly a matter of getting used to the notation.)
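To make the comment concrete, here is a minimal Python sketch under a purely hypothetical toy setup (the finite input space, distribution, target concept, and hypothesis below are illustrative choices, not taken from the text): it computes $R(h)$ exactly and compares it with the average of $\hat{R}_S(h)$ over many i.i.d. samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: X = {0, ..., 9}, uniform distribution D,
# target concept c(x) = 1 iff x >= 5, fixed hypothesis h(x) = 1 iff x >= 6.
X = np.arange(10)
p = np.ones(10) / 10                     # the distribution D over X
c = lambda x: (x >= 5).astype(int)       # target concept c
h = lambda x: (x >= 6).astype(int)       # a fixed hypothesis h

# Generalization error R(h) = Pr_{x~D}[h(x) != c(x)]; computable exactly
# here only because D and c are known in this toy example.
R = float(np.sum(p * (h(X) != c(X))))

# Average of the empirical error over many i.i.d. samples of size m.
m, trials = 50, 20_000
emp = []
for _ in range(trials):
    S = rng.choice(X, size=m, p=p)       # S ~ D^m
    emp.append(np.mean(h(S) != c(S)))    # empirical error of h on S

print(R, float(np.mean(emp)))
```

The two printed numbers should agree up to Monte Carlo noise, which is exactly the unbiasedness statement above.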
The definition of PAC learning: A concept class $C$ is said to be PAC-learnable if there exist an algorithm $\mathcal{A}$ and a polynomial function $poly(\cdot, \cdot, \cdot, \cdot)$ such that for any $\epsilon > 0$ and $\delta > 0$, for all distributions $D$ on $\mathcal{X}$ and for any target concept $c \in C$, the following holds for any sample size $m \ge poly(\frac{1}{\epsilon}, \frac{1}{\delta}, n, size(c))$:
$Pr_{S \sim D^m}[R(h_S) \leq \epsilon] \ge 1 - \delta$
where $n$ is an upper bound on the cost of the computational representation of any element $x \in \mathcal{X}$, and $size(c)$ is the maximal cost of the computational representation of $c \in C$. (That is, how much it costs to represent the input space and the output space. Recalling the convergence issues of Monte Carlo methods in high dimensions, and the fact that the volume of a high-dimensional ball concentrates near its shell, one can infer that the representation dimension affects learnability: we must require a sample size that can be bounded by a polynomial. A problem that requires infinitely many samples is not learnable.)
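As a purely illustrative sanity check of this definition (not from the original text), the sketch below simulates the PAC guarantee for a hypothetical finite class of threshold concepts on a discretized line, with a learner that returns any hypothesis consistent with the sample; it estimates $\Pr_{S \sim D^m}[R(h_S) \le \epsilon]$ empirically and compares it with $1 - \delta$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical concept class: thresholds on X = {0, ..., 99}, c_t(x) = 1 iff x >= t.
X = np.arange(100)
thresholds = np.arange(101)                      # H = C = {c_0, ..., c_100}
target_t = 37                                    # the (unknown) target concept c

def risk(t):
    # R(h_t) = Pr_{x~D}[h_t(x) != c(x)] under the uniform D used here.
    return float(np.mean((X >= t) != (X >= target_t)))

def learn(S, labels):
    # Return some hypothesis consistent with the labeled sample.
    for t in thresholds:
        if np.array_equal((S >= t).astype(int), labels):
            return t
    raise RuntimeError("no consistent hypothesis found")

eps, delta, m, trials = 0.05, 0.1, 200, 1000
hits = 0
for _ in range(trials):
    S = rng.choice(X, size=m)                    # i.i.d. sample from the uniform D
    t_hat = learn(S, (S >= target_t).astype(int))
    hits += risk(t_hat) <= eps

# Empirical estimate of Pr_{S~D^m}[R(h_S) <= eps]; should be at least 1 - delta.
print(hits / trials, ">=", 1 - delta)
```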
2. Guarantees for finite hypothesis sets: the consistent case and the inconsistent case
2.1 Consistent case
Let $\mathcal{H}$ be a finite set of functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. Let $\mathcal{A}$ be an algorithm that, for any target concept $c \in \mathcal{H}$ and any i.i.d. sample $S$, returns a consistent hypothesis $h_S$, i.e. $\hat{R}_S(h_S) = 0$. Then, for any $\epsilon, \delta > 0$, the inequality $\mathbb{P}_{S \sim D^m}[R(h_S) \le \epsilon] \ge 1 - \delta$ holds if
$m \ge \frac{1}{\epsilon} (\log|\mathcal{H}| + \log \frac{1}{\delta})$
The above statement is equivalent to the following generalization bound: for any $\delta > 0$, with probability at least $1 - \delta$,
$R(h_S) \le \frac{1}{m} \big(\log |\mathcal{H}| + \log \frac{1}{\delta} \big)$
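For a sense of scale, here is a short numerical sketch of how the two inequalities are used (the value of $|\mathcal{H}|$ and the constants below are arbitrary, hypothetical choices):

```python
import math

# Hypothetical finite hypothesis set with |H| = 2**20.
H_size = 2 ** 20
eps, delta = 0.01, 0.05

# m >= (1/eps) * (log|H| + log(1/delta)) guarantees R(h_S) <= eps
# with probability at least 1 - delta (consistent case).
m_required = math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)
print("samples needed:", m_required)

# Conversely, with m samples the risk of a consistent h_S satisfies
# R(h_S) <= (1/m) * (log|H| + log(1/delta)) with probability >= 1 - delta.
m = 5000
bound = (math.log(H_size) + math.log(1 / delta)) / m
print("risk bound with m = 5000:", round(bound, 4))
```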
Comments