Stochastic Approximation Methods-Notes-Part2-Flat Minima and Gradient Descent

Parameterized functions with similar training error can differ widely in generalization performance. However, a flat minimum may imply a low-complexity neural network structure, and some SGD methods have been shown to converge to flatter minima, which potentially makes the solutions of nonconvex optimization more robust. The first part of this note is a review of Flat Minima (Hochreiter and Schmidhuber, 1997). The second part contains an introduction to the properties and visualization of gradient descent algorithms.

1. Definitions

A flat minimum is a large connected region in weight space where the error remains approximately constant. It can be defined rigorously in terms of the tolerance error and the training error.

1.1 The General Task

We want to approximate an unknown function $f: \mathbb{R}^n \rightarrow \mathbb{R}^k$ (this matches the original paper; restricting to a one-dimensional target space would change nothing essential). More specifically, $f$ maps a finite set of possible inputs $X \subset \mathbb{R}^n$ to a finite set of possible outputs $Y \subset \mathbb{R}^k$, and it is assumed to describe the real data-generating process perfectly. The training information is given by a finite set $D_0 \subset D$, where $D = \{(x, f(x)) : x \in X\}$ is the full data set; we call $D_0$ the training data. Let the tuple $(x_p, y_p)$ denote the $p$-th element of the data set. Let $net(w)$ denote a realized neural network structure parameterized by $w$, where $w_{ij}$ is the weight on the connection from unit $i$ to unit $j$.

1.2 Training Error (Empirical risk)

The squared error of a realized structure over a given training set is defined as:

$E(net(w), D_0) = \sum_{p}\|y_p - o(w, x_p)\|^2$

where $o(w, x_p)$ denotes the output of $net(w)$ on input $x_p$ and $\|\cdot\|$ is the Euclidean norm.
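
To make the definition concrete, here is a small Python sketch of this training error; the toy linear "network" and the data are my own illustrative choices, not from the paper.

```python
import numpy as np

# Toy stand-in for the network output o(w, x); a real net(w) would be nonlinear.
def net_output(w, x):
    return w @ x

# Squared training error E(net(w), D_0) over the training set D_0.
def training_error(w, D0):
    return sum(np.sum((y - net_output(w, x)) ** 2) for x, y in D0)

# Example: two samples (x_p, y_p) with 2-d inputs and 1-d targets.
D0 = [(np.array([1.0, 2.0]), np.array([3.0])),
      (np.array([0.0, 1.0]), np.array([1.0]))]
w = np.array([[1.0, 1.0]])
print(training_error(w, D0))  # 0.0, since y_p = x_p[0] + x_p[1] here
```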

1.3 Tolerance Error

Note that we want to define a region in weight space where every parameter vector has small empirical risk and produces similar outputs. A structure with ‘small’ risk is one whose empirical risk is below a given constant $E_{tol}$, the tolerance error.

1.4 Boxes and Flat minima

We want to study the properties of a large connected region of acceptable minima, where each weight vector $w$ in this set leads to an identical structure. (The authors use ‘almost identical’ here, but I cannot see the precise meaning, since no metric on the space of network functions is defined in the article.) Since accurately approximating this region is a difficult task, we instead consider a box (or hypercuboid) in parameter space. Assume the parameter space is an $L$-dimensional Euclidean space. For each acceptable minimum $w$, its box $M_w$ is an $L$-dimensional hypercuboid centered at $w$, with each edge parallel to one weight axis. Let $\Delta w(X)$ collect the maximal values $\Delta w_{ij}(X)$ such that, for every $L$-dimensional vector $k$ whose components satisfy $|k_{ij}| \le \Delta w_{ij}(X)$, the following constraint holds:

$\mathbf{E}(net(w), net(w+k), X) = \sum_{x \in X} \|o(w,x) - o(w+k, x)\|^2 < \epsilon$

where $\epsilon$ represents the tolerable output change. Note that this assumes the network output $o(w, x)$ is continuous in $w$.

The box’s volume is defined as $V(\Delta w(X)) = 2^L \prod_{i,j}\Delta w_{ij}(X)$.
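
As a small numerical illustration (my own, assuming the per-weight tolerances $\Delta w_{ij}(X)$ have already been computed and are passed in as a flat array), the volume can be evaluated as follows:

```python
import numpy as np

# V(Δw(X)) = 2^L * prod_{i,j} Δw_ij(X), where L is the number of weights.
def box_volume(delta_w):
    delta_w = np.asarray(delta_w, dtype=float).ravel()
    L = delta_w.size
    return 2.0 ** L * np.prod(delta_w)

print(box_volume([0.1, 0.2, 0.05]))  # 2^3 * 0.001 = 0.008
```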

2. Information entropy of the minima

Let $B$ denote the number of bits required to describe the weights; the number of bits required to describe the $y_p$ given $w$ can be bounded by fixing $E_{tol}$. In other words, we can control the smoothness of the function by choosing the tolerance error. $B$ can be written as:

$B(w, X_0) = -\log\left(\frac{1}{2^L}V(\Delta w)\right) = -\sum_{i,j} \log(\Delta w_{ij})$

Of course, we want the entropy to be as low as possible. Therefore, the authors recommend adding a penalty term to the objective function:

$E(w, D_0) = E(net(w), D_0) + \lambda B(w, X_0)$

where $B(w, X_0) = \sum_{x_p \in X_0} B(w, x_p)$.
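
A minimal sketch of this penalized objective, assuming the training error and the tolerances $\Delta w_{ij}$ are already available (the numbers are illustrative only):

```python
import numpy as np

# Description length B = -sum_{i,j} log(Δw_ij), as derived above.
def description_length(delta_w):
    return -np.sum(np.log(np.asarray(delta_w, dtype=float)))

# Penalized objective E(net(w), D_0) + lambda * B.
def penalized_objective(train_error, delta_w, lam=0.1):
    return train_error + lam * description_length(delta_w)

# A flat minimum (large tolerances) is penalized less than a sharp one.
print(penalized_objective(0.5, [0.2, 0.2, 0.2]))        # about 0.98
print(penalized_objective(0.5, [0.001, 0.001, 0.001]))  # about 2.57
```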

3. An Introduction to Gradient Descent

Before introducing GD methods, let us first define the optimization problem. We assume the prediction function $f$ has a fixed form and is parameterized by a real vector $w \in \mathbb{R}^d$. Formally, for some given $f: \mathbb{R}^{d_x} \times \mathbb{R}^d \rightarrow \mathbb{R}^{d_y}$, the family of prediction functions is:

$\mathcal{H} = \{f(\cdot, w) : w \in \mathbb{R}^d\}$

Ideally, we want to minimize the expected risk. We assume there exists a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega=\mathcal{X}\times \mathcal{Y}$, $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$, and $\mathbb{P}$ is the joint probability measure of $(x, y) \in \Omega$. The expected risk is defined as:

$R(w) = \int_{\mathbb{R}^{d_x}\times\mathbb{R}^{d_y}}l(f(x;w),y)d\mathbb{P}(x,y) = \mathbb{E}(l(f(x;w),y))$

where $l(\cdot, \cdot)$ is the loss function, which measures the difference between predictions and labels.
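
Since $\mathbb{P}$ is usually unknown, the expected risk can only be estimated. The sketch below approximates $R(w)$ by Monte Carlo under an assumed data-generating process ($y = 2x +$ noise) with squared loss; all of these choices are illustrative, not part of the general setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w):        # prediction function f(x; w)
    return w * x

def loss(pred, y):  # squared loss l(f(x; w), y)
    return (pred - y) ** 2

# Monte Carlo estimate of R(w) = E[l(f(x; w), y)] under an assumed P(x, y).
def expected_risk(w, n_samples=100_000):
    x = rng.normal(size=n_samples)
    y = 2.0 * x + 0.1 * rng.normal(size=n_samples)
    return np.mean(loss(f(x, w), y))

print(expected_risk(1.5), expected_risk(2.0))  # the risk is smaller near w = 2
```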

The following statements are all based on the optimization problem we defined above. (Reference: An overview of gradient descent optimization algorithms, Sebastian Ruder)

It is worth noting that the convergence speed depends on both the data set and the optimizer's hyperparameters. Generally, if the optimizer converges slowly or occasionally diverges, we can adjust the initial step size.

3.1 Full Gradient Descent

Gradient descent is a first-order iterative method for finding a local minimum of a differentiable function, and it provides the core optimization methodology in machine learning. Given a function $f(x)$, the basic GD method can be written as:

$x^{t+1} = x^{t} -\eta \nabla f(x^{t})$

where $\eta$ is the learning rate.

In the full GD method, we have to run through all the samples in the training set to accomplish a single parameter update. Therefore, if the data set is very large, each update takes a lot of time. More specifically, the full GD method can be written as:

$w^{t+1} = w^{t} - \eta \nabla_w \sum_{i = 1}^{n}{l(f(x_i;w^t), y_i)}$
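
A minimal sketch of full (batch) GD on a least-squares problem; the synthetic data, and averaging the gradient over samples instead of summing, are my own choices for the example.

```python
import numpy as np

# Full GD: every update uses the gradient computed over the entire training set.
def full_gd(X, y, eta=0.1, n_steps=200):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)  # average gradient of the squared loss
        w -= eta * grad                          # w^{t+1} = w^t - eta * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -3.0]) + 0.01 * rng.normal(size=200)
print(full_gd(X, y))  # close to [1, -3]
```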

3.2 Stochastic Gradient Descent

The stochastic GD method originates from stochastic approximation; it is a special case of the Robbins-Monro algorithm. Stochastic GD can be written as:

$w^{t+1} = w^{t} - \eta \nabla_w l(f(x_i; w^t),y_i)$

In other words, we use only one sample (or a subset of samples) to update the parameters. Compared to the full GD method, SGD often converges much faster in practice. Note that $\{w^t\}_{t\ge0}$ is a stochastic process (a family of random variables defined on the same probability space), adapted to the filtration generated by the sampled data. Note also that we actually want to reduce the expected risk, which suggests that the expectation of the gradient is the ideal update direction. From this perspective, the variance of an estimate based on a single data point is much higher than that of the average over all samples. Therefore, SGD may exhibit a more oscillatory convergence than full GD (but one that is still good enough).
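
A sketch of SGD on the same kind of least-squares problem, using one randomly drawn sample per update; the data, step size, and iteration count are illustrative.

```python
import numpy as np

# SGD: each update uses the gradient of the loss at a single sample (x_i, y_i).
def sgd(X, y, eta=0.01, n_steps=5000, seed=1):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        i = rng.integers(len(y))               # draw one sample index
        grad = 2.0 * (X[i] @ w - y[i]) * X[i]  # single-sample gradient
        w -= eta * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -3.0]) + 0.01 * rng.normal(size=200)
print(sgd(X, y))  # noisy, but close to [1, -3]
```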

3.3 SGDM

Each update step in SGDM is a combination of the steepest descent direction and the most recent iterate displacement. The iteration process can be written as:

$v_k = \gamma v_{k-1} + \eta \nabla_w l(f(x_i; w_k),y_i)$
$w_{k+1} = w_k - v_k$

The momentum term increases updates for dimensions whose gradients consistently point in the same direction and reduces updates for dimensions whose gradients change direction. This method is most effective when the loss surface is scaled very differently across parameters, for example when the data are generated linearly with true weights $w_1 = 100$ and $w_2 = 0.1$.
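
A sketch of the momentum update on a poorly scaled quadratic, echoing the $w_1 = 100$ / $w_2 = 0.1$ example; for clarity the gradient here is deterministic rather than stochastic, and all constants are illustrative.

```python
import numpy as np

# Gradient of 0.5 * (100*(w[0]-1)^2 + 0.1*(w[1]-1)^2): very different curvatures.
def grad(w):
    return np.array([100.0 * (w[0] - 1.0), 0.1 * (w[1] - 1.0)])

w, v = np.zeros(2), np.zeros(2)
eta, gamma = 0.009, 0.9
for _ in range(1000):
    v = gamma * v + eta * grad(w)  # v_k = gamma * v_{k-1} + eta * g_k
    w = w - v                      # w_{k+1} = w_k - v_k
print(w)  # approaches [1, 1] in both the steep and the shallow direction
```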

3.4 AdaGrad

The AdaGrad method adjusts the learning rate with respect to the history of gradients: the effective learning rate gradually decreases as squared gradients accumulate. We have noted that an unsuitable learning rate may lead to oscillation when we approach a local optimum; this method updates the learning rate in an adaptive, per-parameter way. More specifically,

$ g_{k} = \nabla_w{l(f(x_k;w_k), y_k)}$
$ w_{k+1} = w_{k} - \frac{\eta}{\sqrt{G_k+ \epsilon}}\odot g_{k} $

where $G_{k}$ is the sum of the squares of the gradients up to the $k$-th update, and $\epsilon$ is a smoothing term that avoids division by zero. AdaGrad's main weakness is the accumulation of squared gradients in the denominator, which causes the learning rate to keep shrinking until the model effectively stops learning from the data. Moreover, although this behavior is still reasonable for convex problems, it is a bad property when we get stuck at a saddle point. However, AdaGrad performs well on sparse data sets.
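
A sketch of AdaGrad on the same poorly scaled toy gradient; the values are illustrative.

```python
import numpy as np

def grad(w):
    return np.array([100.0 * (w[0] - 1.0), 0.1 * (w[1] - 1.0)])

w = np.zeros(2)
G = np.zeros(2)                      # elementwise sum of squared gradients
eta, eps = 1.0, 1e-8
for _ in range(100):
    g = grad(w)
    G += g ** 2                      # G_k accumulates g^2 per parameter
    w -= eta / np.sqrt(G + eps) * g  # per-parameter step shrinks as G grows
print(w)  # both coordinates reach about 1 despite very different gradient scales
```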

3.5 RMSProp

AdaGrad accumulates all past squared gradients, so the learning rate keeps decreasing, which may make learning much slower over time. RMSProp does not shrink the learning rate as rapidly because it uses a decaying average instead. The decaying average $G_k$ at the $k$-th update is defined as:
$g_k = \nabla_w{l(f(x_k;w_k), y_k)}$
$G_k = \gamma G_{k-1} + (1-\gamma) g_k^2$

The parameters are then updated as follows:

$w_{k+1} = w_{k} - \frac{\eta}{\sqrt{G_k + \epsilon}}\odot g_k$

According to Geoffrey Hinton's lecture notes, $\gamma$ is set to 0.9.
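
A sketch of RMSProp on the same toy gradient, with $\gamma = 0.9$; the constants are illustrative.

```python
import numpy as np

def grad(w):
    return np.array([100.0 * (w[0] - 1.0), 0.1 * (w[1] - 1.0)])

w = np.zeros(2)
G = np.zeros(2)
eta, gamma, eps = 0.01, 0.9, 1e-8
for _ in range(2000):
    g = grad(w)
    G = gamma * G + (1 - gamma) * g ** 2  # decaying average of squared gradients
    w -= eta / np.sqrt(G + eps) * g
print(w)  # roughly [1, 1], oscillating within about eta of the optimum
```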

3.6 AdaDelta

AdaDelta makes some further improvements over AdaGrad and RMSProp (actually, AdaDelta and RMSProp were developed independently). It reduces AdaGrad's rapidly decreasing learning rate by using the same exponentially decaying average of past squared gradients as RMSProp. In addition, AdaDelta observes that the units of the gradient are $\frac{1}{\text{units of } w}$: if we assume WLOG that the risk function is unitless, then $\frac{\partial l}{\partial w}\,dw$ should also be unitless, which means the gradient has units of $\frac{1}{\text{units of } w}$. As a result, the RMSProp update does not have the same units as $w$. Therefore, the author suggested correcting this as follows:

$\Delta w_k = -\frac{\sqrt{W_{k-1} + \epsilon}}{\sqrt{G_k + \epsilon}}\odot g_k$
$W_k = \gamma W_{k-1} + (1-\gamma)\,\Delta w_k^2$
$w_{k+1} = w_{k} + \Delta w_k$

Here $W_k$ is the decaying average of the squared updates, so the learning rate disappears entirely and we no longer need to set a fixed value for it.
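
A sketch of AdaDelta on the same toy gradient; note the slow start while the accumulated squared updates are still near zero. The constants are illustrative.

```python
import numpy as np

def grad(w):
    return np.array([100.0 * (w[0] - 1.0), 0.1 * (w[1] - 1.0)])

w = np.zeros(2)
G = np.zeros(2)  # decaying average of squared gradients
W = np.zeros(2)  # decaying average of squared updates
gamma, eps = 0.9, 1e-6
for _ in range(5000):
    g = grad(w)
    G = gamma * G + (1 - gamma) * g ** 2
    step = -np.sqrt(W + eps) / np.sqrt(G + eps) * g  # Δw_k, same units as w
    w += step
    W = gamma * W + (1 - gamma) * step ** 2          # accumulate squared updates
print(w)  # moves toward [1, 1] without any explicit learning rate
```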

3.7 ADAM

Adaptive Moment Estimation (Adam) computes adaptive learning rates using estimates of the first and second moments of the gradients:
$M_t = \beta_1 M_{t-1} + (1-\beta_1) g_{t}$
$V_t = \beta_2 V_{t-1} + (1-\beta_2) g_{t}^2$

Since $M_t$ and $V_t$ are initialized at 0, these first- and second-moment estimates are biased towards zero, which can be problematic during the first few time steps and when the $\beta$'s are close to $1$. To deal with this, we use the bias-corrected estimates of $M_t$ and $V_t$, which can be written as:
$\hat{M_t} = \frac{M_t}{1-\beta_1^t}$
$\hat{V_t} = \frac{V_t}{1-\beta_2^t}$
where $\beta_1^t$ and $\beta_2^t$ are $\beta_1$ and $\beta_2$ raised to the power $t$.

Then the update can be written as:

$w_{t+1} = w_{t} - \eta\frac{\hat{M}_t}{\sqrt{\hat{V}_t} + \epsilon}$

The authors propose default values of $0.9$ for $\beta_1$, $0.999$ for $\beta_2$, and $10^{-8}$ for $\epsilon$.
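
A sketch of Adam with bias correction, using the suggested defaults; the toy gradient and step size are illustrative.

```python
import numpy as np

def grad(w):
    return np.array([100.0 * (w[0] - 1.0), 0.1 * (w[1] - 1.0)])

w = np.zeros(2)
M = np.zeros(2)
V = np.zeros(2)
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 3001):
    g = grad(w)
    M = beta1 * M + (1 - beta1) * g        # first-moment estimate
    V = beta2 * V + (1 - beta2) * g ** 2   # second-moment estimate
    M_hat = M / (1 - beta1 ** t)           # bias correction
    V_hat = V / (1 - beta2 ** t)
    w -= eta * M_hat / (np.sqrt(V_hat) + eps)
print(w)  # approximately [1, 1], up to oscillation on the order of eta
```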

3.8 Disadvantages of SGD and full GD

Note that the GD and SGD methods are sensitive to the learning rate, and it is not easy to choose an optimal one. Moreover, scheduling the learning rate (that is, how it evolves over iterations) is an even more difficult task, and a pre-defined schedule may not adapt to the characteristics of the data set. Last but not least, the loss function or the feasible region of the optimization problem may not be convex, which means the algorithm has to escape numerous saddle points and local minima.

