Stochastic Approximation Methods-Notes-Part 1-An Overview and the Robbins-Monro Algorithm

Generally, stochastic approximation methods are a family of iterative methods whose goal is to recover some property of a function that depends on random variables. Applications of stochastic approximation range from deep learning (e.g., SGD) to online learning methods.

1. An introduction to Stochastic Approximation Methods

In this section, I will show you some examples of stochastic approximation methods.

1.1 Example 1: Expectation

For a given probability space $(\Omega, \mathcal{F}, P)$, we want to compute the expectation $\mathbb{E}^P[f(\omega)]$, where $f$ is a measurable function and $\omega \in \Omega$. A reasonable solution is to estimate the expectation via the arithmetic average. The following equations provide an iterative method for this problem.

Let $E_n$ be the estimate of $\mathbb{E}^P[f(\omega)]$ given $n$ observations.

$n \times E_n = \sum_{i = 1}^n f(\omega_i)$

$ (n+1) \times E_{n+1} = \sum_{i = 1}^{n+1} f(\omega_i) = n \times E_n + f(\omega_{n+1}) $

$ \implies (n + 1) \times E_{n+1} = (n+1) \times E_n + f(\omega_{n+1}) - E_n $

$ \implies E_{n+1} = E_n + \frac{1}{n+1} [f(\omega_{n+1}) - E_n] $
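
To make this concrete, here is a minimal Python sketch of the recursion above; the uniform sample stream is just an illustrative choice, not part of the original derivation.

```python
import random

def running_mean(samples):
    """Iterative mean estimate: E_{n+1} = E_n + (f(w_{n+1}) - E_n) / (n + 1)."""
    estimate = 0.0
    for n, value in enumerate(samples):
        estimate += (value - estimate) / (n + 1)
    return estimate

# Illustrative stream: f(w) = w with w ~ Uniform(0, 1), so E[f(w)] = 0.5.
stream = (random.random() for _ in range(100_000))
print(running_mean(stream))  # should be close to 0.5
```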

1.2 Example 2: SGD

In the framework of stochastic gradient descent (SGD), or online gradient descent, the true gradient $\nabla Q(\omega)$ is approximated by the gradient at a single sample. Let $\omega_n$ be the parameter estimate at the $n$th iteration and let $Q_i$ denote the loss at the sampled data point; then

$\omega_{n+1} = \omega_n - \lambda \nabla Q_i(\omega_n)$

where $\lambda$ is the learning rate.
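
As a minimal sketch, the following runs plain SGD on a made-up one-dimensional quadratic objective $Q(\omega) = \frac{1}{2}\mathbb{E}[(\omega - Y)^2]$ with $Y \sim \mathcal{N}(3, 1)$; the single-sample gradient is $\omega - y$ for one draw $y$. The objective and parameters here are illustrative assumptions.

```python
import random

def sgd(grad_sample, w0, lr, steps):
    """Plain SGD: replace the true gradient with a single-sample gradient."""
    w = w0
    for _ in range(steps):
        w -= lr * grad_sample(w)
    return w

# Toy objective Q(w) = E[(w - Y)^2] / 2 with Y ~ N(3, 1); the minimizer is w = 3.
grad_sample = lambda w: w - random.gauss(3.0, 1.0)
print(sgd(grad_sample, w0=0.0, lr=0.01, steps=50_000))  # should be close to 3
```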

There are several popular extensions of the SGD method, including Momentum, Nesterov Momentum, and Adam.
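
For instance, the momentum extension replaces the raw single-sample gradient with an exponentially weighted velocity. Here is a minimal sketch on the same toy objective; the update form and hyperparameters are illustrative choices, not a canonical implementation.

```python
import random

def sgd_momentum(grad_sample, w0, lr=0.01, beta=0.9, steps=50_000):
    """SGD with momentum: v <- beta * v + g;  w <- w - lr * v."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v + grad_sample(w)
        w -= lr * v
    return w

# Same toy objective as above: minimize E[(w - Y)^2] / 2, Y ~ N(3, 1).
grad_sample = lambda w: w - random.gauss(3.0, 1.0)
print(sgd_momentum(grad_sample, w0=0.0))  # should be close to 3
```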

1.3 Example 3: Q-Learning in reinforcement learning

In the field of reinforcement learning, the function $Q: S \times A \to \mathbb{R}$ relates the reward to states and actions. At each time $t$, the agent selects an action $a_t$, receives a reward $r_t$ from the environment, and enters the next state $s_{t+1}$. Discussions of the convergence of RL cannot avoid the Q function or the value function.

$$Q^{new}(s_t, a_t) = (1 - \alpha)Q^{old}(s_t, a_t) + \alpha (r_t + \gamma \max \limits_a Q^{old}(s_{t+1}, a))$$

where $\alpha$ is the learning rate, and $\gamma$ is the discount factor.
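
A minimal sketch of one tabular Q-learning update, assuming integer-coded states and actions; the table, states, and reward here are hypothetical placeholders.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))."""
    target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

Q = defaultdict(float)  # Q-table with default value 0
q_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
print(Q[(0, 1)])  # 0.1 * (1.0 + 0.99 * 0) = 0.1
```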

2. Robbins-Monro Algorithm

In 1951, Robbins and Monro developed a methodology for solving a root-finding problem for a function that depends (or partially depends) on a random variable. The root-finding problem is a generalization of several important topics, including optimization and extremum finding.

2.1 Root finding problem without uncertainty

Let $M(x)$ be a given function and $\alpha$ a given constant such that the equation

$M(x) = \alpha$

has a unique root $x = \theta$. In any iterative method for this root-finding problem, we pick one or more initial values $x_{initial}$ and then successively obtain new values as certain functions of the previously obtained ones. The most famous such algorithm is the Newton–Raphson method.
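
For reference, here is a minimal Newton–Raphson sketch applied to $f(x) = M(x) - \alpha$; the cubic $M(x) = x^3$ and the level $\alpha = 8$ are illustrative choices.

```python
def newton_raphson(f, df, x0, tol=1e-10, max_iter=100):
    """Deterministic root finding: x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Solve M(x) = alpha with M(x) = x^3 and alpha = 8; the unique root is theta = 2.
alpha = 8.0
print(newton_raphson(lambda x: x**3 - alpha, lambda x: 3 * x**2, x0=1.0))
```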

2.2 A stochastic generalization of the root finding problem

Let’s consider how to solve this problem when $M(x)$ is unknown for some reason. Let’s assume the randomness of the function is captured by the probability space $(\Omega, \mathcal{F}, P)$, and that there is a random variable $Y$ on $(\Omega, \mathcal{F}, P)$ taking values in a state space $(S, \mathcal{L})$. In this blog, we assume the state space is a $d$-dimensional Euclidean space equipped with the $\sigma$-field of Borel sets. In this algorithm, $M(x)$ is a deterministic function defined as follows:

$M(x) = \int_{y \in S }ydH(y|x)$

where $H(y|x) = \mathbb{P}[Y(\omega) \le y \,|\, x]$ ($\omega \in \Omega$) is the distribution function of $Y$ given the deterministic level $x$.

In other words, $M(x)$ is the expectation of $Y$ for the given $x$. Although we know nothing about the nature of the function $M(x)$ or the distribution function $H(y|x)$, we assume (meaning this assumption will serve as a known condition in the proof) that the equation $M(x) = \alpha$ has a unique solution $x = \theta$. The Robbins-Monro algorithm introduced a method to estimate $\theta$ by making successive observations of $Y$ at levels $x_1, x_2, …$, which are determined sequentially in accordance with some definite experimental procedure.

The idea of this algorithm is to construct a Markov chain via a transition probability between successive states such that the difference between $\lim \limits_{k \to \infty}x_k$ and $\theta$ can be controlled. We can summarize this idea as finding a sequence $\{x_k\}$ such that

$\lim \limits_{k \to \infty}\mathbb{E}(x_k - \theta)^2 = 0$

This criterion is convergence in mean square (i.e., in $\mathcal{L}^2$), which implies convergence in probability of $x_n$ to $\theta$.

2.3 The proof of convergence theorem

The computational origin of this algorithm is the first example (expectation estimation) in the Introduction section. If we review the expectation estimation problem in the Robbins-Monro framework, then $x$ denotes the expectation estimate, the function is $M(x) = \mathbb{E}^P[Y - x]$, and we want to solve the equation $M(x) = 0$. I only present a brief proof to show the idea; you may read the original paper for more details.

We want to prove the following theorem:

Condition 1: $H(y|x)$ is a distribution function in $y$ for every $x$, and there exists a positive constant $C$ such that $\mathbb{P}[|Y(\omega)| \le C \,|\, x] = \int_{-C}^{C}dH(y|x) = 1$.

Condition 2: There exist finite constants $\alpha$ and $\theta$ such that $M(x) \le \alpha$ for $x \lt \theta$, and $M(x) \ge \alpha$ for $x \gt \theta$.

Condition 3: Let $\{a_n\}$ be a fixed sequence of positive constants such that $0 \lt \sum_{n = 1}^{\infty} a_n^2 = A \lt \infty$ and $\sum_{n = 1}^{\infty} a_n = \infty$ (for example, $a_n = 1/n$).

Consider a Markov chain $\{x_n\}$ that starts from an arbitrary constant $x_1$ and evolves as

$x_{n+1} = x_n + a_n(\alpha - y_n)$

where $y_n$ is the value of random variable $Y$ given $x_n$ as we have defined in section 2.2.

If Conditions 1, 2, and 3 hold and $b_n = \mathbb{E}(x_n - \theta)^2$, then $\lim \limits_{n \to \infty} b_n = 0$.

Proof:

$b_{n+1} = \mathbb{E}(x_{n+1} - \theta)^2 = \mathbb{E}[\mathbb{E}[(x_{n+1} - \theta)^2 \,|\, x_n]]$

$b_{n+1} = \mathbb{E}\left[\int_{-\infty}^{\infty}\{(x_n - \theta) - a_n(y - \alpha)\}^2 \, dH(y|x_n)\right]$

Setting $d_n = \mathbb{E}[(x_n - \theta)(M(x_n) - \alpha)]$ and $e_n = \mathbb{E}[\int_{-\infty}^{\infty} (y - \alpha)^2 \, dH(y|x_n)]$, expanding the square, and integrating term by term (the cross term yields $-2a_n(x_n - \theta)(M(x_n) - \alpha)$ because $\int_{-\infty}^{\infty} y \, dH(y|x_n) = M(x_n)$), we obtain

$b_{n+1} - b_n = a_n^2e_n - 2a_nd_n$

Summing these differences from $n = 1$, $b_{n+1}$ can be rewritten as

$b_{n+1} = b_1 + \sum_{i = 1}^n a_i^2e_i - 2\sum_{i = 1}^n a_id_i$

Note that for every $n$, $d_n \ge 0$ (by Condition 2, $x_n - \theta$ and $M(x_n) - \alpha$ always share the same sign) and $0 \le e_n \le (C + |\alpha|)^2$ (by Condition 1, $|y| \le C$ with probability 1).

Since every $b_{n+1} \ge 0$ and $\sum_{i = 1}^{\infty} a_i^2e_i \le A(C + |\alpha|)^2 \lt \infty$ by Condition 3, the nondecreasing partial sums $\sum_{i = 1}^n a_id_i$ are bounded above, so $\sum_{i = 1}^{\infty} a_id_i$ must converge (and hence, by the identity above, $b_n$ converges to some limit $b \ge 0$).

By the definition of the Markov chain $\{x_n\}$,

$x_n = x_1 + \sum_{i = 1}^{n - 1} a_i (\alpha - y_i)$

$\implies |x_n - \theta| \le |x_1 - \theta| + \sum_{i = 1}^{n - 1} a_i |\alpha - y_i| \le |x_1 - \theta| + \sum_{i = 1}^{n - 1} a_i (C + |\alpha|)$

The two inequalities follow from the triangle inequality and from $|y_i| \le C$ with probability 1 (Condition 1).

Let $A_n = |x_1 - \theta| + \sum_{i = 1}^{n - 1} a_i (C + |\alpha|)$; note that $\mathbb{P}\{|x_n - \theta| \le A_n\} = 1$.

Now, we want to find a sequence $\{k_n\}$ of nonnegative constants such that $d_n \ge k_nb_n$ and $\sum_{n = 1}^{\infty}a_nk_n = \infty$. The goal of this step is to prove $\lim \limits_{n \to \infty}b_n = 0$: if $b_n$ converged to some limit $b \gt 0$ instead, then $d_n \ge k_nb_n$ would give $\sum_{n = 1}^{\infty}a_nd_n \ge \sum_{n = 1}^{\infty}a_nk_nb_n = \infty$, contradicting the convergence of $\sum_{i = 1}^{\infty} a_id_i$ established above.

Set $\bar{k}_n = \inf\limits_{0 \lt |x - \theta| \le A_n} \left[\frac{M(x) - \alpha}{x - \theta}\right]$.

On one hand, $d_n \ge \int_{|x - \theta| \le A_n}\bar{k}_n(x - \theta)^2 \, dP_n(x) = \bar{k}_nb_n$, where $P_n$ denotes the distribution of $x_n$; hence $d_n \ge k_nb_n$ holds with $k_n = \bar{k}_n$.

On the other hand, we can assume without loss of generality that $\bar{k}_n \ge \frac{K}{A_n}$ for some given $K \gt 0$ here. Moreover, Condition 3 implies that

$\sum_{n = 2}^{\infty} \frac{a_n}{a_1 +\cdots+a_{n-1}} = \infty $

and, since $\sum_{i = 1}^{\infty} a_i = \infty$ eventually dominates the fixed constant $|x_1 - \theta|$, for sufficiently large $n$:

$A_n \le 2\sum_{i = 1}^{n - 1} a_i (C + |\alpha|)$

These two conclusions indicate that

$a_n\bar{k}_n \ge \frac{a_nK}{A_n} \ge \frac{a_nK}{2(C + |\alpha|)\sum_{i = 1}^{n - 1} a_i}$

which, summed over $n$, implies that $\sum_{n = 1}^{\infty}a_n\bar{k}_n = \infty$, so the second requirement on $\{k_n\}$ holds.

$\therefore \lim \limits_{n \to \infty}b_n = 0$ holds. The convergence is proved.
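
To see the theorem in action, here is a minimal simulation sketch of the chain $x_{n+1} = x_n + a_n(\alpha - y_n)$. The linear $M(x) = x$ with bounded uniform noise and the target level $\alpha = 2$ are illustrative assumptions, and $a_n = 1/n$ satisfies Condition 3; strictly, Condition 1 asks for a uniform bound on $Y$, so this toy setup only approximates the hypotheses while illustrating the mechanics.

```python
import random

def robbins_monro(sample_y, alpha, x1, n_steps):
    """Robbins-Monro chain x_{n+1} = x_n + a_n * (alpha - y_n) with a_n = 1/n."""
    x = x1
    for n in range(1, n_steps + 1):
        y = sample_y(x)       # noisy observation with E[Y | x] = M(x)
        x += (alpha - y) / n  # a_n = 1/n: sum a_n diverges, sum a_n^2 converges
    return x

# Illustrative setup: M(x) = x (nondecreasing through theta, as in Condition 2),
# Y = x + bounded uniform noise; the unique root of M(x) = alpha is theta = 2.
sample_y = lambda x: x + random.uniform(-1.0, 1.0)
print(robbins_monro(sample_y, alpha=2.0, x1=0.0, n_steps=100_000))  # near theta = 2
```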
