Compared to MCMC, variational inference (VI) is a faster method for approximating difficult-to-compute posterior distributions. VI casts inference as an optimization problem, while MCMC is an asymptotically exact sampling method.
1. Notation
Consider a joint density of latent variables $\mathbf{z}=z_{1:m}$ and observations $\mathbf{x}=x_{1:n}$:
$$\mathbb{P}(\mathbf{x}, \mathbf{z}) = \mathbb{P}(\mathbf{x}|\mathbf{z}) \mathbb{P}(\mathbf{z})$$
The above equation describes the data-generating process. The inference process can be written in the following form:
$$\mathbb{P}(\mathbf{x}, \mathbf{z}) = \mathbb{P}(\mathbf{z}|\mathbf{x}) \mathbb{P}(\mathbf{x})$$
The job is to find the posterior distribution of the latent variable $\mathbf{z}$. MCMC constructs an ergodic Markov chain on $\mathbf{z}$ whose stationary distribution is the posterior $\mathbb{P}(\mathbf{z}|\mathbf{x})$, while VI considers the following optimization problem:
$$q^{*}(\mathbf{z}) = \arg\min_{q \in \mathcal{Q}} d(q(\mathbf{z}), p(\mathbf{z}|\mathbf{x}))$$
where $\mathcal{Q}$ is a family of approximating densities and $d$ is a function that measures the difference between two distributions. Common choices of $d$ are the KL divergence and the Wasserstein distance. Note that VI cannot guarantee that we are sampling from the exact posterior distribution, only from one close to it. In the rest of this note, $d$ is the KL divergence.
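For concreteness, one common choice of variational family (an assumption for illustration; any tractable family works) is a diagonal Gaussian:

$$\mathcal{Q} = \Big\{ q(\mathbf{z}) = N\big(\mathbf{z};\, \mu, \operatorname{diag}(\sigma^2)\big) : \mu \in \mathbb{R}^m,\ \sigma \in \mathbb{R}^m_{+} \Big\}$$

Every member is easy to sample from and to evaluate, which is what makes the optimization over $\mathcal{Q}$ tractable.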
2. KL divergence
Note that KL divergence is not a true ‘distance’ (see the drawbacks below), although it is always non-negative, and it is zero only when the two distributions are the same almost surely. The following proof of non-negativity is an exercise in the application of variational calculus.
$\mathbf{Proof:}\ KL(q(z) \,\|\, p(z|x)) \ge 0$
Here $g$ plays the role of $q(z)$ and $f$ the role of $p(z|x)$. Since $g$ and $f$ are pdfs (measurable functions) over $(\Omega, \sigma_{\Omega})$, we can freely exchange derivative and integral.
Let’s rewrite this problem as:
Given $\int_{x \in \Omega} g(x)dx = 1$:
$$\max \int_{x \in \Omega} g(x) log(f(x))dx$$
$$s.t. \int_{x \in \Omega} f(x)dx = 1$$
The Lagrangian for this optimization problem is:
$$F(\lambda) = \int_{x\in\Omega} g(x)log(f(x))dx - \lambda \Big(\int_{x \in \Omega}f(x)dx - 1\Big)$$
For the optimal $\bar{f}$, if we add a perturbation $\theta y(x)$ to this function:
$$F(\theta, \lambda) = \int_{x\in\Omega} g(x)log(\bar{f}(x) + \theta y(x))dx - \lambda \Big(\int_{x \in \Omega}\big(\bar{f}(x)+ \theta y(x)\big)dx - 1\Big)$$
The first order condition can be written as:
$$\frac{\partial F}{\partial \theta}\Big | _{\theta = 0} = 0 \Leftrightarrow \forall y \quad \int (\frac{g(x)}{\bar{f}(x)} - \lambda)y(x)dx = 0 \Leftrightarrow \frac{g(x)}{\bar{f}(x)} = \lambda \quad a.s.$$
$$\frac{\partial F}{\partial \lambda} \Big |_{\theta = 0} = 0 \Leftrightarrow \int \bar{f}(x)dx = \int \frac{1}{\lambda} g(x)dx = 1 \Leftrightarrow \lambda = 1$$
Therefore, the necessary condition for reaching the maximum of $\int_{x \in \Omega} g(x) log(f(x))dx$ is $\bar{f} = g$ a.s.
Based on the definition of KL divergence, we can now conclude $KL(q(z) \,\|\, p(z|x)) \ge 0$; the last step is spelled out below.
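To make the last step explicit, apply the lemma with $g = q(\cdot)$ and $f = p(\cdot|x)$: the integral $\int q(z)log(f(z))dz$ is maximized at $f = q$, so

$$KL(q(z) \,\|\, p(z|x)) = \int q(z)log(q(z))dz - \int q(z)log(p(z|x))dz \ge 0$$

with equality exactly when $q(z) = p(z|x)$ almost surely.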
KL divergence is closely related to the concept of entropy: high entropy means chaos (high uncertainty), while low entropy means low uncertainty.
KL divergence has the following drawbacks:
1) KL divergence puts little emphasis on regions where $q$ has little mass (the expectation is taken under $q$, and the log amplifies the penalty where $q \gg p$), which means the optimal $q$ tends to underestimate the variance of the posterior.
2) KL divergence is not symmetric, so it is not a metric; the sketch after this list illustrates the asymmetry numerically.
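As a quick numerical illustration of the asymmetry, here is a minimal sketch in Python using the standard closed-form KL divergence between two univariate Gaussians (the particular means and standard deviations are arbitrary choices):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL(N(mu1, s1^2) || N(mu2, s2^2))."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_gauss(0.0, 1.0, 1.0, 2.0))  # ~0.443
print(kl_gauss(1.0, 2.0, 0.0, 1.0))  # ~1.307: swapping arguments changes the value
```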
3. ELBO
According to the definition:
$KL(q(z) \,\|\, p(z|x)) = \mathbb{E}_{q}[log(q(z))] - \mathbb{E}_{q}[log(p(z|x))] = \mathbb{E}_{q}[log(q(z))] - \mathbb{E}_{q}[log(p(z, x))] + log(p(x))$
We can further conclude that $log(p(x)) = KL(q(z) \,\|\, p(z|x)) + ELBO(q)$, which means $log(p(x)) \ge ELBO(q)$ (a lower bound for $log(p(x))$).
Note that in VI, the KL divergence cannot be $KL(p(z|x) \,\|\, q(z))$, since the posterior distribution is difficult to sample from or approximate, whereas given a variational family it is easy to sample from $q$. In the computation, the reparameterization trick (or Monte Carlo integration, for the applied-math crowd) will be helpful. Therefore, minimizing the difference between $q$ and the posterior distribution is equivalent to maximizing the ELBO:
$$ELBO(q) = \mathbf{E}_{q}[log(p(z, x))] - \mathbf{E}_{q}[log(q(z))]$$ (recall the EM algorithm)
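As a sanity check on the bound, here is a minimal Monte Carlo sketch for an assumed toy model ($p(z) = N(0, 1)$, $p(x|z) = N(z, 1)$, a single datum $x = 1$), chosen because the exact evidence $p(x) = N(x; 0, 2)$ and the exact posterior $N(1/2, 1/2)$ are available in closed form:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Assumed toy model: p(z) = N(0, 1), p(x|z) = N(z, 1), single datum x = 1.
x = 1.0
log_px = stats.norm(0.0, np.sqrt(2.0)).logpdf(x)  # exact log evidence

def elbo(mu, sigma, S=100_000):
    """Monte Carlo estimate of E_q[log p(z, x) - log q(z)] with z ~ q."""
    z = rng.normal(mu, sigma, size=S)
    log_joint = stats.norm(0.0, 1.0).logpdf(z) + stats.norm(z, 1.0).logpdf(x)
    log_q = stats.norm(mu, sigma).logpdf(z)
    return np.mean(log_joint - log_q)

print(log_px)                    # ~ -1.52
print(elbo(0.5, np.sqrt(0.5)))   # q = exact posterior: ELBO equals log p(x)
print(elbo(0.0, 1.0))            # any other q gives a strictly smaller value
```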
Note that we need a computable gradient of the ELBO if we want to use the first-order condition of maximization. The most important trick here is Monte Carlo integration. Let $\Phi$ denote the parameters of the variational family $\mathcal{Q}$. The gradient computation can be written as:
$$\nabla_{\Phi}ELBO = \nabla_{\Phi} \mathbf{E}_{q}[log(p(x, z))] - \nabla_{\Phi} \mathbf{E}_{q}[log(q(z))]$$
$$= \nabla_{\Phi} \int [log(p(x,z)) - log(q(z))]q(z) dz = \int \nabla_{\Phi}\Big([log(p(x,z)) - log(q(z))]q(z)\Big) dz$$
$$=\int \nabla_{\Phi} \Big(log(p(x, z)) - log(q(z))\Big)q(z)dz + \int \nabla_{\Phi} \Big(q(z)\Big)[log(p(x, z)) - log(q(z))]dz$$
Note that the first term vanishes: $log(p(x, z))$ does not depend on $\Phi$, and the remaining piece is zero by the following fact:
$$\mathbb{E}_{q}\Big[\nabla_{\Phi} log(q(z))\Big] = \mathbb{E}_{q}\Big[\frac{\nabla_{\Phi} q(z)}{q(z)} \Big] = \int \frac{\nabla_{\Phi} q(z)}{q(z)} q(z) dz = \nabla_{\Phi}\int q(z) dz = \nabla_{\Phi} 1 = 0$$
Therefore, the first term is 0. The gradient of ELBO can be further written as:
$$\nabla_{\Phi}ELBO = \int \nabla_{\Phi} \Big(q(z)\Big)[log(p(x, z)) - log(q(z))]dz$$
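As an aside, the score-function identity above is easy to verify numerically; here is a minimal sketch for a Gaussian $q$ (the mean, standard deviation, and sample size are arbitrary choices), whose score with respect to the mean is $(z - \mu)/\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
z = rng.normal(mu, sigma, size=100_000)

# Score with respect to the mean: d/dmu log q(z) = (z - mu) / sigma^2.
score_mu = (z - mu) / sigma**2
print(score_mu.mean())  # close to 0, as the identity predicts
```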
Given the equation:
$$\nabla_{\Phi}q(z) = q(z) \nabla_{\Phi} log(q(z))$$
The gradient can be written in the form of an expectation:
$$\nabla_{\Phi}ELBO = \int q(z) \nabla_{\Phi} log(q(z))[log(p(x, z)) - log(q(z))]dz = \mathbf{E}_{q}\Big(\nabla_{\Phi} log(q(z))[log(p(x, z)) - log(q(z))]\Big)$$
$$\simeq \frac{1}{S} \sum_{i=1}^S \nabla_{\Phi} log(q(z_i))[log(p(x, z_i)) - log(q(z_i))], \quad z_i \sim q(z)$$
The above process is the vanilla BBVI (black-box variational inference) algorithm; a sketch is given below. Note that we can further decrease the variance of the estimator with variance-reduction methods such as control variates.
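Here is a minimal end-to-end sketch of the vanilla estimator in Python. The toy model is an assumption for illustration ($z \sim N(0,1)$, $x_i|z \sim N(z,1)$), chosen so the exact posterior is available for comparison; the variational family, step size, and sample size are arbitrary choices, and the mean of the weight is subtracted as a simple baseline (a basic control variate), which keeps the estimator unbiased thanks to the score-function identity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: z ~ N(0, 1), x_i | z ~ N(z, 1).
# Exact posterior: N(sum(x) / (n + 1), 1 / (n + 1)).
x = rng.normal(2.0, 1.0, size=20)
n = len(x)

def log_joint(z):
    # log p(x, z) up to additive constants (constants do not bias the gradient).
    return -0.5 * z**2 - 0.5 * np.sum((x[None, :] - z[:, None])**2, axis=1)

# Variational family q(z) = N(mu, sigma^2), with Phi = (mu, log_sigma).
mu, log_sigma = 0.0, 0.0
lr, S = 1e-3, 500

for step in range(5000):
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=S)
    log_q = -0.5 * ((z - mu) / sigma)**2 - np.log(sigma)  # up to a constant
    weight = log_joint(z) - log_q
    weight -= weight.mean()  # simple baseline: a basic control variate
    # Scores of log q with respect to mu and log_sigma.
    score_mu = (z - mu) / sigma**2
    score_ls = ((z - mu) / sigma)**2 - 1.0
    mu += lr * np.mean(score_mu * weight)
    log_sigma += lr * np.mean(score_ls * weight)

print(mu, np.exp(log_sigma))             # fitted variational parameters
print(x.sum() / (n + 1), (n + 1)**-0.5)  # exact posterior mean and sd
```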
Reference
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859-877.