Keywords: support of a distribution, forward KL, reverse KL, absolute continuity (these properties greatly influence the behavior of the approximation)
1. A review of KL divergence
Let $P$ and $Q$ be probability measures over a set $\chi$, with $P$ absolutely continuous with respect to $Q$, written $P \ll Q$, i.e. for every measurable set $A$, $Q(A) = 0 \Rightarrow P(A) = 0$. Then the KL divergence can be defined as:
$D_{KL}(P\|Q) = \int_{\chi} \log\left(\frac{dP}{dQ}\right) P(dx)$
where $\frac{dP}{dQ}$ is the Radon–Nikodym derivative, which exists because $P \ll Q$. Since $P(dx)$ and $dP$ are two notations for integration against the measure $P$, this is equivalently
$$D_{KL}(P\|Q) = \int_{\chi} \log\left(\frac{dP}{dQ}\right) dP = \mathbb{E}_{P}\left[\log \frac{dP}{dQ}\right]$$
When $P$ and $Q$ have densities $p$ and $q$ with respect to a common base measure, $\frac{dP}{dQ} = \frac{p}{q}$, recovering the familiar form $\mathbb{E}_{P}[\log \frac{p}{q}]$.
Note that this formula also covers discrete distributions: with the counting measure on $\chi$, the integral becomes a sum and $\frac{dP}{dQ}$ reduces to the ratio of probability mass functions $P(x)/Q(x)$.
The properties of KL divergence include:
1) Asymmetry: in general, $KL(P\|Q) \neq KL(Q\|P)$.
2) Non-negativity (see Probabilistic-Programming-Notes-1). Moreover, $KL(P\|Q) = 0$ implies $P = Q$, a.s.
3) Absolute continuity of $P$ w.r.t. $Q$ is required for the definition; when $P \not\ll Q$, the divergence is conventionally $+\infty$.
The base of the logarithm is $e$ (nats), or 2 if information is measured in bits.
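To make the definition and properties 1) and 2) concrete, here is a minimal sketch in plain NumPy (the function name `kl` and the toy distributions are our own) that evaluates the discrete form of the divergence:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q) in nats; assumes supp(p) is contained in supp(q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0, by the convention 0 log 0 = 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q))  # ~0.0253 nats
print(kl(q, p))  # ~0.0258 nats: asymmetry, KL(p||q) != KL(q||p)
print(kl(p, p))  # 0.0: the divergence vanishes exactly when the distributions agree
```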
2. Forward and reverse KL in Probabilistic Programming
Let’s approximate a target distribution $p$ with a distribution $q$ parameterized by $\phi$. Define the forward KL as $KL(p\|q)$ and the reverse KL as $KL(q\|p)$.
2.1 The properties of forward KL
Because the expectation is taken under $p$, the forward KL forces $q$ to be positive wherever $p$ is, but the support of $p$ places no upper constraint on the support of $q$: an event with zero probability under $p$ may still have positive probability under $q$. Minimizing the forward KL therefore tends to overestimate the variance of $p$ (mass-covering behavior).
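One standard way to see this mass-covering behavior: when $q$ is Gaussian, minimizing the forward KL over its mean and variance reduces to moment matching, $\mu_q = \mathbb{E}_p[X]$ and $\sigma_q^2 = \mathrm{Var}_p[X]$. A minimal sketch (the bimodal target and sample size are our own choices):

```python
import numpy as np

# Target p: bimodal mixture 0.5*N(-3, 1) + 0.5*N(+3, 1).
# For a Gaussian q, minimizing the forward KL(p || q) amounts to
# matching the mean and variance of p (moment matching).
rng = np.random.default_rng(0)
n = 100_000
x = np.where(rng.random(n) < 0.5,
             rng.normal(-3.0, 1.0, n),
             rng.normal(+3.0, 1.0, n))

mu_q, sigma_q = x.mean(), x.std()
print(mu_q, sigma_q)  # mu_q ~ 0, sigma_q ~ sqrt(1 + 3^2) ~ 3.16

# The fitted q = N(0, 3.16^2) straddles both modes: it places mass in the
# near-zero-probability valley around x = 0 and is much wider than either
# component (std 1), i.e. it overestimates the spread of p.
```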
The properties of forward KL in mean-field variational inference
Conclusion:
Consider a factored approximation $q(x, y) = q_x(x)q_y(y)$ to a joint distribution $p(x, y)$. The forward KL reaches its minimum when $q_x(x) = p(x)$ and $q_y(y) = p(y)$, i.e. when the factors equal the marginals of $p$.
Proof:
By definition, $KL(p\|q)$ can be written as:
$KL(p\|q) = \mathbb{E}_{p}[\log p(x, y)] - \mathbb{E}_{p}[\log q_x(x) + \log q_y(y)]$
Note that $\mathbb{E}_{p}[\log p(x, y)]$ is a constant. Therefore, minimizing $KL(p\|q)$ is equivalent to maximizing $\mathbb{E}_{p}[\log q_x(x) + \log q_y(y)]$.
Note that we are searching for the optimal $q_x(\cdot)$ subject to the constraint $\int_{\Omega_x} q_x(x)\, dx = 1$. For any integrable function $h(x)$ and real number $\theta$, define the function $f(\theta, \lambda)$ (the $\log q_y(y)$ term does not depend on $\theta$ and is dropped):
$f(\theta, \lambda) = \iint\limits_{\Omega} \log [q_x(x) + \theta h(x)]\, p(x, y)\, dx\, dy - \lambda \left(\int [q_x(x) + \theta h(x)]\, dx - 1\right)$
If $q_x(x)$ is optimal, then $f(\theta, \lambda)$ attains its optimum at $\theta = 0$ for every choice of $h$. Therefore,
$\frac{\partial f}{\partial \theta} \Big|_{\theta = 0} = \iint\limits_{\Omega} \frac{h(x)}{q_x(x)}\,p(x, y)\,dx\,dy - \lambda \int h(x)\, dx = \int h(x)\left[\frac{p(x)}{q_x(x)} - \lambda\right] dx = 0$
where $p(x) = \int_{\Omega_y} p(x, y)\, dy$ is the marginal of $p$. Since $h(x)$ is arbitrary, we conclude $\frac{p(x)}{q_x(x)} - \lambda = 0$, i.e. $p(x) = \lambda\, q_x(x)$.
Besides, the first-order condition over $\lambda$ recovers the normalization constraint:
$\frac{\partial f}{\partial \lambda} \Big|_{\theta = 0} = -\left(\int q_x(x)\, dx - 1\right) = 0 \quad\Rightarrow\quad \int q_x(x)\, dx = \frac{1}{\lambda} \int p(x)\, dx = \frac{1}{\lambda} = 1$
Therefore $\lambda = 1$, and $q_x(x) = p(x)$. By the same argument, $q_y(y) = p(y)$.
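The conclusion is easy to verify numerically on a toy discrete joint. A sketch (the 2×2 table and candidate search are our own construction) confirming that no factored $q$ beats the marginals:

```python
import numpy as np

# A small discrete joint p(x, y) with correlated coordinates.
p = np.array([[0.30, 0.10],
              [0.05, 0.55]])
px, py = p.sum(axis=1), p.sum(axis=0)  # marginals p(x), p(y)

def forward_kl(qx, qy):
    """KL(p || qx (x) qy) for a factored candidate q(x, y) = qx(x) qy(y)."""
    return float(np.sum(p * np.log(p / np.outer(qx, qy))))

best = forward_kl(px, py)  # value at the claimed minimizer
rng = np.random.default_rng(1)
for _ in range(10_000):
    qx, qy = rng.dirichlet(np.ones(2)), rng.dirichlet(np.ones(2))
    assert forward_kl(qx, qy) >= best - 1e-12  # no random candidate does better
print(best)  # ~0.25, attained at qx = px, qy = py
```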
2.2 The properties of reverse KL
Again, starting from absolute continuity: the reverse KL requires $q \ll p$, so every event with zero probability under $p$ is forced to have zero probability under $q$ as well. As a result, we underestimate the variance of $p$ when we use $q$ to approximate $p$ (mode-seeking behavior). The advantage of the reverse KL is that the expectation is taken under $q$, so we can evaluate it by sampling from $q$ efficiently. In most situations we choose simple distributions for $q$, like a Gaussian or another member of the exponential family, which are easy to sample from.
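The variance underestimation can be seen numerically. A brute-force sketch (the bimodal target, grid, and search ranges are our own choices) that minimizes the reverse KL over a Gaussian family:

```python
import numpy as np

xs = np.linspace(-10, 10, 4001)
dx = xs[1] - xs[0]

def normal(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Bimodal target p; reverse KL(q || p) approximated by a Riemann sum of q log(q / p).
p = 0.5 * normal(xs, -3, 1) + 0.5 * normal(xs, 3, 1)

def reverse_kl(mu, s):
    q = normal(xs, mu, s)
    return np.sum(q * np.log(q / p)) * dx

grid = [(mu, s) for mu in np.linspace(-5, 5, 101)
                for s in np.linspace(0.5, 5, 46)]
mu_q, s_q = min(grid, key=lambda t: reverse_kl(*t))
print(mu_q, s_q)  # ~(+/-3, 1): q locks onto a single mode, and its std is far
                  # below the overall std of p (~3.16), underestimating the variance
```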
This tractable sampling is why we prefer the reverse KL in both CAVI and BBVI.