Probabilistic programming-Notes-Variational inference-Part2

This note covers mean-field and structured variational families. Besides VI, the mean-field and other variational families can also be used for inference in probabilistic neural networks.

1. Mean-field variational family

As introduced in the last note, the variational family $\mathcal{Q}$ determines the optimization problem. Although we have AD algorithms, it is easier to optimize over a simple family. The mean-field variational family assumes that all the latent variables are mutually independent. A generic member of the mean-field variational family can be written as:

$$q(z) = \prod_{i=0}^{n} q_{i}(z_i)$$

For the mean-field variational family, we can use the coordinate ascent algorithm (CAVI) to solve the optimization problem.

First, note that $p(z)$ below denotes the prior distribution of the latent variables:

$$ELBO(q) = \mathbb{E}_q[\text{log} p(x,z)] - \mathbb{E}_q[\text{log} q(z)] = \mathbb{E}_q[\text{log} p(x|z)] + \mathbb{E}_q[\text{log} p(z)] - \mathbb{E}_q[\text{log} q(z)]$$

$$= \mathbb{E}_q[\text{log} p(x|z)] - KL(q(z) \,\|\, p(z))$$

The above equation shows that $q$ is pushed to explain the likelihood while staying close to the prior distribution of $z$, i.e., there is a balance between the two terms. In this sense, maximizing the $ELBO$ has the flavor of Bayesian estimation. The nice properties of the mean-field variational family lead to an efficient optimization algorithm, the well-known coordinate ascent VI (CAVI).
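To make this decomposition concrete, here is a minimal numerical check on a toy two-state model (the prior, likelihood, and $q$ values below are arbitrary choices for illustration): it verifies that the two forms of the ELBO agree and that the ELBO lower-bounds $\text{log} p(x)$.

```python
import numpy as np

# Toy discrete model: latent z in {0, 1}, a single fixed observation x.
prior = np.array([0.7, 0.3])   # p(z)
lik   = np.array([0.2, 0.9])   # p(x | z) evaluated at the observed x
q     = np.array([0.4, 0.6])   # an arbitrary variational distribution q(z)

elbo = np.sum(q * (np.log(lik) + np.log(prior) - np.log(q)))  # E_q[log p(x,z)] - E_q[log q(z)]
kl   = np.sum(q * (np.log(q) - np.log(prior)))                # KL(q(z) || p(z))
decomposed = np.sum(q * np.log(lik)) - kl                     # E_q[log p(x|z)] - KL(q || p)
log_px = np.log(np.sum(lik * prior))                          # log p(x)

print(np.isclose(elbo, decomposed))  # True: the two ELBO expressions agree
print(elbo <= log_px)                # True: the ELBO lower-bounds the evidence
```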

1.1 CAVI

Viewing the ELBO as a function of $q_j$ alone, with the other factors held fixed, and using the definition of expectation (it becomes obvious if you rewrite the ELBO as an integral), we can conclude that:

$$ELBO(q_j) = \mathbb{E}_{j}\Big[\mathbb{E}_{-j}[\text{log} p(z_j, \mathbf{z}_{-j}, \mathbf{x})] \Big] - \mathbb{E}_{j}\Big[\text{log} q_j(z_j)\Big] + const$$

Again, I will solve this with variational calculus intentionally. Let $f(z_j) = \mathbb{E}_{-j}[\text{log} p(z_j, \mathbf{z}_{-j}, \mathbf{x})]$. At the optimal pdf $q^{*}(z_j)$, let's perturb it with a function $y(z_j)$. The objective function is:

$$F(\theta) = \int_{z_j} f(z_j)(q^{*}(z_j) + \theta y(z_j))dz_j - \int_{z_j} (q^{*}(z_j) + \theta y(z_j)) \text{log}(q^{*}(z_j) + \theta y(z_j))dz_j$$

The first-order condition is:

$$\frac{dF}{d \theta} \Big | _{\theta = 0} = \int_{z_j} f(z_j) y(z_j)dz_j - \int_{z_j} y(z_j)dz_j - \int_{z_j} \text{log}q^{*}(z_j)y(z_j)dz_j = 0$$

Since this must hold for every perturbation $y(z_j)$, the integrand must vanish, i.e. $f(z_j) - 1 - \text{log} q^{*}(z_j) = 0$, which gives:

$$q^{*}(z_j) = \text{exp}(f(z_j) - 1)$$

Therefore, at each step of CAVI, we can simply set $q^{*}(z_j) \propto \text{exp}(f(z_j))$ (recall Gibbs sampling) and iterate over the coordinates to reach a local optimum.
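To make CAVI concrete, below is a minimal sketch for the classic conjugate Normal-Gamma model: $x_i \sim \mathcal{N}(\mu, \tau^{-1})$ with priors $\mu \sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1})$ and $\tau \sim \text{Gamma}(a_0, b_0)$, and mean-field $q(\mu, \tau) = q(\mu)q(\tau)$. The updates are the standard closed-form CAVI updates for this model; the hyperparameter values are illustrative.

```python
import numpy as np

def cavi_normal_gamma(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """CAVI for x_i ~ N(mu, 1/tau), mu ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0),
    with mean-field q(mu, tau) = q(mu) q(tau)."""
    N, x_bar, x_sq = len(x), x.mean(), np.sum(x ** 2)
    E_tau = a0 / b0                 # initial guess for E_q[tau]
    a_N = a0 + (N + 1) / 2          # fixed by the model; does not change across iterations
    for _ in range(n_iter):
        # Update q(mu) = N(mu | mu_N, 1/lam_N) given the current E_q[tau]
        mu_N = (lam0 * mu0 + N * x_bar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau
        E_mu, E_mu2 = mu_N, mu_N ** 2 + 1.0 / lam_N
        # Update q(tau) = Gamma(tau | a_N, b_N) given the current q(mu)
        b_N = b0 + 0.5 * (x_sq - 2 * N * x_bar * E_mu + N * E_mu2
                          + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0 ** 2))
        E_tau = a_N / b_N
    return mu_N, lam_N, a_N, b_N

# Toy usage: the variational means should land near the true mean and precision.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=200)
mu_N, lam_N, a_N, b_N = cavi_normal_gamma(x)
print("E_q[mu] ~", mu_N, " E_q[tau] ~", a_N / b_N)
```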

Note that the mean-field approximation cannot fit complex posteriors (e.g., XOR-like dependence between latent variables). However, the mean-field approach yields a globally consistent set of moments.

2. Structured variational family

Although the mean-field variational family makes the problem tractable, it reduces fidelity. A natural idea is to add a “new layer” to the model. In a structured variational family, the joint distribution factorizes as:

$$p(y,z,\beta) = p(\beta)\prod_n p(y_n, z_n | \beta)$$

which means the pairs $(y_n, z_n)$ are conditionally independent across $n$ given the global parameter $\beta$.

Note that we can also assume $p(y,z,\beta) = p(\beta)\prod_n p(y_n| z_n)p(z_n | \beta)$, which means $y_n$ is independent of $\beta$ given $z_n$. Most of the literature restricts attention to conditionally conjugate models.

2.1 A review of the exponential family

Exponential family distributions admit conjugate priors. An exponential family distribution can be parameterized by a vector of expectation parameters $\mu = \mathbb{E}_{p(y)}\big[T(y)\big]$, where $T(y)$ is the vector of sufficient statistics of the data. We can assign a conjugate prior distribution $p(\mu)$ to these parameters, which is parameterized by the prior expectation $\bar{\mu}_0 = \mathbb{E}_{p(\mu)}[\mu]$. Upon observing $N$ independently sampled data points, the posterior expectation parameters are a convex combination of the prior parameters and the maximum likelihood estimator:

$$\bar{\mu} = \lambda \circ \bar{\mu}_0 + (1-\lambda) \circ \mu_{ML}$$

Again, the key point is that exponential family distributions admit conjugate priors; that is the reason why we restrict attention to this class.
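As a quick sanity check of the convex-combination property, consider a Beta-Bernoulli toy example (my own, not from the note): the posterior mean of the success probability interpolates between the prior mean and the maximum-likelihood estimate, with a weight $\lambda$ that shrinks as $N$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
a0, b0 = 3.0, 7.0                      # Beta(a0, b0) prior; prior mean = a0 / (a0 + b0)
y = rng.binomial(1, 0.6, size=50)      # N Bernoulli observations
N, k = len(y), y.sum()

prior_mean = a0 / (a0 + b0)
mle        = k / N
post_mean  = (a0 + k) / (a0 + b0 + N)  # conjugate Beta posterior mean

lam = (a0 + b0) / (a0 + b0 + N)        # weight on the prior mean
print(np.isclose(post_mean, lam * prior_mean + (1 - lam) * mle))  # True
```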

2.2 Conditionally conjugate model

Assume:

$$p(\beta) = h(\beta) \text{exp}(\eta \cdot t(\beta) - A(\eta))$$

$$p(y_n, z_n|\beta) = \text{exp} \Big(t(\beta) \cdot \eta_{n}(y_n, z_n) + g_n(y_n, z_n)\Big )$$

, where the base measure $h(\cdot)$ and log-normalizer $A(\cdot)$ are scalar-valued functions, $\eta$ is a vector of natural parameters, $t(\beta)$ is a vector-valued sufficient statistic function, $g_n$ is a scalar-valued function and $\eta_n$ is a vector-valued function. The posterior distribution can be written as:

$$p(\beta | y, z) = h(\beta) \text{exp} \Big( \big(\eta + \sum_{n} \eta_n (y_n, z_n)\big) \cdot t(\beta) - A\big(\eta + \sum_n \eta_n (y_n, z_n)\big)\Big)$$

The goal is to approximate the intractable posterior $p(z, \beta | y)$ with a distribution $q(z, \beta)$ in some tractable family by solving an optimization problem. Note that this assumption is about how we believe the real data is generated.
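To see the natural-parameter update $\eta + \sum_n \eta_n(y_n, z_n)$ in action, here is a toy instance (my own construction, with $z$ treated as given purely to illustrate the update): a $K$-component Gaussian mixture with unit observation variance, where $\beta$ collects the component means with $\mathcal{N}(0, \sigma_0^2)$ priors, and each pair $(y_n, z_n)$ contributes $\eta_n(y_n, z_n) = (y_n, -1/2)$ to the natural parameters of its assigned component.

```python
import numpy as np

rng = np.random.default_rng(2)
K, sigma0 = 3, 2.0
true_means = np.array([-4.0, 0.0, 5.0])
z = rng.integers(0, K, size=300)        # component assignments (treated as given here)
y = rng.normal(true_means[z], 1.0)      # observations with unit variance

# Prior natural parameters of each N(mu_k | 0, sigma0^2) w.r.t. t(mu_k) = (mu_k, mu_k^2)
eta = np.tile([0.0 / sigma0**2, -1.0 / (2 * sigma0**2)], (K, 1))

# Each (y_n, z_n) adds eta_n = (y_n, -1/2) to the natural parameters of component z_n
for yn, zn in zip(y, z):
    eta[zn] += np.array([yn, -0.5])

# Convert natural parameters back to (mean, variance) of the posterior over each mu_k
post_var  = -1.0 / (2.0 * eta[:, 1])
post_mean = eta[:, 0] * post_var
print(post_mean)   # close to the empirical cluster means, shrunk slightly toward 0
```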

2.2.1 Method 1: Mean-field assumption for $q(z, \beta)$

$$q(z, \beta) = q(\beta) \prod_n \prod_m q(z_{n,m})$$

Recall that each $z_n$ is an $m$-dimensional vector.

2.2.2 Method 2: Structured Stochastic VI

In this framework, the variational distribution $q$ is of the form:

$$q(z, \beta) = \Big(\prod_{k} q(\beta_k)\Big) \prod_{n} q(z_n | \beta)$$

The paper restricts $q(\beta)$ to be in the same exponential family as the prior $p(\beta)$, so that:

$$q(\beta) = h(\beta) \text{exp} \Big(\lambda \cdot t(\beta) - A(\lambda)\Big)$$

where $\lambda$ is a vector of parameters that controls $q(\beta)$. We also require that any dependence under $q$ between $z_n$ and $\beta$ be mediated by some vector-valued function $\gamma_n(\beta)$, so that we may write $q(z_n|\beta) = q(z_n | \gamma_n(\beta))$. Review the probabilistic graph in case you feel confused about some dependency relationships.

Objective for structured stochastic VI

Our goal is to find a distribution $q(z, \beta)$ that has low KL divergence to the posterior $p(z, \beta | y)$. The KL divergence between $q$ and the full posterior is:

$$KL(q(z, \beta) \,\|\, p(z, \beta |y)) = -\mathbb{E}_q[\text{log} p(y, z, \beta)] + \mathbb{E}_q[\text{log} q(z, \beta)] + \text{log} p(y)$$

The ELBO can be further written as:

$$ELBO = \mathbb{E}_q[\text{log} p(y, z, \beta)] - \mathbb{E}_q[\text{log} q(z, \beta)] = \mathbb{E}_q \Big[\text{log} \frac{p(\beta)}{q(\beta)}\Big] + \sum_n \mathbb{E}_q\Big[\text{log} \frac{p(y_n, z_n |\beta)}{q(z_n | \beta)}\Big]$$

$$= \int_{\beta} q(\beta) \Big(\text{log}\frac{p(\beta)}{q(\beta)} + \sum_n \int_{z_n} q(z_n | \beta) \text{log} \frac{p(y_n, z_n | \beta)}{q(z_n | \beta)} dz_n\Big)d\beta \le \text{log} p(y)$$

(Remark: the second equality uses the factorizations of $p$ and $q$, so each expectation only involves the relevant marginal of $q$.)

Note that the second term contains, for each $n$, an ELBO on the marginal probability of the $n$th group of observations:

$$\int_{z_n} q(z_n | \beta) \text{log} \frac{p(y_n, z_n | \beta)}{q(z_n | \beta)}d z_n = -KL(q(z_n | \beta) \,\|\, p(z_n|y_n, \beta)) + \text{log} p(y_n | \beta) \le \text{log} p(y_n | \beta)$$

(Remark: this follows from the definition of conditional probability, $p(y_n, z_n | \beta) = p(z_n | y_n, \beta)\, p(y_n | \beta)$.)

Therefore, given this nice structure, we can further conclude that, with $q(\beta)$ held fixed, maximizing $ELBO(q)$ amounts to minimizing the ‘local’ KL divergence between $q(z_n|\beta)$ and $p(z_n|y_n, \beta)$. Note that we also assume that the function $\gamma_n(\beta)$ (that’s smart!) controlling $q(z_n | \beta) = q(z_n | \gamma_n(\beta))$ is defined to ensure that:

$$\nabla_{\gamma_{n}} \int_{z_n} q(z_n | \gamma_{n}(\beta)) \text{log} \frac{p(y_n, z_n |\beta)}{q(z_n | \gamma_n(\beta))}dz_n = 0$$

Note that $\gamma_{n}(\beta)$ can be defined implicitly. (I really love this idea.)
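As one concrete way to realize this condition (a sketch of my own, not necessarily the paper's construction): in a unit-variance Gaussian mixture where $q(z_n | \gamma_n(\beta))$ is a categorical distribution, the local ELBO is maximized, so its gradient in $\gamma_n$ vanishes, when $\gamma_n(\beta)$ equals the exact responsibilities $p(z_n | y_n, \beta)$:

```python
import numpy as np

def gamma_n(beta_means, weights, y_n):
    """Responsibilities p(z_n = k | y_n, beta) for a unit-variance Gaussian mixture.
    Setting q(z_n | gamma_n(beta)) = Categorical(gamma_n(beta)) drives the local KL
    to zero, so the gradient condition on the local ELBO is satisfied."""
    log_r = np.log(weights) - 0.5 * (y_n - beta_means) ** 2  # log pi_k + log N(y_n | mu_k, 1) + const
    log_r -= log_r.max()                                     # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum()

# Example: gamma_n(beta) is a deterministic function of the component means beta.
print(gamma_n(beta_means=np.array([-2.0, 0.0, 3.0]),
              weights=np.array([0.5, 0.3, 0.2]), y_n=2.4))
```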

Reference

Hoffman, M. D., & Blei, D. M. (2015). Structured Stochastic Variational Inference. AISTATS.
