Statistical Inference-Notes-Part5-Likelihood method

Keywords: Score function, Fisher Information, Cramer-Rao Lower Bound, asymptotic properties of likelihood estimator

1. Score function and Fisher information matrix

The score function is defined as the first-order derivative of the log-likelihood function. Note that the log-likelihood function and the score function are themselves random variables. Let $l(x; \theta) = \log f(x; \theta)$ and $u(x; \theta) = \frac{\partial \log f(x; \theta)}{\partial \theta}$.

$\mathbb{E}(u(x;\theta)) = \int_{\mathcal{X}} \frac{1}{f(x; \theta)} f(x; \theta) \frac{\partial f(x; \theta)}{\partial \theta} dx = \frac{\partial}{\partial \theta}\int_{\mathcal{X}} f(x; \theta) dx = \frac{\partial}{\partial \theta} 1 = 0$

$\frac{\partial}{\partial \theta^T}\mathbb{E}\Big(\frac{\partial \log f(x; \theta)}{\partial \theta}\Big) = \mathbb{E}\Big(\frac{\partial^2 l(x; \theta)}{\partial \theta \partial \theta^T}\Big) + \mathbb{E}\Big(\frac{\partial l(x; \theta)}{\partial \theta} \frac{\partial l(x; \theta)}{\partial \theta^T}\Big) = 0$

Therefore, $\mathbb{E}\Big(\frac{\partial^2 l(x; \theta)}{\partial \theta \partial \theta^T}\Big) = -\mathbb{E}\Big(\frac{\partial l(x; \theta)}{\partial \theta} \frac{\partial l(x; \theta)}{\partial \theta^T}\Big) = -\text{var}(u(x; \theta))$

Note that $-\frac{\partial^2 l(x; \theta)}{\partial \theta \partial \theta^T}$ is called the observed information matrix (denoted by $e_{i}$ for the $i$-th observation among $n$ i.i.d. random variables), and its expectation is called the Fisher information matrix (denoted by $i_{n}$ for $n$ i.i.d. random variables).

If the observations are i.i.d., it follows immediately that $i_n(\theta) = n\, i_1(\theta)$.
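The two identities above (zero-mean score and information additivity) are easy to check numerically. Below is a minimal simulation sketch for an assumed Bernoulli($\theta$) model; the helper `score` and all constants are illustrative, not from the notes.

```python
# Numerical check of E[u] = 0 and i_n = n * i_1 for a Bernoulli(theta) model (illustrative).
import numpy as np

rng = np.random.default_rng(0)
theta0, n, reps = 0.3, 50, 100_000

def score(x, theta):
    # d/dtheta log f(x; theta) for Bernoulli: x/theta - (1 - x)/(1 - theta)
    return x / theta - (1 - x) / (1 - theta)

x = rng.binomial(1, theta0, size=(reps, n))
u = score(x, theta0)                # per-observation scores u(x_i; theta0)
u_n = u.sum(axis=1)                 # score of the whole sample

print("E[u_1]     ~", u[:, 0].mean())    # ~ 0
print("var(u_1)   ~", u[:, 0].var())     # ~ i_1 = 1/(theta0*(1-theta0)) ~ 4.76
print("var(u_n)/n ~", u_n.var() / n)     # ~ i_1, i.e. i_n = n * i_1
```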

2. The Cramer-Rao Lower Bound

Let $W(X)$ be any estimator of $\theta$ and let $m(\theta) = \mathbb{E}_{\theta}\{W(X)\}$. Let $Y = W(X)$ and $Z = \frac{\partial}{\partial \theta} \log f(X; \theta)$. Based on the Cauchy-Schwarz inequality, we can conclude that:

$\text{var}(Y)\,\text{var}(Z) \ge \{\text{cov}(Y, Z)\}^2$

Note that $\text{cov}(Y, Z) = \int W(x) \big\{ \frac{\partial}{\partial \theta} \log f(x; \theta)\big\} f(x; \theta)dx = m^{\prime}(\theta)$

(A sufficient condition for interchanging differentiation and integration is that the parameter-dependent integral converges uniformly in the parameter; a stronger sufficient condition is that the partial derivative of the integrand with respect to the parameter is continuous. Also, the covariance above does not subtract the mean, but the mean term vanishes because $m(\theta)\nabla_{\theta}\int_{\mathcal{X}}f(x; \theta)dx = 0$. Another interesting point is that $\int_{\mathcal{X}} (W(x) - m(\theta)) \frac{\partial f(x; \theta)}{\partial \theta} dx \neq \frac{\partial}{\partial \theta}\int_{\mathcal{X}} (W(x) - m(\theta)) f(x; \theta) dx$, since $m(\theta)$ itself depends on $\theta$. This illustrates the importance of careful notation: if you lazily drop the parameter $\theta$, the day will come when you forget it while differentiating. What goes around comes around, again.)

Note that $\text{var}(Z) = \mathbb{E}\Big\{\frac{\partial \log f(X; \theta)}{\partial \theta} \frac{\partial \log f(X; \theta)}{\partial \theta^T}\Big\} = i(\theta)$

Therefore $\text{var}\{W(X)\} \ge \frac{\{m^{\prime}(\theta)\}^2}{i(\theta)}$.

If we have an unbiased estimator, we can further conclude that $m(\theta) = \theta$ and $m^{\prime}(\theta) = 1$, so the bound reduces to $\text{var}\{W(X)\} \ge \frac{1}{i(\theta)}$.

Any unbiased estimator which achieves the lower bound is an MVUE, a minimum variance unbiased estimator.

Let's further consider the condition under which the equality, rather than the inequality, holds. Note that $\{\text{cov}(Y, Z)\}^2 = \text{var}(Y)\,\text{var}(Z)$ iff $\text{corr}(Y, Z) = \pm 1$, which means $Z$ must be a linear (affine) function of $Y$. Thus $\frac{\partial}{\partial \theta} \log f(X;\theta) = a(\theta) W(X) - b(\theta)$, and $\log f(X;\theta) = A(\theta)W(X) + B(\theta) + C(X)$, which is the form of an exponential family. (The exponential family has remarkably nice properties, and the Gaussian especially so; for an exponential family, once we find an unbiased estimator whose variance attains $\frac{\{m^{\prime}(\theta)\}^2}{i(\theta)}$, we obtain an MVUE for free.)

For multi-dimension parameter space, the CRB can be written as:

$\text{cov}_{\theta}(W(X)) \ge \frac{\partial m(\theta)}{\partial \theta} [I(\theta)]^{-1}\Big(\frac{\partial m(\theta)}{\partial \theta}\Big)^T$,

where $\frac{\partial m(\theta)}{\partial \theta}$ is the Jacobian matrix, and $A \ge B$ means that $A-B$ is positive semi-definite.
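To see the bound in action, here is a small simulation sketch (illustrative only; the Bernoulli model and all constants are assumed, not taken from the notes) showing that the sample mean of a Bernoulli($\theta$) sample, an unbiased estimator, attains the Cramer-Rao lower bound $1/(n\, i_1(\theta))$, as expected for an exponential family.

```python
# Checking that the Bernoulli sample mean attains the CRLB (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
theta0, n, reps = 0.3, 100, 100_000

x = rng.binomial(1, theta0, size=(reps, n))
theta_hat = x.mean(axis=1)             # W(X) = sample mean, unbiased for theta0

i_1 = 1.0 / (theta0 * (1 - theta0))    # Fisher information of one observation
crlb = 1.0 / (n * i_1)                 # CRLB for unbiased estimators

print("empirical var(theta_hat):", theta_hat.var())   # ~ theta0*(1-theta0)/n
print("CRLB 1/(n*i_1)          :", crlb)               # the two agree
```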

3. Asymptotic properties of maximum likelihood estimators

3.1 A review of asymptotic theorems

The strong law of large numbers (SLLN) says that the sequence of random variables $Y_n = n^{-1} (X_1 + X_2 + \dots + X_n)$ converges almost surely to $\mu$ iff $\mathbb{E} (|X_i|)$ is finite. The weak law of large numbers (WLLN) says that $Y_n = n^{-1} (X_1 + X_2 + \dots + X_n)$ converges to $\mu$ in probability if the $X_i$ have a finite expectation, i.e. $Y_n \xrightarrow{p} \mu$. Note that the technical conditions for the WLLN and the SLLN are the same in the i.i.d. case, namely a finite expectation (finite variance is not necessary). The CLT says that, under the additional condition that the $X_i$ have finite variance $\sigma^2$, $Z_n = \frac{\sqrt{n}(Y_n - \mu)}{\sigma}$ converges in distribution to a random variable $Z$ having the standard normal distribution $N(0, 1)$ ($Z_n \xrightarrow{d} N(0, 1)$). (The demands of the SLLN and the WLLN are different: some of the more involved SLLN results require constructing sequences via Borel-Cantelli. Convergence in distribution, as in the CLT, is the weakest mode of convergence (recall: Taylor-expand the characteristic function and drop the higher-order terms); moreover, the distribution functions converging to the normal distribution does not guarantee that the density functions also converge to the normal density.)
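To make the "finite variance is not necessary" point concrete, the following minimal sketch (illustrative; the Pareto model and constants are assumed, not from the notes) draws from a distribution with finite mean but infinite variance, and the running sample mean still settles down, if slowly.

```python
# LLN with finite mean but infinite variance: Pareto with shape a = 1.5 (illustrative).
import numpy as np

rng = np.random.default_rng(2)
a = 1.5                                   # mean = a/(a-1) = 3, variance infinite
x = 1.0 + rng.pareto(a, size=1_000_000)   # numpy draws the Lomax form; +1 gives x_m = 1
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10**2, 10**4, 10**6):
    print(n, running_mean[n - 1])         # slowly approaches 3
```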

Slutsky's Theorem: if $Y_n \xrightarrow{d} Y$ and $Z_n \xrightarrow{p} c$ (convergence in probability, e.g. from the WLLN, let alone the SLLN), where $c$ is a finite constant, and if $g$ is a continuous function (recall the $\epsilon$-$\delta$ definition of continuity: it is plausible since we can always control the difference in the image by controlling the difference in the domain), then $g(Y_n, Z_n) \xrightarrow{d} g(Y, c)$. For example, $Y_n + Z_n \xrightarrow{d} Y + c$, $Y_n Z_n \xrightarrow{d} cY$, and $Y_n/Z_n \xrightarrow{d} Y/c$ (provided $c \neq 0$).
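A standard illustration of Slutsky's theorem is the studentized mean: the CLT handles $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ and the WLLN handles $S_n/\sigma \xrightarrow{p} 1$, so their ratio is still asymptotically $N(0, 1)$. The sketch below (illustrative; the exponential model and constants are assumed) checks this by simulation.

```python
# Slutsky's theorem: the studentized mean is asymptotically N(0, 1) (illustrative sketch).
import numpy as np

rng = np.random.default_rng(3)
mu, n, reps = 2.0, 200, 50_000           # Exp(mean = 2): mu = sigma = 2

x = rng.exponential(scale=mu, size=(reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)                # S_n, consistent for sigma
t = np.sqrt(n) * (xbar - mu) / s         # = [sqrt(n)(xbar - mu)/sigma] / [S_n/sigma]

print("mean ~ 0             :", t.mean())
print("var  ~ 1             :", t.var())
print("P(|t| <= 1.96) ~ 0.95:", (np.abs(t) <= 1.96).mean())
```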

3.2 Consistency of the maximum likelihood estimator

$\hat{\theta}_n \xrightarrow{p} \theta$ is called weakly consistent; $\hat{\theta}_n \xrightarrow{a.s.} \theta$ is called strongly consistent.

Suppose $f(x; \theta)$ is a family of probability densities or probability mass functions and let $\theta_0$ denote the true value of the parameter $\theta$. For any $\theta \neq \theta_0$ we have, by Jensen's inequality:

$\mathbb{E}_{\theta_0}\{\log \frac{f(X; \theta)}{f(X; \theta_0)}\} \le \log\mathbb{E}_{\theta_0}\{ \frac{f(X; \theta)}{f(X; \theta_0)}\} = \log \int_{\mathcal{X}} f(x; \theta) dx = 0$. Note that the inequality is strict unless $\frac{f(X; \theta)}{f(X; \theta_0)} = 1$ with probability one.

Let $\mu_1 = \mathbb{E}_{\theta_0} \{\log \frac{f(X; \theta_0 - \delta)}{f(X; \theta_0)}\} \le 0$ and $\mu_2 = \mathbb{E}_{\theta_0} \{\log \frac{f(X; \theta_0 + \delta)}{f(X; \theta_0)}\} \le 0$; both are strictly negative whenever $f(\cdot; \theta_0 \pm \delta)$ and $f(\cdot; \theta_0)$ actually differ (identifiability).
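As a quick sanity check of this inequality, consider an assumed $N(\theta, 1)$ location model (illustrative, not from the notes): there $\mathbb{E}_{\theta_0}\{\log \frac{f(X;\theta)}{f(X;\theta_0)}\} = -(\theta - \theta_0)^2/2$, which a short simulation reproduces.

```python
# E_{theta0}[ log f(X; theta) / f(X; theta0) ] for an N(theta, 1) model (illustrative).
import numpy as np

rng = np.random.default_rng(4)
theta0 = 0.0
x = rng.normal(theta0, 1.0, size=1_000_000)

def log_ratio(x, theta):
    # log f(x; theta) - log f(x; theta0) for the N(theta, 1) family
    return -0.5 * (x - theta) ** 2 + 0.5 * (x - theta0) ** 2

for theta in (-1.0, -0.2, 0.0, 0.2, 1.0):
    print(theta, log_ratio(x, theta).mean())   # ~ -(theta - theta0)**2 / 2 <= 0
```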

Note that $\frac{l_n(\theta_0)}{n} = \frac{l(x_1; \theta_0) + \dots + l(x_n; \theta_0) }{n}$; if $\mathbb{E}_{\theta_0}(|l(X; \theta_0)|) < +\infty$, then by the SLLN $\frac{l_n(\theta_0)}{n} \xrightarrow{a.s.} \mathbb{E}_{\theta_0}(\log f(X; \theta_0))$.

lemma 3.2.1: $\mathbb{E}_{\theta_0}(|\log f(X; \theta_0)|)$ is finite.

Proof:

$\mathbb{E}_{\theta_0}(|\log f(X; \theta_0)|) = \mathbb{E}_{\theta_0}\big(|\log f(X; \theta_0)| \,\big|\, f(X; \theta_0) \ge 1\big)\, \textbf{Pr}\{f(X; \theta_0) \ge 1\} + \mathbb{E}_{\theta_0}\big(|\log f(X; \theta_0)| \,\big|\, 0 \le f(X; \theta_0) < 1\big)\, \textbf{Pr}\{0 \le f(X; \theta_0) < 1\} \le \mathbb{E}_{\theta_0}\big(|\log f(X; \theta_0)| \,\big|\, f(X; \theta_0) \ge 1\big) + \mathbb{E}_{\theta_0}\big(|\log f(X; \theta_0)| \,\big|\, 0 \le f(X; \theta_0) < 1\big)$

Let's further consider the second term. For $0 \le t \le 1$ we have $-t \log t \le e^{-1}$, so $-\int_{0 \le f(x; \theta_0) < 1} \log (f(x; \theta_0))\, f(x; \theta_0)\, dx \le e^{-1} \int_{0 \le f(x; \theta_0) < 1} dx$, which is finite whenever the set $\{x: f(x; \theta_0) < 1\}$ has finite measure.

Therefore $\mathbb{E}_{\theta_0}(|\log f(X; \theta_0)|)$ is finite under these regularity conditions, and we can apply the SLLN (and a fortiori the WLLN).

Since $\mu_1 = \mathbb{E}_{\theta_0} \{\log \frac{f(X; \theta_0- \delta)}{f(X; \theta_0)}\} < 0$, applying the SLLN to the i.i.d. terms $\log \frac{f(X_i; \theta_0 - \delta)}{f(X_i; \theta_0)}$ we can conclude that

$\frac{l_n(\theta_0 - \delta) - l_n(\theta_0)}{n} \xrightarrow{a.s.} \mu_1 < 0$.

By the same logic, we can conclude that $\frac{l_n(\theta_0 + \delta) - l_n(\theta_0)}{n} \xrightarrow{a.s.} \mu_2 < 0$

Hence, for any fixed $\delta > 0$, with probability one $l_n(\theta_0 - \delta) < l_n(\theta_0)$ and $l_n(\theta_0 + \delta) < l_n(\theta_0)$ for all sufficiently large $n$, so $l_n$ eventually has a local maximizer inside $(\theta_0 - \delta, \theta_0 + \delta)$; letting $\delta \to 0$ as $n \to \infty$ yields consistency. Note that this is a rather general result: it does not require differentiability in $\theta$, only continuity (in particular, absolute continuity is not needed).

Conclusion: the SLLN and continuity guarantee the strong consistency of the maximum likelihood estimator. (Forcibly constructing null sets like this is brute-force aesthetics at its finest...) Although we cannot make any assumptions about the global uniqueness of the maximum likelihood estimate, it is unique on any sufficiently small neighborhood of $\theta_0$.
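A minimal consistency check by simulation (illustrative; the Exponential(rate) model and the constants are assumed, not from the notes): the MLE of the rate is $\hat{\lambda}_n = 1/\bar{X}_n$, and it converges to the true rate as $n$ grows.

```python
# Consistency of the MLE of an Exponential rate: lambda_hat = 1/mean(X) (illustrative).
import numpy as np

rng = np.random.default_rng(5)
lambda0 = 2.0

for n in (10, 100, 10_000, 1_000_000):
    x = rng.exponential(scale=1.0 / lambda0, size=n)
    print(n, 1.0 / x.mean())             # approaches lambda0 = 2.0
```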

3.3 The asymptotic distribution of the maximum likelihood estimator

Strong consistency can be guaranteed without any differentiability assumptions on the log-likelihood function in $\theta$. However, if we want to work out the asymptotic distribution, we need to assume that the log-likelihood function is twice continuously differentiable. Let's further assume that $l^{\prime}_n(\theta) = 0$ has a solution.

According to the Taylor expansion with Lagrange remainder, we can conclude that:

$-l^{\prime}_n(\theta_0) = l^{\prime}_n(\hat{\theta}_n) - l^{\prime}_n(\theta_0) = (\hat{\theta}_n - \theta_0)\, l^{\prime\prime}_n(\theta_n^*)$, since $l^{\prime}_n(\hat{\theta}_n) = 0$, where $\theta_n^*$ lies between $\hat{\theta}_n$ and $\theta_0$.

Let’s first consider a scalar scenario. $\sqrt{n i_1(\theta_0)}(\hat{\theta}_n - \theta_0) = \frac{l_n^\prime(\theta_0)}{\sqrt{n i_1(\theta_0)}} \cdot \frac{l^{\prime\prime}_n(\theta_0)}{l^{\prime\prime}_n(\theta_n^*)} \cdot \Big\{-\frac{l^{\prime\prime}_n(\theta_0)}{n i_1(\theta_0)} \Big\}^{-1}$.

lemma 3.3.1: $\frac{l_n^\prime(\theta_0)}{\sqrt{n i_1(\theta_0)}} \xrightarrow{d} N(0, 1)$

Let's consider the random variable $l_n^\prime(\theta_0)$. We have shown that $\mathbb{E}_{\theta_0}(l_1^\prime(\theta_0)) = 0$ and $\mathbb{E}_{\theta_0}(l_1^\prime(\theta_0)\, l_1^\prime(\theta_0)) = \mathbb{E}_{\theta_0}\big(-\frac{\partial^2 l_1(\theta_0)}{\partial \theta^2}\big) = i_1(\theta_0)$. Therefore, by the CLT, $\frac{\sum_i l_i^{\prime}(\theta_0)}{\sqrt{n\, i_1(\theta_0)}} \xrightarrow{d} N(0, 1)$.
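A simulation sketch of lemma 3.3.1 for an assumed Poisson($\theta$) model (illustrative, not from the notes): the per-observation score is $x/\theta - 1$ and $i_1(\theta) = 1/\theta$, so the normalized score should be approximately standard normal.

```python
# Normalized score l_n'(theta0)/sqrt(n*i_1(theta0)) ~ N(0, 1) for Poisson (illustrative).
import numpy as np

rng = np.random.default_rng(6)
theta0, n, reps = 3.0, 200, 50_000
i_1 = 1.0 / theta0

x = rng.poisson(theta0, size=(reps, n))
score_n = (x / theta0 - 1.0).sum(axis=1)       # l_n'(theta0)
z = score_n / np.sqrt(n * i_1)

print("mean ~ 0            :", z.mean())
print("var  ~ 1            :", z.var())
print("P(z <= 1.645) ~ 0.95:", (z <= 1.645).mean())
```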

lemma 3.3.2: $\frac{l^{\prime\prime}_n(\theta_0)}{l^{\prime\prime}_n(\theta_n^*)} \xrightarrow{p} 1$

$\frac{l^{\prime\prime}_n(\theta_0)}{l^{\prime\prime}_n(\theta_n^*)} - 1 = \frac{l^{\prime\prime}_n(\theta_0) - l^{\prime\prime}_n(\theta_n^*)}{n} \cdot \Big \{\frac{l^{\prime\prime}_n(\theta_n^*)}{n}\Big\}^{-1}$

Note that $l^{\prime\prime}_i(\theta_n^*)$ is also a random variable, and the SLLN applies (together with the strong consistency of $\theta_n^*$ and the smoothness argument below). Therefore $\Big \{\frac{l^{\prime\prime}_n(\theta_n^*)}{n}\Big\}^{-1} \xrightarrow{a.s.} -\frac{1}{i_1(\theta_0)}$

Further note that $\Big|\frac{l^{\prime\prime}_n(\theta_0) - l^{\prime\prime}_n(\theta_n^*)}{n}\Big| \le |\theta_n^* - \theta_0|\, \frac{\sum_i g(X_i)}{n}$, where $\big|\frac{\partial^3 \log f(x; \theta)}{\partial \theta^3}\big| \le g(x)$; i.e. we need to further assume that $\big|\frac{\partial^3 \log f(x; \theta)}{\partial \theta^3}\big|$ is dominated, uniformly in a neighborhood of $\theta_0$, by a function $g(x)$ with finite expectation. Since we have strong consistency, $|\theta_n^{*} - \theta_0| \xrightarrow{a.s.} 0$. Therefore, $\frac{l^{\prime\prime}_n(\theta_0)}{l^{\prime\prime}_n(\theta_n^*)} \xrightarrow{p} 1$

lemma 3.3.3: $\Big\{-\frac{l^{\prime\prime}_n(\theta_0)}{n i_1(\theta_0)} \Big\}^{-1} \xrightarrow{p} 1$

(The proof is straightforward and is left to the reader.)

Remark: given the nice properties of the score function, together with the CLT and Slutsky's lemma, asymptotic analysis should usually start from the score function.
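Putting the three lemmas together, $\sqrt{n\, i_1(\theta_0)}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, 1)$. The sketch below checks this by simulation for an assumed Exponential(rate $\lambda$) model (illustrative; $i_1(\lambda) = 1/\lambda^2$ and the MLE is $\hat{\lambda}_n = 1/\bar{X}_n$).

```python
# Asymptotic normality of the MLE: sqrt(n*i_1)*(lambda_hat - lambda0) ~ N(0, 1) (illustrative).
import numpy as np

rng = np.random.default_rng(7)
lambda0, n, reps = 2.0, 200, 50_000
i_1 = 1.0 / lambda0**2                         # Fisher information of one observation

x = rng.exponential(scale=1.0 / lambda0, size=(reps, n))
lambda_hat = 1.0 / x.mean(axis=1)              # MLE of the rate
z = np.sqrt(n * i_1) * (lambda_hat - lambda0)

print("mean ~ 0             :", z.mean())      # ~ 0 (a small O(1/sqrt(n)) bias remains)
print("var  ~ 1             :", z.var())
print("P(|z| <= 1.96) ~ 0.95:", (np.abs(z) <= 1.96).mean())
```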

Reference

Young, G. A. and Smith, R. L., Essentials of Statistical Inference, Cambridge University Press.

