Keywords: AD algorithm, evaluation trace, Dual number
1. Prerequisites
1.1 Numerical differentiation
According to the Taylor expansion with Lagrange remainder:
$$f(x + h) = f(x) + hf^{\prime}(x) + \frac{1}{2}h^2f^{\prime\prime}(\xi_1)$$
$$f(x - h) = f(x) - hf^{\prime}(x) + \frac{1}{2}h^2f^{\prime\prime}(\xi_2)$$
where $\xi_1 \in (x, x + h)$ and $\xi_2 \in (x - h, x)$.
Remark
a) The error of the forward (or backward) difference approximation $\frac{f(x + h) - f(x)}{h}$ is $O(h)$.
b) The error of the central difference $\frac{f(x + h) - f(x - h)}{2h}$ is $O(h^2)$: the first-order terms cancel, and (assuming $f^{\prime\prime\prime}$ is bounded) the difference of the remainders is $O(h^3)$, leaving an $O(h^2)$ error after division by $2h$.
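A minimal numerical sketch of these two error rates (assuming $f = \sin$ as a hypothetical test function; halving $h$ roughly halves the forward-difference error but quarters the central-difference error):

```python
import math

def forward_diff(f, x, h):
    # forward difference: (f(x + h) - f(x)) / h, error O(h)
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    # central difference: (f(x + h) - f(x - h)) / (2h), error O(h^2)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.0
exact = math.cos(x)  # d/dx sin(x) = cos(x)
for h in (1e-1, 1e-2, 1e-3):
    print(h,
          abs(forward_diff(math.sin, x, h) - exact),  # shrinks roughly linearly in h
          abs(central_diff(math.sin, x, h) - exact))  # shrinks roughly quadratically in h
```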
1.2 Symbolic differentiation
Chain rule.
Note that AD is neither numerical differentiation nor symbolic differentiation.
2. Main modes in AD
All numerical computations are ultimately compositions of a finite set of elementary operations for which derivatives are known (Verma, 2000; Griewank and Walther, 2008).
Notation:
Let’s consider a function $f: \mathbb{R}^n \xrightarrow{} \mathbb{R}^m$
a) $v_{i - n} = x_{i}, \quad i = 1, \dots, n$ are the input variables.
b) $v_{i}, \quad i = 1, \dots, l$ are the intermediate (working) variables.
c) $y_{m-i} = v_{l-i}, \quad i = m-1, \dots, 0$ are the output variables.
Note that any numerical code will eventually produce a numerical evaluation trace. Therefore, AD can differentiate not only closed-form expressions in the classical sense, but also algorithms that make use of control flow such as branching, loops, recursion, and procedure calls.
The evaluation trace records all the binary and unary elementary operations that generate the function value.
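For instance (a hypothetical example chosen only for illustration), the function $f(x_1, x_2) = \ln(x_1) + x_1 x_2 - \sin(x_2)$, with $n = 2$ inputs and $m = 1$ output, unrolls into the evaluation trace
$v_{-1} = x_1, \quad v_0 = x_2$
$v_1 = \ln v_{-1}$
$v_2 = v_{-1} v_0$
$v_3 = \sin v_0$
$v_4 = v_1 + v_2$
$v_5 = v_4 - v_3$
$y = v_5$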
2.1 Forward Mode
Each forward pass of AD (one sweep through the evaluation trace) is initialized by setting exactly one of the tangents $\dot{x}_i = 1$ and the rest to zero. A single forward run therefore generates one column of the Jacobian matrix.
2.1.1 Properties
a) Needs $n$ forward evaluations to compute the full Jacobian matrix (one column per pass; see the worked tangent trace after this list).
b) A very efficient method for computing Jacobian-vector products $Jr$: simply initialize with $\dot{x} = r$.
c) Efficient for $f: \mathbb{R} \xrightarrow{} \mathbb{R}^m$, but poorly suited to $f: \mathbb{R}^n \xrightarrow{} \mathbb{R}$ with large $n$.
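To make (a) concrete, run the hypothetical trace above with the seed $\dot{x}_1 = 1$, $\dot{x}_2 = 0$; each tangent is obtained by differentiating its elementary operation, and the final $\dot{y}$ is one column of the (here $1 \times 2$) Jacobian:
$\dot{v}_{-1} = 1, \quad \dot{v}_0 = 0$
$\dot{v}_1 = \dot{v}_{-1} / v_{-1}$
$\dot{v}_2 = \dot{v}_{-1} v_0 + v_{-1} \dot{v}_0$
$\dot{v}_3 = \cos(v_0) \, \dot{v}_0$
$\dot{v}_4 = \dot{v}_1 + \dot{v}_2$
$\dot{v}_5 = \dot{v}_4 - \dot{v}_3$
$\dot{y} = \dot{v}_5 = \frac{1}{x_1} + x_2 = \frac{\partial y}{\partial x_1}$
A second pass with $\dot{x}_1 = 0$, $\dot{x}_2 = 1$ would give the other column.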
2.1.2 Dual numbers
Forward AD can be viewed as evaluating a function using dual numbers, which can be defined as truncated Taylor series of the form $v + \dot{v}\epsilon$, where $v, \dot{v} \in \mathbb{R}$ and $\epsilon$ is a nilpotent number such that $\epsilon^2 = 0$ and $\epsilon \neq 0$. With this notation, arithmetic on dual numbers can be written as:
$(v + \dot{v} \epsilon) + (u + \dot{u} \epsilon) = (v + u) + (\dot{v} + \dot{u}) \epsilon$
$(v+\dot{v} \epsilon)(u + \dot{u}\epsilon) = vu + (v\dot{u} + \dot{v}u)\epsilon$
Note that the coefficients of $\epsilon$ mirror the symbolic differentiation rules. We can further define:
$f(v + \dot{v} \epsilon) = f(v) + f^{\prime}(v)\dot{v}\epsilon$.
Therefore, the coefficient of $\epsilon$ carries the derivative and expresses the chain rule. Recall the difference between variables and constants in PyTorch or TF: a variable corresponds to the dual number $v + 1\epsilon$, while a constant corresponds to $v + 0\epsilon$.
The chain rule can be expressed as:
$g(f(v + \dot{v}\epsilon)) = g(f(v) + f^{\prime}(v)\dot{v}\epsilon) = g(f(v)) + g^{\prime}(f(v))f^{\prime}(v) \dot{v} \epsilon$
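A minimal Python sketch of forward mode via dual numbers (the class `Dual` and the helper `sin` are my own names, not any library's API); the $\epsilon$-coefficient `d` is propagated exactly by the rules above:

```python
import math

class Dual:
    """Dual number v + d*eps, with eps^2 = 0."""
    def __init__(self, v, d=0.0):
        self.v = v  # primal value
        self.d = d  # tangent, i.e. the coefficient of eps

    def __add__(self, other):
        return Dual(self.v + other.v, self.d + other.d)

    def __mul__(self, other):
        # the product rule shows up in the eps coefficient
        return Dual(self.v * other.v, self.v * other.d + self.d * other.v)

def sin(x):
    # f(v + d*eps) = f(v) + f'(v)*d*eps
    return Dual(math.sin(x.v), math.cos(x.v) * x.d)

# d/dx [x * sin(x)] at x = 2: seed the variable as v + 1*eps
x = Dual(2.0, 1.0)      # a variable; a constant would be Dual(c, 0.0)
y = x * sin(x)
print(y.v, y.d)         # value and derivative sin(2) + 2*cos(2)
```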
2.2 Reverse mode of AD
Reverse-mode AD corresponds to a generalized backpropagation algorithm: it propagates derivatives backward from a given output. In the dual-number (forward) model, the derivatives are computed alongside the values of the intermediate variables in a single pass. In reverse mode, however, the function must first be evaluated forward so that the values of the intermediate variables and the output are available before the derivatives can be propagated backward.
For each variable $v_i$, there is an adjoint variable $\bar{v}_i$, defined as the sensitivity of an output of interest $y$ with respect to changes in $v_i$:
$\bar{v}_i = \frac{\partial y}{\partial v_i}$
Note that the whole function (its evaluation trace) has to be recorded before the backward sweep, which is what allows the expression for each $\bar{v}_i$ to be read off the trace. Clearly, this mode is much faster for a function $f: \mathbb{R}^n \xrightarrow{} \mathbb{R}^m$ with $n \gg m$ (e.g., the gradient of a scalar loss), since a single backward sweep yields one full row of the Jacobian.
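A minimal reverse-mode sketch in Python (the class `Var`, its fields, and `backward` are hypothetical names, not any framework's API): the forward pass builds the trace and stores each node's parents together with the corresponding local partial derivatives, and the backward sweep seeds $\bar{y} = 1$ and accumulates the adjoints $\bar{v}_i$ back to the inputs.

```python
import math

class Var:
    """A node of the evaluation trace; grad holds the adjoint dy/d(this node)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs (parent_node, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value, [(self, other.value), (other, self.value)])

def log(x):
    return Var(math.log(x.value), [(x, 1.0 / x.value)])

def backward(y):
    # collect the trace in topological order, then sweep it in reverse
    order, seen = [], set()
    def topo(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                topo(parent)
            order.append(node)
    topo(y)
    y.grad = 1.0                     # seed the output adjoint
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local   # chain rule: accumulate adjoints

# One backward sweep gives all partials of y = log(x1) + x1*x2
x1, x2 = Var(2.0), Var(3.0)
y = log(x1) + x1 * x2
backward(y)
print(x1.grad, x2.grad)   # 1/2 + 3 = 3.5 and 2.0
```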
Reference
Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. Automatic Differentiation in Machine Learning: a Survey. Journal of Machine Learning Research, 18(153):1–43, 2018.