02 - Mathematical Background
21 Mar 2014
These notes collect the mathematical concepts and results needed for studying Numerical Optimization. Proofs of the more important results are given separately in Mathematical Background in Detail; a link to the proof is given after each such result.
Set
Difference between two sets A and B: $A \setminus B = \{x: x \in A \;\; \text{and} \;\; x \notin B\}$
Supremum
The smallest possible real number $y$ satisfying $x \leq y$ for every $x \in A$. Denoted as $\sup \{x : x \in A\}$.
Infimum
The greatest possible real number $y$ satisfying $x \geq y$ for every $x \in A$. Denoted as $\inf \{x : x \in A\}$.
For example:
$A = \{x: 1 \leq x \lt 3\}$
- $\sup\{x : x \in A\} = 3 (\notin A)$
- $\inf\{x : x \in A\} = 1 (\in A)$
Vector Space:
A nonempty set $S$ is called a vector space if
- For any $\b{x}, \b{y} \in S$, $\b{x} + \b{y}$ is defined and is in $S$. Further,
$$\b{x} + \b{y} = \b{y} + \b{x} \;\; \text{(commutativity)}$$
$$\b{x} + (\b{y} + \b{z}) = (\b{x} + \b{y}) + \b{z} \;\; \text{(associativity)}$$
- There exists an element $\b{0} \in S$ such that $\b{x} + \b{0} = \b{0} + \b{x} = \b{x}$ for all $\b{x} \in S$.
- For any $\b{x} \in S$, there exists $\b{y} \in S$ such that $\b{x} + \b{y} = \b{0}$.
- For any $\b{x} \in S$ and $\alpha \in \mathbb{R}$, $\alpha \b{x}$ is defined and is in $S$. Further,
$$1\b{x} = \b{x} \;\; \forall \b{x}$$
- For any $\b{x}, \b{y} \in S$ and $\alpha, \beta \in \mathbb{R}$,
$$\alpha(\b{x} + \b{y}) = \alpha \b{x} + \alpha \b{y}$$ $$(\alpha + \beta)\b{x} = \alpha \b{x} + \beta \b{x}$$ $$\alpha(\beta \b{x}) = \alpha \beta \b{x}$$
Elements of $S$ are called vectors.
Notations:
- $\mathbb{R}$: vector space of real numbers
- $\mathbb{R}^n$: vector space of real $n \times 1$ vectors
- An n-vector $\b{x}$ is an array of $n$ scalars $x_1, x_2, ..., x_n$: $\b{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$, $\b{x}^T = (x_1, x_2, ..., x_n)$
- $\b{0}^T = (0, 0, ..., 0)$
- $\b{1}^T = (1, 1, ..., 1)$ (we also use $\b{e}$ to denote this vector)
Vector Subspace:
If $S$ and $T$ are vector spaces so that $S \subseteq T$, then $S$ is called a subspace of $T$.
All possible subspaces of $\mathbb{R}^2$ are $\mathbb{R}^2$, $\{(0, 0)\}$, and $\{(x_1, x_2): a x_1 + b x_2 = 0, a \in \mathbb{R}, b \in \mathbb{R}, a^2 + b^2 \neq 0\}$ (that is, all lines passing through $(0, 0)$).
$\{(x_1, x_2): a x_1 + b x_2 = c, a \in \mathbb{R}, b \in \mathbb{R}, a^2 + b^2 \neq 0, c \neq 0\}$ is called an affine space (it is not a subspace, since it does not contain $(0, 0)$).
Spanning Set:
A set of vectors $\b{x}_1, \b{x}_2, ..., \b{x}_k$ is said to span the vector space $S$ if any vector $\b{x} \in S$ can be represented as: $$\b{x} = \sum_{i=1}^{k}\alpha_i \b{x}_i \;\; \alpha_i \in \mathbb{R}$$
Linear Independence:
A set of vectors $\b{x}_1, \b{x}_2, ..., \b{x}_k$ is said to be linearly independent if $$\sum_{i=1}^{k} \alpha_i \b{x}_i = 0 \Rightarrow \alpha_i = 0 \;\; \forall i$$ Otherwise, they are linearly dependent, and at least one of them is a linear combination of the others.
For example, $(1, 0)$ and $(1, 1)$ are linearly independent, while $(1, 0)$ and $(-1, 0)$ are not.
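As a quick numerical sanity check (a numpy sketch, not part of the original notes; the helper name `linearly_independent` is my own), linear independence can be tested by comparing the rank of the matrix whose columns are the given vectors with the number of vectors:

```python
import numpy as np

def linearly_independent(vectors):
    """Return True if the given 1-D arrays are linearly independent."""
    A = np.column_stack(vectors)              # the vectors become the columns of A
    return np.linalg.matrix_rank(A) == len(vectors)

print(linearly_independent([np.array([1, 0]), np.array([1, 1])]))    # True
print(linearly_independent([np.array([1, 0]), np.array([-1, 0])]))   # False
```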
Basis
A set of vectors is said to be a basis for the vector space $S$ if it is linearly independent and spans $S$.
For example, $\{(1, 0), (1, 1)\}$ is a basis for $\mathbb{R}^2$: it is linearly independent and spans $\mathbb{R}^2$.
Properties of basis:
- A vector space does not have a unique basis.
- If $\b{x}_1, \b{x}_2, …, \b{x}_k$ is a basis for $S$, then any $\b{x} \in S$ can be uniquely represented using $\b{x}_1, \b{x}_2, …, \b{x}_k$.
- Any two bases of a vector space have the same cardinality.
- Let $\b{e}_i$ denote an n-dimensional vector whose i-th element is 1 and the remaining elements are 0’s. Then, the set $\b{e}_1, \b{e}_2, …, \b{e}_n$ forms a standard basis for $\mathbb{R}^n$.
- A basis for the vector space $S$ is a maximal linearly independent set of vectors in $S$.
- A basis for the vector space $S$ is a minimal set of vectors which spans $S$.
Property 2 can be easily proved by contradiction.
Dimension of Vector Space
The dimension of vector space $S$ is the cardinality of a basis of $S$.
The dimension of $\mathbb{R}^n$ is $n$.
Function
A function $f$ from a set $A$ to a set $B$ is a rule that assigns to each $x$ in $A$ a unique element $f(x)$ in $B$. This function can be represented by $$f:A\rightarrow B$$
Note:
- $A$: Domain of $f$
- $\{y \in B: \exists x[y = f(x)] \}$: Range of f
- Range of $f \subseteq B$
Norm
A norm on $\mathbb{R}^n$ is a real-valued function $\Vert \cdot \Vert: \mathbb{R}^n \rightarrow \mathbb{R}$ which obeys
- $\Vert \b{x} \Vert \geq 0$ for every $\b{x} \in \mathbb{R}^n$, and $\Vert \b{x} \Vert = 0$ iff $\b{x} = 0$.
- $\Vert \alpha \b{x} \Vert = |\alpha|\Vert \b{x} \Vert$ for every $\b{x} \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$.
- $\Vert \b{x} + \b{y}\Vert \leq \Vert \b{x} \Vert + \Vert \b{y} \Vert$ for every $\b{x} \in \mathbb{R}^n$ and $\b{y} \in \mathbb{R}^n$.
Let $\b{x} \in \mathbb{R}^n$. Some popular norms:
- $L_2$ or Euclidean norm: $\Vert \b{x} \Vert_2 = (\sum_{i=1}^{n} (x_i)^2)^{\frac{1}{2}}$
- $L_1$ norm: $\Vert \b{x} \Vert_1 = \sum_{i=1}^{n} |x_i|$
- $L_{\infty}$ norm: $\Vert \b{x} \Vert_{\infty} = \underset{i=1,\cdots,n}{\max} |x_i|$

In general, the $L_p$ norm is defined as $\Vert \b{x} \Vert_{p} = (\sum_{i=1}^{n} | x_i | ^p)^{\frac{1}{p}}$.
If $\Vert \cdot \Vert_p$ and $\Vert \cdot \Vert_q$ are any two norms on $\mathbb{R}^n$, then there exist positive constants $\alpha$ and $\beta$ such that $$\alpha \Vert \b{x} \Vert_p \leq \Vert \b{x} \Vert_q \leq \beta \Vert \b{x} \Vert_p$$ for any $\b{x} \in \mathbb{R}^n$.
So whether an optimization algorithm converges does not depend on which norm its stopping criterion uses.
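These norms, and the equivalence bounds between them, are easy to check numerically. Below is a small sketch using numpy (not part of the original notes); the test vector and the particular bound $\Vert \b{x} \Vert_{\infty} \leq \Vert \b{x} \Vert_2 \leq \sqrt{n}\,\Vert \b{x} \Vert_{\infty}$ are just illustrative choices:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1   = np.linalg.norm(x, 1)       # sum of |x_i|   -> 8.0
l2   = np.linalg.norm(x, 2)       # Euclidean norm -> sqrt(26)
linf = np.linalg.norm(x, np.inf)  # max of |x_i|   -> 4.0

# one instance of norm equivalence: ||x||_inf <= ||x||_2 <= sqrt(n) * ||x||_inf
n = x.size
assert linf <= l2 <= np.sqrt(n) * linf
print(l1, l2, linf)
```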
Inner Product
Let $\b{x}, \b{y} \in \mathbb{R}^n$ and $\b{x} \neq \b{0} \neq \b{y}$. The inner or dot product is defined as $$\b{x} \cdot \b{y} \equiv \b{x}^T \b{y} = \sum_{i=1}^{n} x_i \cdot y_i = \Vert \b{x} \Vert \cdot \Vert \b{y} \Vert \cos \theta$$ where $\theta$ is the angle between $\b{x}$ and $\b{y}$.
Note:
- $\b{x}^T \b{x} = \Vert \b{x} \Vert ^2$
- $\b{x}^T \b{y} = \b{y}^T \b{x}$
- $ \b{x} \cdot \b{y} \leq \Vert \b{x}\Vert \cdot \Vert \b{y}\Vert$ (Cauchy-Schwarz inequality)
Orthogonality
Let $\b{x} \in \mathbb{R}^n$ and $\b{y} \in \mathbb{R}^n$. $\b{x}$ and $\b{y}$ are said to be perpendicular or orthogonal to each other if $\b{x}^T \b{y} = 0$. Two subspaces $S$ and $T$ of the same vector space $\mathbb{R}^n$ are orthogonal if every vector $\b{x} \in S$ is orthogonal to every vector $\b{y} \in T$, i.e. $\b{x}^T \b{y} = 0 \;\; \forall \b{x} \in S, \b{y} \in T$.
Mutual Orthogonality
Vectors $\b{x}_1, \b{x}_2, \cdots, \b{x}_k \in \mathbb{R}^n$ are said to be mutually orthogonal if $\b{x}_i \cdot \b{x}_j = 0$ for all $i \neq j$. If, in addition, $\Vert \b{x}_i \Vert = 1$ for every $i$, the set $\{ \b{x}_1, \cdots, \b{x}_k\}$ is said to be orthonormal.
It’s easy to show that if $\b{x}_1, \cdots, \b{x}_k$ are mutually orthogonal nonzero vectors, then they are linearly independent.
For a square matrix $Q$ [proof]:
- The columns of $Q$ being orthogonal does not imply that the rows of $Q$ are orthogonal.
- The columns of $Q$ being orthonormal is equivalent to the rows of $Q$ being orthonormal.
Gram-Schmidt Procedure
To produce an orthonormal basis from a given basis $\b{x}_1, \b{x}_2, \cdots, \b{x}_k \in \mathbb{R}^n$, we use the Gram-Schmidt procedure.
Given $\b{x}_1, \b{x}_2, \b{x}_3 \in \mathbb{R}^3$, to produce an orthonormal basis $\b{y}_1, \b{y}_2, \b{y}_3$.
- Set $\b{y}_1 = \frac{\b{x}_1}{\Vert \b{x}_1 \Vert}$.
- Remove $\b{x}_2$'s component in the $\b{y}_1$ direction: $$\b{z}_2 = \b{x}_2 - (\b{x}_2^T \b{y}_1)\b{y}_1$$ Now $\b{z}_2$ is orthogonal to $\b{y}_1$ (and hence to $\b{x}_1$). Set $\b{y}_2 = \frac{\b{z}_2}{\Vert \b{z}_2 \Vert}$.
- Remove $\b{x}_3$'s components in the $\b{y}_1$ and $\b{y}_2$ directions: $$\b{z}_3 = \b{x}_3 - (\b{x}_3^T \b{y}_1)\b{y}_1 - (\b{x}_3^T \b{y}_2)\b{y}_2$$ Now $\b{z}_3$ is orthogonal to both $\b{y}_1$ and $\b{y}_2$. Set $\b{y}_3 = \frac{\b{z}_3}{\Vert \b{z}_3 \Vert}$.
It’s easy to extend this procedure to $\mathbb{R}^n$.
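The steps above translate directly into code. The following is a minimal sketch for general $\mathbb{R}^n$ (numpy; the helper name `gram_schmidt` is my own), assuming the input columns are linearly independent:

```python
import numpy as np

def gram_schmidt(X):
    """Orthonormalize the columns of X (assumed linearly independent)."""
    Y = np.zeros_like(X, dtype=float)
    for i in range(X.shape[1]):
        z = X[:, i].copy()
        for j in range(i):                      # remove components along previous y_j
            z -= (X[:, i] @ Y[:, j]) * Y[:, j]
        Y[:, i] = z / np.linalg.norm(z)         # normalize
    return Y

X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Y = gram_schmidt(X)
print(np.allclose(Y.T @ Y, np.eye(3)))          # True: the columns are orthonormal
```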
Matrix
- Diagonal Matrix: A square matrix $\Lambda$ such that $\Lambda_{ij} = 0 \;\; i \neq j$.
- Identity Matrix: A diagonal matrix $I$ such that $I_{ii} = 1 \;\; \forall i$.
Determinant
From wikipedia:
Let $A$ be an $n\times n$ matrix. There are various ways to define the determinant of a square matrix. Perhaps the most natural way is expressed in terms of the columns of the matrix. If we write an $n\times n$ matrix in terms of its column vectors $$A = [\b{a}_1, \b{a}_2, \cdots, \b{a}_n]$$ where the $\b{a}_i$ are vectors of size n, then the determinant of $A$ is defined so that $$\det[\b{a}_1, \cdots, b \b{a}_j + c \b{v}, \cdots, \b{a}_n] = b \det(A) + c \det[\b{a}_1, \cdots, \b{v}, \cdots, \b{a}_n]$$ $$\det[\b{a}_1, \cdots, \b{a}_j, \b{a}_{j+1}, \cdots, \b{a}_n] = -\det[\b{a}_1, \cdots, \b{a}_{j+1}, \b{a}_j, \cdots, \b{a}_n]$$ $$\det(I) = 1$$
- if $A$ contains two identical columns, $\det(A) = 0$
- $\det(AB) = \det(A)\det(B)$ [proof]
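Both properties are easy to check numerically for random matrices; the following numpy sketch (my own illustration, not from the notes) does exactly that:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# det(AB) = det(A) det(B)
print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))  # True

# a matrix with two identical columns has determinant 0
C = A.copy()
C[:, 1] = C[:, 0]
print(np.isclose(np.linalg.det(C), 0.0))                                      # True
```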
Rank
Let $A \in \mathbb{R}^{m\times n}$.
The subspace of $\mathbb{R}^m$ spanned by the column vectors of $A$ is called the column space of $A$. The subspace of $\mathbb{R}^n$ spanned by the row vectors of $A$ is called the row space of $A$.
Column Rank: The dimension of the column space.
Row Rank: The dimension of the row space.
- The column rank of a matrix equals its row rank, and this common value is called the rank of $A$. [proof]
- The rank of a matrix is 0 iff it’s a zero matrix.
- The matrices of smallest nonzero rank are the rank-one matrices. Every matrix of rank one has the simple form $A = \b{u}\b{v}^T$ for some column vectors $\b{u}$ and $\b{v}$.
Invertible
A square matrix $A$ is said to be invertible if there exists a matrix $B$ such that $AB = BA = I$. There is at most one such $B$; it is denoted by $A^{-1}$.
- If $A, B$ are square matrices, then $AB = I \Rightarrow BA = I$ [proof]
- A product of invertible matrices is invertible and $(AB)^{-1} = B^{-1}A^{-1}$
- If $\det(A) \neq 0$, $A$ is invertible. ($\det(A)$ is the determinant of $A$)
- A matrix $A$ is called an orthogonal matrix if $A^{-1} = A^T$. So the column (row) vectors of an orthogonal matrix form an orthonormal basis.
Matrix-vector Multiplication
The effect of matrix-vector multiplication (here the matrix is square) is to rotate and/or scale the vector. For example, $\begin{pmatrix} 3 & 2 \\ 2 & 0 \end{pmatrix} \times \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 5 \\ 2 \end{pmatrix}$
Sometimes matrix-vector multiplication only scales the vector without changing the direction (changing to opposite direction is allowed), and this leads to the important concept of eigenvector and eigenvalue. For example, $\begin{pmatrix} 3 & 2 \\ 2 & 0 \end{pmatrix} \times \begin{pmatrix} 2 \\ 1 \end{pmatrix} = \begin{pmatrix} 8 \\ 4 \end{pmatrix} = 4 \times \begin{pmatrix} 2 \\ 1 \end{pmatrix}$
Eigenvalues and Eigenvectors
Let $A \in \mathbb{R}^{n\times n}$. The eigenvalues and eigenvectors of $A$ are the real or complex scalars $\lambda$ and n-dimensional vectors $\b{x}$ such that $$A\b{x} = \lambda \b{x} \;\; \b{x} \neq \b{0}$$
- $A\b{x} = \lambda \b{x} \Rightarrow (A - \lambda I)\b{x} = \b{0}$ with $\b{x} \neq \b{0}$, which implies that $A - \lambda I$ is not full-rank, so $\det(A - \lambda I) = 0$. This is called the characteristic equation of $A$. It is a polynomial equation of degree n in $\lambda$, so it has n roots (counted with multiplicity), and these roots are the eigenvalues of $A$. Not all of the roots need be real; some of them may be complex.
- If an eigenvalue $\lambda$ equals 0, then $A\b{x} = 0\b{x} = \b{0}$: the eigenvectors with eigenvalue 0 make up the nullspace of $A$. Hence $A$ is singular iff $\lambda = 0$ is an eigenvalue of $A$.
- $\det(A) = \prod_{i = 1}^{n} \lambda_i$, so the matrix $A$ is invertible if and only if all the eigenvalues $\lambda_i$ are nonzero.
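Using the $2\times 2$ matrix from the matrix-vector multiplication example above, these facts can be verified numerically (a numpy sketch, not part of the original notes):

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 0.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                            # eigenvalues 4 and -1 (order may vary)

# det(A) equals the product of the eigenvalues
print(np.isclose(np.linalg.det(A), np.prod(eigvals)))   # True

# A x = lambda x holds for each eigenpair (eigenvectors are the columns of eigvecs)
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))    # prints True for each eigenpair
```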
Symmetric Matrix
Let $A \in \mathbb{R}^{n\times n}$. The matrix $A$ is said to be symmetric if $A^T = A$.
Let $A \in \mathbb{R}^{n\times n}$ be symmetric, then
- $A$ has n real eigenvalues $\lambda_1, \lambda_2, \cdots, \lambda_n$ [proof]
- eigenvectors of different eigenvalues are orthogonal [proof]
- If $\lambda_i$ is a repeated root with multiplicity $m \geq 2$, then there exist $m$ orthonormal eigenvectors corresponding to $\lambda_i$ [proof]
- A corresponding set of eigenvectors $\{ \b{x}_1, \b{x}_2, \cdots, \b{x}_n \}$ can therefore be chosen to be orthonormal. Let $S = (\b{x}_1, \b{x}_2, \cdots, \b{x}_n)$; then $S$ is an orthogonal matrix and $A = S \Lambda S^T$, where $\Lambda = \text{diag}(\lambda_1, \cdots, \lambda_n)$. In other words, a symmetric matrix is always diagonalizable.
Note that a matrix being diagonalizable is not the same as it being invertible; for example, $\bigl(\begin{smallmatrix} 0 & 0 \\\\ 0 & 0 \end{smallmatrix}\bigr)$ is diagonalizable (it is already diagonal), but it is clearly not invertible.
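A small numerical illustration of the spectral decomposition of a symmetric matrix (a numpy sketch; the example matrix is my own choice, and `numpy.linalg.eigh` is the routine for symmetric matrices):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # symmetric

lam, S = np.linalg.eigh(A)                 # real eigenvalues, orthonormal eigenvectors
Lambda = np.diag(lam)

print(np.allclose(S.T @ S, np.eye(2)))     # True: S is an orthogonal matrix
print(np.allclose(S @ Lambda @ S.T, A))    # True: A = S Lambda S^T
```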
Quadratic Form
Let $A \in \mathbb{R}^{n\times n}$ be a symmetric matrix and consider $f(\b{x}) = \b{x}^T A \b{x}$, a pure quadratic form. (Note that $\b{x}^T A \b{x} = \sum_i \sum_j x_i A_{ij} x_j$, so every pure quadratic function can be expressed as $\b{x}^T A \b{x}$ with a symmetric $A$.)
| $A$ is said to be | if |
|---|---|
| positive definite | $\b{x}^T A \b{x} \gt 0$ for every nonzero $\b{x} \in \mathbb{R}^n$ |
| positive semi-definite | $\b{x}^T A \b{x} \geq 0$ for every $\b{x} \in \mathbb{R}^n$ |
| negative definite | $\b{x}^T A \b{x} \lt 0$ for every nonzero $\b{x} \in \mathbb{R}^n$ |
| negative semi-definite | $\b{x}^T A \b{x} \leq 0$ for every $\b{x} \in \mathbb{R}^n$ |
| indefinite | $\b{x}^T A \b{x}$ takes both positive and negative values |
- $A$ is positive definite iff all its eigenvalues are positive [proof]
- $A$ is indefinite iff it has both positive and negative eigenvalues
- For a symmetric positive definite matrix $A$, we can define its square root as $A^{1/2} = S \Lambda^{1/2} S^T$, in which $S$ and $\Lambda$ come from the spectral decomposition $A = S \Lambda S^T$ above and $\Lambda^{1/2} = \text{diag}(\sqrt{\lambda_1}, \cdots, \sqrt{\lambda_n})$. Then $A = A^{1/2}A^{1/2}$, which is easy to prove.
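The square-root construction can be checked numerically as well (a numpy sketch; the symmetric positive definite example matrix is my own choice):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # symmetric positive definite (eigenvalues 3 and 1)

lam, S = np.linalg.eigh(A)
A_half = S @ np.diag(np.sqrt(lam)) @ S.T   # A^{1/2} = S Lambda^{1/2} S^T

print(np.allclose(A_half @ A_half, A))     # True: A^{1/2} A^{1/2} = A
```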
Solution of Ax=b
Let $A \in \mathbb{R}^{n\times n}$ be symmetric and positive definite.
- The solution of $A\b{x}=\b{b}$ is $\b{x}^{*} = A^{-1} \b{b}$: since $A$ is symmetric and positive definite, all of its eigenvalues are positive, so $A$ is invertible.
- However, explicitly computing the inverse is not numerically stable.
- Instead, use the Cholesky decomposition $A = L L^T$:
  - The given system of equations becomes $L L^T \b{x} = \b{b}$.
  - Solve the triangular system $L\b{y} = \b{b}$ using forward substitution to get $\b{y}$.
  - Solve the triangular system $L^T \b{x} = \b{y}$ using backward substitution to get $\b{x}^{*}$.
- The Cholesky decomposition is numerically stable.
- Every Hermitian positive-definite matrix (and thus also every real symmetric positive-definite matrix) has a unique Cholesky decomposition, in which $L$ is a lower triangular matrix with real and positive diagonal entries.
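A minimal sketch of this solve-by-Cholesky scheme (using numpy for the factorization and `scipy.linalg.solve_triangular` for the two triangular solves; the example system is my own choice):

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])                 # symmetric positive definite
b = np.array([2.0, 5.0])

L = np.linalg.cholesky(A)                  # A = L L^T, L lower triangular
y = solve_triangular(L, b, lower=True)     # forward substitution:  L y = b
x = solve_triangular(L.T, y, lower=False)  # backward substitution: L^T x = y

print(np.allclose(A @ x, b))               # True
```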
Interval
Let $a, b \in \mathbb{R}$. The closed interval $[a, b]$ denotes the set, $\{ x \in \mathbb{R}: a \leq x \leq b \}$. The set $\{ x \in \mathbb{R}: a \lt x \lt b\}$ represents the open interval $(a, b)$.
Norm ball
Let $\b{x}_0 \in \mathbb{R}^n$. A norm ball of radius $r > 0$ and centre $\b{x}_0$ is given by $\{\b{x} \in \mathbb{R}^n: \Vert \b{x} - \b{x}_0 \Vert \leq r \}$ and will be denoted by $B[\b{x}_0, r]$.
We will use $B(\b{x}_0, r)$ to denote the open ball $\{\b{x} \in \mathbb{R}^n: \Vert \b{x} - \b{x}_0 \Vert \lt r \}$.
Open/Closed Set
Let $\b{x} \in S \subseteq \mathbb{R}^n$. $\b{x}$ is called an interior point of $S$ if there exists $r \gt 0$ such that $B[\b{x}, r] \subseteq S$. The set of all such points interior to $S$ is called the interior of $S$ and is denoted by $int(S)$.
Let $S \subset \mathbb{R}^n$. $\b{x} \in \mathbb{R}^n$ belongs to the closure of $S$, $cl(S)$ if for each $\varepsilon \gt 0$, $S \cap B[\b{x}, \varepsilon] \neq \emptyset$
Example:
Let $S = (1, 2] \cup [3, 4)$. Then $cl(S) = [1, 2] \cup [3, 4]$ and $int(S) = (1, 2) \cup (3, 4)$.
A set $S \subseteq \mathbb{R}^n$ is said to be an open set if $S = int(S)$.
A set $S \subseteq \mathbb{R}^n$ is said to be a closed set if $S = cl(S)$.
Example:
- open set: $B(0, r)$, $(1, 2) \cup (3, 4)$
- closed set: $[1, 2] \cup [3, 4]$
Bounded and compact
The boundary of a set S is defined as $bd(S) = cl(S)\setminus int(S)$.
A set $S \subset \mathbb{R}^n$ is said to be bounded if there exist $R \; (0 \lt R \lt \infty)$ and $\b{x} \in \mathbb{R}^n$, such that $S \subset B(\b{x}, R)$.
Example:
- $(1, 2] \cup [3, 100)$ is a bounded set
- $(0, \infty)$ is not a bounded set
A set $S \subset \mathbb{R}^n$ is said to be compact if it's closed and bounded.
Example:
- $[0, 100] \cup [1000, 10000]$ is a compact set
Sequences
Let $S \subseteq \mathbb{R}^n$ and $\{ \b{x}^k \}$ be a sequence of points belonging to $S$.
A sequence $\{ \b{x}^k \}$ converges to $\b{x}^{*}$, if for any given $\varepsilon \gt 0$, there is a positive integer $K$ such that $$ \Vert \b{x}^k - \b{x}^* \Vert \leq \varepsilon \;\; \forall k \geq K$$ We write this as $\b{x}^k \rightarrow \b{x}^*$ or $\lim_{k \rightarrow \infty} \b{x}^k = \b{x}^*$.
Continuous Function
Let $S \subseteq \mathbb{R}^n$. A function $f: S\rightarrow \mathbb{R}$ is said to be continuous at $\bar{\b{x}} \in S$ if for any given $\varepsilon \gt 0$ there exists a $\delta \gt 0$ such that $\b{x} \in S$ and $\Vert \b{x} - \bar{\b{x}} \Vert \lt \delta$ implies that $|f(\b{x}) - f(\bar{\b{x}})| \lt \varepsilon$.
- $f$ is said to be continuous on $A \subset \mathbb{R}^n$ if it’s continuous at each point of $A$
- When we say $f$ is continuous, we mean that it’s continuous on its domain
- $\mathcal{C}$: Class of all continuous functions
Lipschitz Continuous
Let $S \subseteq \mathbb{R}^n$. A function $f: S \rightarrow \mathbb{R}^m$ is said to be Lipschitz continuous if $$ \Vert f(\b{x}_1) - f(\b{x}_2) \Vert \leq L \Vert \b{x}_1 - \b{x}_2 \Vert \;\; \forall \b{x}_1, \b{x}_2 \in S$$ where $L$ is a positive constant.
So Lipschitz continuity puts an upper bound on how much the function values at any two points can differ, proportional to the distance between those points.
Gradient
A continuous function, $f: \mathbb{R}^n \rightarrow \mathbb{R}$, is said to be continuously differentiable at $\b{x} \in \mathbb{R}^n$, if $\frac{\p f}{\p x_i}(\b{x})$ exists and is continuous, $i = 1, 2, \cdots, n$.
- $\mathcal{C}^p$: Class of functions whose first $p$ derivatives are continuous
- Assumption: $f \in \mathcal{C}^1$
We define the gradient of $f$ at $\b{x}$ to be the vector $$g(\b{x}) = \nabla f(\b{x}) = (\frac{\p f(\b{x})}{\p x_1}, \frac{\p f(\b{x})}{\p x_2}, \cdots, \frac{\p f(\b{x})}{\p x_n})^T$$
- The gradient is perpendicular to the level surface (contour line) [proof]
Directional Derivatives
Let $S \subset \mathbb{R}^n$ be an open set and $f: S \rightarrow \mathbb{R}$, $f$ continuously differentiable in $S$. Then, for any $\b{x} \in S$ and any nonzero $\b{d} \in \mathbb{R}^n$, the directional derivative of $f$ at $\b{x}$ in the direction of $\b{d}$, defined by $$\frac{\p f}{\p \b{d}}(\b{x}) = \lim_{\varepsilon \rightarrow 0} \frac{f(\b{x} + \varepsilon \b{d}) - f(\b{x})}{\varepsilon}$$ exists and equals $\nabla f(\b{x})^T \b{d}$.
For any unit direction $\b{d}$, the rate of change is $\nabla f(\b{x})^T \b{d} = \Vert \nabla f(\b{x})\Vert \Vert \b{d} \Vert \cos\theta = \Vert \nabla f(\b{x}) \Vert \cos\theta$, where $\theta$ is the angle between $\b{d}$ and $\nabla f(\b{x})$. When $\theta = 0$ the rate of change is maximized, which is why the gradient is the direction of steepest ascent (and $-\nabla f(\b{x})$ the direction of steepest descent).
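A small numerical check of the directional derivative formula (a sketch, not from the notes; the test function $f(\b{x}) = x_1^2 + 3 x_1 x_2$, the point, and the direction are my own choices):

```python
import numpy as np

def f(x):
    return x[0]**2 + 3.0 * x[0] * x[1]

def grad_f(x):
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

x = np.array([1.0, 2.0])
d = np.array([1.0, 1.0]) / np.sqrt(2.0)      # unit direction

eps = 1e-6
fd = (f(x + eps * d) - f(x)) / eps           # finite-difference directional derivative
print(fd, grad_f(x) @ d)                     # both approximately 11 / sqrt(2)
```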
Hessian
A continuously differentiable function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is said to be twice continuously differentiable at $\b{x} \in \mathbb{R}^n$, if $\frac{\p^2 f}{\p x_i \p x_j} (\b{x})$ exists and is continuous, $i, j = 1, 2, \cdots, n$.
- $\mathcal{C}^2$: Class of twice continuously differentiable functions
- saying “hessian is continuous” is equivalent to $f \in \mathcal{C}^2$
Let $f \in \mathcal{C}^2$. We denote the Hessian of $f$ at $\b{x}$ to be the matrix $$ H(x) = \nabla^2 f(\b{x}) = \begin{pmatrix} \frac{\p^2 f}{\p x_1^2} & \frac{\p^2 f}{\p x_1 \p x_2} & \cdots & \frac{\p^2 f}{\p x_1 \p x_n} \\\\ \frac{\p^2 f}{\p x_2 \p x_1} & \frac{\p^2 f}{\p x_2^2} & \cdots & \frac{\p^2 f}{\p x_2 \p x_n} \\\\ \vdots & \vdots & \ddots & \vdots \\\\ \frac{\p^2 f}{\p x_n \p x_1} & \frac{\p^2 f}{\p x_n \p x_2} & \cdots & \frac{\p^2 f}{\p x_n^2} \end{pmatrix} $$
- Hessian matrix is symmetric
Let $S \subset \mathbb{R}^n$ be an open set and $f: S \rightarrow \mathbb{R}$, $f$ twice continuously differentiable in $S$. Then, for any $\b{x} \in S$ and any nonzero $\b{d} \in \mathbb{R}^n$, the second directional derivative of $f$ at $\b{x}$ in the direction of $\b{d}$ equals $\b{d}^T \nabla^2 f(\b{x}) \b{d}$.
Taylor Series
$\mathcal{C}^\infty$: Class of all functions for which the derivative of any order is continuous.
Let $f: \mathbb{R} \rightarrow \mathbb{R}. f \in \mathcal{C}^\infty.$
Let $x^0$ be the point about which we write the Taylor series.
Suppose we use only $f'(x^0)$. Then $f(x)$ near $x^0$ can be approximated by the linear approximation $$f(x) \approx f(x^0) + f'(x^0)(x - x^0)$$
Similarly, using $f'(x^0)$ and $f''(x^0)$, the quadratic approximation of $f$ at $x^0$ is $$f(x) \approx f(x^0) + f'(x^0)(x - x^0) + \frac{1}{2}f''(x^0)(x - x^0)^2$$
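As a concrete illustration (my own choice of $f(x) = e^x$ about $x^0 = 0$, not from the notes), the linear and quadratic approximations compare as follows:

```python
import numpy as np

f  = np.exp          # f(x) = e^x, so f'(x) = f''(x) = e^x
x0 = 0.0
x  = 0.5

linear    = f(x0) + f(x0) * (x - x0)                    # 1 + x
quadratic = linear + 0.5 * f(x0) * (x - x0) ** 2        # 1 + x + x^2/2

print(f(x), linear, quadratic)   # 1.6487..., 1.5, 1.625 -- the quadratic is closer
```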
Truncated Taylor Series
- first order: Let $f: \mathbb{R}^n \rightarrow \mathbb{R}, f \in \mathcal{C}^1, \b{x}^0 \in \mathbb{R}^n$. Then, for every $\b{x} \in \mathbb{R}^n$, $$f(\b{x}) = f(\b{x}^0) + \nabla f(\bar{\b{x}})^T (\b{x} - \b{x}^0)$$ where $\bar{\b{x}}$ is some point that lies on the line segment joining $\b{x}$ and $\b{x}^0$; $\bar{\b{x}}$ depends on $\b{x}$, $\b{x}^0$ and $f$.
- second order: Let $f: \mathbb{R}^n \rightarrow \mathbb{R}, f \in \mathcal{C}^2, \b{x}^0 \in \mathbb{R}^n$. Then, for every $\b{x} \in \mathbb{R}^n$, $$f(\b{x}) = f(\b{x}^0) + \nabla f(\b{x}^0)^T (\b{x} - \b{x}^0) + \frac{1}{2} (\b{x} - \b{x}^0)^T \nabla^2 f(\bar{\b{x}}) (\b{x} - \b{x}^0)$$ where $\bar{\b{x}}$ is some point that lies on the line segment joining $\b{x}$ and $\b{x}^0$; $\bar{\b{x}}$ depends on $\b{x}$, $\b{x}^0$ and $f$.
The last term in each of the two expressions above is called the Lagrange remainder. Because we are working with multiple variables, the expressions above are written using matrix multiplication, but higher-order approximations can no longer be written conveniently in matrix form. Taking the third order as an example, the corresponding term has to be written out element-wise: $$\frac{1}{3!} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \frac{\p^3 f}{\p x_i \p x_j \p x_k}(\bar{\b{x}}) \, (x_i - x_i^0)(x_j - x_j^0)(x_k - x_k^0)$$
Matrix Derivative
Below, a lowercase letter $x$ denotes a scalar, a bold lowercase letter $\b{x}$ denotes a vector, and an uppercase letter $X$ denotes a matrix.
The derivative with respect to a matrix or vector can be laid out in different ways. For example, the derivative of a vector $\b{y} \in \mathbb{R}^m$ with respect to another vector $\b{x} \in \mathbb{R}^n$ can be written either as an $m\times n$ matrix or as an $n\times m$ matrix; the former is called Numerator-layout and the latter Denominator-layout. The table below shows the main differences between the two layouts.
| | Numerator-layout | Denominator-layout |
|---|---|---|
| $\frac{\p y}{\p \b{x}}$ (scalar by vector) | $\begin{pmatrix} \frac{\p y}{\p x_1} & \frac{\p y}{\p x_2} & \cdots & \frac{\p y}{\p x_n} \end{pmatrix}$ | $\begin{pmatrix} \frac{\p y}{\p x_1} \\\\[-1em] \frac{\p y}{\p x_2} \\\\[-1em] \vdots \\\\[-1em] \frac{\p y}{\p x_n} \end{pmatrix}$ |
| $\frac{\p \b{y}}{\p x}$ (vector by scalar) | $\begin{pmatrix} \frac{\p y_1}{\p x} \\\\[-1em] \frac{\p y_2}{\p x} \\\\[-1em] \vdots \\\\[-1em] \frac{\p y_m}{\p x} \end{pmatrix}$ | $\begin{pmatrix} \frac{\p y_1}{\p x} & \frac{\p y_2}{\p x} & \cdots & \frac{\p y_m}{\p x} \end{pmatrix}$ |
| $\frac{\p \b{y}}{\p \b{x}}$ (vector by vector) | $\begin{pmatrix} \frac{\p y_1}{\p x_1} & \frac{\p y_1}{\p x_2} & \cdots & \frac{\p y_1}{\p x_n} \\\\[-1em] \frac{\p y_2}{\p x_1} & \frac{\p y_2}{\p x_2} & \cdots & \frac{\p y_2}{\p x_n} \\\\[-1em] \vdots & \vdots & \ddots & \vdots \\\\[-1em] \frac{\p y_m}{\p x_1} & \frac{\p y_m}{\p x_2} & \cdots & \frac{\p y_m}{\p x_n} \end{pmatrix}$ | $\begin{pmatrix} \frac{\p y_1}{\p x_1} & \frac{\p y_2}{\p x_1} & \cdots & \frac{\p y_m}{\p x_1} \\\\[-1em] \frac{\p y_1}{\p x_2} & \frac{\p y_2}{\p x_2} & \cdots & \frac{\p y_m}{\p x_2} \\\\[-1em] \vdots & \vdots & \ddots & \vdots \\\\[-1em] \frac{\p y_1}{\p x_n} & \frac{\p y_2}{\p x_n} & \cdots & \frac{\p y_m}{\p x_n} \end{pmatrix}$ |
So the difference between the two is that in Numerator-layout the row index follows the numerator, whereas in Denominator-layout the row index follows the denominator.
The Denominator-layout is the one I encounter most often, so the matrix derivatives below are written in Denominator-layout.
The columns of a matrix $A \in \mathbb{R}^{m\times n}$ are $a_1$ through $a_n$, while the rows are $\tilde{a}_1^T$ through $\tilde{a}_m^T$.
- Derivative of $A\b{x}$, where $A \in \mathbb{R}^{m\times n}$ and $\b{x} \in \mathbb{R}^n$: in Denominator-layout the $(i, j)$ entry is $\frac{\p (A\b{x})_j}{\p x_i} = A_{ji}$, so $$\frac{\p A\b{x}}{\p \b{x}} = A^T$$
- Derivative of a linearly transformed input to a function: consider a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. Suppose we have a matrix $A \in \mathbb{R}^{n\times m}$ and a vector $\b{x} \in \mathbb{R}^m$. We wish to compute $\nabla_{\b{x}} f(A\b{x})$.
Note that $f(A\b{x})$ is a composition of multivariate functions, so we can apply the following chain rule to compute the derivative.
Suppose $u = f(x_1, x_2, \cdots, x_n)$ is a differentiable function of the variables $x_1, x_2, \cdots, x_n$, and all functions $g_j(t_1, t_2, \cdots, t_m), j = 1, \cdots, n$ are also differentiable. Then their composition $u = f(g_1, g_2, \cdots, g_n)$, as a function of $t_1, t_2, \cdots, t_m$, is also differentiable, and $$ \frac{\p u}{\p t_j} = \sum_{k=1}^{n} \frac{\p f}{\p x_k} \frac{\p g_k}{\p t_j} \;\;\; j = 1, \cdots, m$$
By the chain rule, writing $\b{z} = A\b{x}$ (so $z_k = \tilde{a}_k^T \b{x}$, where $\tilde{a}_k^T$ is the k-th row of $A$), we have $$\frac{\p f(A\b{x})}{\p x_i} = \sum_{k=1}^{n} \frac{\p f}{\p z_k}(A\b{x}) \frac{\p z_k}{\p x_i} = \sum_{k=1}^{n} \frac{\p f}{\p z_k}(A\b{x}) A_{ki}$$ so $$\nabla_{\b{x}} f(A\b{x}) = A^T \nabla f(A\b{x})$$
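This identity can be verified against finite differences (a numpy sketch, not from the notes; the choices of $f$ and $A$ are mine):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 1.0]])                      # A in R^{3x2}, so f: R^3 -> R and x in R^2

def f(z):
    return np.sum(z ** 2)                       # f(z) = ||z||^2, so grad f(z) = 2z

def grad_f(z):
    return 2.0 * z

x = np.array([1.0, -1.0])
analytic = A.T @ grad_f(A @ x)                  # A^T grad f(Ax)

# central finite differences with respect to x
eps, numeric = 1e-6, np.zeros_like(x)
for i in range(x.size):
    e = np.zeros_like(x)
    e[i] = eps
    numeric[i] = (f(A @ (x + e)) - f(A @ (x - e))) / (2.0 * eps)

print(np.allclose(analytic, numeric))           # True
```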
Reference: berkeley.edu, okstate.edu