02 - Mathematical Background
21 Mar 2014
These notes collect the mathematical concepts and results needed for studying Numerical Optimization. Proofs of the more important results are given separately in Mathematical Background in Detail; a link to the proof is given after each such result.
Set
Difference between two sets A and B: $A \setminus B = \{x: x \in A \;\; \text{and} \;\; x \notin B\}$
Supremum
The smallest possible real number $y$ satisfying $x \leq y$ for every $x \in A$. Denoted as $\sup \{x : x \in A\}$.
Infimum
The greatest possible real number $y$ satisfying $x \geq y$ for every $x \in A$. Denoted as $\inf \{x : x \in A\}$.
For example:
$A = \{x: 1 \leq x \lt 3\}$
- $\sup\{x : x \in A\} = 3 (\notin A)$
- $\inf\{x : x \in A\} = 1 (\in A)$
Vector Space:
A nonempty set $S$ is called a vector space if
- For any $\b{x}, \b{y} \in S$, $\b{x} + \b{y}$ is defined and is in $S$. Further,
$$\b{x} + \b{y} = \b{y} + \b{x} \;\; \text{(commutativity)}$$
$$\b{x} + (\b{y} + \b{z}) = (\b{x} + \b{y}) + \b{z} \;\; \text{(associativity)}$$
- There exists an element $\b{0} \in S$ such that $\b{x} + \b{0} = \b{0} + \b{x} = \b{x}$ for all $\b{x} \in S$.
- For any $\b{x} \in S$, there exists $\b{y} \in S$ such that $\b{x} + \b{y} = \b{0}$.
- For any $\b{x} \in S$ and $\alpha \in \mathbb{R}$, $\alpha \b{x}$ is defined and is in $S$. Further,
$$1\b{x} = \b{x} \;\; \forall \b{x}$$
- For any $\b{x}, \b{y} \in S$ and $\alpha, \beta \in \mathbb{R}$,
$$\alpha(\b{x} + \b{y}) = \alpha \b{x} + \alpha \b{y}$$ $$(\alpha + \beta)\b{x} = \alpha \b{x} + \beta \b{x}$$ $$\alpha(\beta \b{x}) = \alpha \beta \b{x}$$
Elements of $S$ are called vectors.
Notations:
- $\mathbb{R}$: vector space of real numbers
- $\mathbb{R}^n$: vector space of real $n \times 1$ vectors
- An n-vector $\b{x}$ is an array of $n$ scalars $x_1, x_2, ..., x_n$: $\b{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$, $\b{x}^T = (x_1, x_2, ..., x_n)$
- $\b{0}^T = (0, 0, ..., 0)$
- $\b{1}^T = (1, 1, ..., 1)$ (we also use $\b{e}$ to denote this vector)
Vector Subspace:
If $S$ and $T$ are vector spaces so that $S \subseteq T$, then $S$ is called a subspace of $T$.
All possible subspaces of $\mathbb{R}^2$ are $\mathbb{R}^2$, $\{(0, 0)\}$, and $\{(x_1, x_2): a x_1 + b x_2 = 0, a \in \mathbb{R}, b \in \mathbb{R}, a^2 + b^2 \neq 0\}$ (that is, all lines passing through $(0, 0)$).
$\{(x_1, x_2): a x_1 + b x_2 = c, a \in \mathbb{R}, b \in \mathbb{R}, a^2 + b^2 \neq 0, c \neq 0\}$ is called an affine space (it is not a subspace, since it does not contain $(0, 0)$).
Spanning Set:
A set of vectors $\b{x}_1, \b{x}_2, ..., \b{x}_k$ is said to span the vector space $S$ if any vector $\b{x} \in S$ can be represented as: $$\b{x} = \sum_{i=1}^{k}\alpha_i \b{x}_i \;\; \alpha_i \in \mathbb{R}$$
Linear Independence:
A set of vectors $\b{x}_1, \b{x}_2, ..., \b{x}_k$ is said to be linearly independent if $$\sum_{i=1}^{k} \alpha_i \b{x}_i = 0 \Rightarrow \alpha_i = 0 \;\; \forall i$$ Otherwise, they are linearly dependent, and at least one of them is a linear combination of the others.
For example, $(1, 0)$ and $(1, 1)$ are linearly independent, while $(1, 0)$ and $(-1, 0)$ are not.
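As a quick numerical sanity check (a numpy sketch, not part of the original notes; the helper name `linearly_independent` is my own), linear independence can be tested by comparing the rank of the matrix whose columns are the given vectors with the number of vectors:

```python
import numpy as np

def linearly_independent(vectors):
    """Return True if the given 1-D arrays are linearly independent."""
    A = np.column_stack(vectors)              # the vectors become the columns of A
    return np.linalg.matrix_rank(A) == len(vectors)

print(linearly_independent([np.array([1, 0]), np.array([1, 1])]))    # True
print(linearly_independent([np.array([1, 0]), np.array([-1, 0])]))   # False
```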
Basis
A set of vectors is said to be a basis for the vector space $S$ if it is linearly independent and spans $S$.
For example, $\{(1, 0), (1, 1)\}$ is a basis for $\mathbb{R}^2$: it is linearly independent and spans $\mathbb{R}^2$.
Properties of basis:
- A vector space does not have a unique basis.
- If $\b{x}_1, \b{x}_2, …, \b{x}_k$ is a basis for $S$, then any $\b{x} \in S$ can be uniquely represented using $\b{x}_1, \b{x}_2, …, \b{x}_k$.
- Any two bases of a vector space have the same cardinality.
- Let $\b{e}_i$ denote an n-dimensional vector whose i-th element is 1 and the remaining elements are 0’s. Then, the set $\b{e}_1, \b{e}_2, …, \b{e}_n$ forms a standard basis for $\mathbb{R}^n$.
- A basis for the vector space $S$ is a maximal linearly independent set of vectors in $S$.
- A basis for the vector space $S$ is a minimal set of vectors which spans $S$.
Property 2 can be easily proved by contradiction.
Dimension of Vector Space
The dimension of vector space $S$ is the cardinality of a basis of $S$.
The dimension of $\mathbb{R}^n$ is $n$.
Function
A function $f$ from a set $A$ to a set $B$ is a rule that assigns to each $x$ in $A$ a unique element $f(x)$ in $B$. This function can be represented by $$f:A\rightarrow B$$
Note:
- $A$: Domain of $f$
- $\{y \in B: \exists x[y = f(x)] \}$: Range of f
- Range of $f \subseteq B$
Norm
A norm on $\mathbb{R}^n$ is a real-valued function $\Vert \cdot \Vert: \mathbb{R}^n \rightarrow \mathbb{R}$ which obeys
- $\Vert \b{x} \Vert \geq 0$ for every $\b{x} \in \mathbb{R}^n$, and $\Vert \b{x} \Vert = 0$ iff $\b{x} = 0$.
- $\Vert \alpha \b{x} \Vert = |\alpha|\Vert \b{x} \Vert$ for every $\b{x} \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$.
- $\Vert \b{x} + \b{y}\Vert \leq \Vert \b{x} \Vert + \Vert \b{y} \Vert$ for every $\b{x} \in \mathbb{R}^n$ and $\b{y} \in \mathbb{R}^n$.
Let $\b{x} \in \mathbb{R}^n$. Some popular norms:
- $L_2$ or Euclidean norm: $\Vert \b{x} \Vert_2 = (\sum_{i=1}^{n} (x_i)^2)^{\frac{1}{2}}$
- $L_1$ norm: $\Vert \b{x} \Vert_1 = \sum_{i=1}^{n} |x_i|$
- $L_{\infty}$ norm: $\Vert \b{x} \Vert_{\infty} = \underset{i=1,\cdots,n}{\max} |x_i|$

In general, the $L_p$ norm is defined as $\Vert \b{x} \Vert_{p} = (\sum_{i=1}^{n} | x_i | ^p)^{\frac{1}{p}}$.
If $\Vert \cdot \Vert_p$ and $\Vert \cdot \Vert_q$ are any two norms on $\mathbb{R}^n$, then there exist positive constants $\alpha$ and $\beta$ such that $$\alpha \Vert \b{x} \Vert_p \leq \Vert \b{x} \Vert_q \leq \beta \Vert \b{x} \Vert_p$$ for any $\b{x} \in \mathbb{R}^n$.
So whether an optimization algorithm converges does not depend on which norm its stopping criterion uses.
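These norms, and the equivalence bounds between them, are easy to check numerically. Below is a small sketch using numpy (not part of the original notes); the test vector and the particular bound $\Vert \b{x} \Vert_{\infty} \leq \Vert \b{x} \Vert_2 \leq \sqrt{n}\,\Vert \b{x} \Vert_{\infty}$ are just illustrative choices:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1   = np.linalg.norm(x, 1)       # sum of |x_i|   -> 8.0
l2   = np.linalg.norm(x, 2)       # Euclidean norm -> sqrt(26)
linf = np.linalg.norm(x, np.inf)  # max of |x_i|   -> 4.0

# one instance of norm equivalence: ||x||_inf <= ||x||_2 <= sqrt(n) * ||x||_inf
n = x.size
assert linf <= l2 <= np.sqrt(n) * linf
print(l1, l2, linf)
```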
Inner Product
Let $\b{x}, \b{y} \in \mathbb{R}^n$ and $\b{x} \neq \b{0} \neq \b{y}$. The inner or dot product is defined as $$\b{x} \cdot \b{y} \equiv \b{x}^T \b{y} = \sum_{i=1}^{n} x_i \cdot y_i = \Vert \b{x} \Vert \cdot \Vert \b{y} \Vert \cos \theta$$ where $\theta$ is the angle between $\b{x}$ and $\b{y}$.
Note:
- $\b{x}^T \b{x} = \Vert \b{x} \Vert ^2$
- $\b{x}^T \b{y} = \b{y}^T \b{x}$
- $ \b{x} \cdot \b{y} \leq \Vert \b{x}\Vert \cdot \Vert \b{y}\Vert$ (Cauchy-Schwarz inequality)
Orthogonality
Let $\b{x} \in \mathbb{R}^n$ and $\b{y} \in \mathbb{R}^n$. $\b{x}$ and $\b{y}$ are said to be perpendicular or orthogonal to each other if $\b{x}^T \b{y} = 0$. Two subspaces $S$ and $T$ of the same vector space $\mathbb{R}^n$ are orthogonal if every vector $\b{x} \in S$ is orthogonal to every vector $\b{y} \in T$, i.e. $\b{x}^T \b{y} = 0 \;\; \forall \b{x} \in S, \b{y} \in T$.
Mutual Orthogonality
Vectors $\b{x}_1, \b{x}_2, \cdots, \b{x}_k \in \mathbb{R}^n$ are said to be mutually orthogonal if $\b{x}_i \cdot \b{x}_j = 0$ for all $i \neq j$. If, in addition, $\Vert \b{x}_i \Vert = 1$ for every $i$, the set $\{ \b{x}_1, \cdots, \b{x}_k\}$ is said to be orthonormal.
It’s easy to show that if $\b{x}_1, \cdots, \b{x}_k$ are mutually orthogonal nonzero vectors, then they are linearly independent.
For a square matrix $Q$ [proof]:
- The columns of $Q$ being orthogonal does not imply that the rows of $Q$ are orthogonal.
- The columns of $Q$ being orthonormal is equivalent to the rows of $Q$ being orthonormal.
Gram-Schmidt Procedure
To produce an orthonormal basis from a given basis $\b{x}_1, \b{x}_2, \cdots, \b{x}_k \in \mathbb{R}^n$, we use the Gram-Schmidt procedure.
Given $\b{x}_1, \b{x}_2, \b{x}_3 \in \mathbb{R}^3$, to produce an orthonormal basis $\b{y}_1, \b{y}_2, \b{y}_3$.
- Set $\b{y}_1 = \frac{\b{x}_1}{\Vert \b{x}_1 \Vert}$.
- Remove $\b{x}_2$'s component in the $\b{y}_1$ direction: $$\b{z}_2 = \b{x}_2 - (\b{x}_2^T \b{y}_1)\b{y}_1$$ Now $\b{z}_2$ is orthogonal to $\b{y}_1$ (and hence to $\b{x}_1$). Set $\b{y}_2 = \frac{\b{z}_2}{\Vert \b{z}_2 \Vert}$.
- Remove $\b{x}_3$'s components in the $\b{y}_1$ and $\b{y}_2$ directions: $$\b{z}_3 = \b{x}_3 - (\b{x}_3^T \b{y}_1)\b{y}_1 - (\b{x}_3^T \b{y}_2)\b{y}_2$$ Now $\b{z}_3$ is orthogonal to both $\b{y}_1$ and $\b{y}_2$. Set $\b{y}_3 = \frac{\b{z}_3}{\Vert \b{z}_3 \Vert}$.
It’s easy to extend this procedure to $\mathbb{R}^n$.
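The steps above translate directly into code. The following is a minimal sketch for general $\mathbb{R}^n$ (numpy; the helper name `gram_schmidt` is my own), assuming the input columns are linearly independent:

```python
import numpy as np

def gram_schmidt(X):
    """Orthonormalize the columns of X (assumed linearly independent)."""
    Y = np.zeros_like(X, dtype=float)
    for i in range(X.shape[1]):
        z = X[:, i].copy()
        for j in range(i):                      # remove components along previous y_j
            z -= (X[:, i] @ Y[:, j]) * Y[:, j]
        Y[:, i] = z / np.linalg.norm(z)         # normalize
    return Y

X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Y = gram_schmidt(X)
print(np.allclose(Y.T @ Y, np.eye(3)))          # True: the columns are orthonormal
```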
Matrix
- Diagonal Matrix: A square matrix $\Lambda$ such that $\Lambda_{ij} = 0 \;\; i \neq j$.
- Identity Matrix: A diagonal matrix $I$ such that $I_{ii} = 1 \;\; \forall i$.
Determinant
From wikipedia:
Let $A$ be an $n\times n$ matrix. There are various ways to define the determinant of a square matrix. Perhaps the most natural way is expressed in terms of the columns of the matrix. If we write an $n\times n$ matrix in terms of its column vectors $$A = [\b{a}_1, \b{a}_2, \cdots, \b{a}_n]$$ where the $\b{a}_i$ are vectors of size n, then the determinant of $A$ is defined so that $$\det[\b{a}_1, \cdots, b \b{a}_j + c \b{v}, \cdots, \b{a}_n] = b \det(A) + c \det[\b{a}_1, \cdots, \b{v}, \cdots, \b{a}_n]$$ $$\det[\b{a}_1, \cdots, \b{a}_j, \b{a}_{j+1}, \cdots, \b{a}_n] = -\det[\b{a}_1, \cdots, \b{a}_{j+1}, \b{a}_j, \cdots, \b{a}_n]$$ $$\det(I) = 1$$
- if $A$ contains two identical columns, $\det(A) = 0$
- $\det(AB) = \det(A)\det(B)$ [proof]
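Both properties are easy to check numerically for random matrices; the following numpy sketch (my own illustration, not from the notes) does exactly that:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# det(AB) = det(A) det(B)
print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))  # True

# a matrix with two identical columns has determinant 0
C = A.copy()
C[:, 1] = C[:, 0]
print(np.isclose(np.linalg.det(C), 0.0))                                      # True
```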
Rank
Let $A \in \mathbb{R}^{m\times n}$.
The subspace of $\mathbb{R}^m$ spanned by the column vectors of $A$ is called the column space of $A$. The subspace of $\mathbb{R}^n$ spanned by the row vectors of $A$ is called the row space of $A$.
Column Rank: The dimension of the column space.
Row Rank: The dimension of the row space.
- The column rank of a matrix equals its row rank, and this common value is called the rank of $A$. [proof]
- The rank of a matrix is 0 iff it’s a zero matrix.
- The matrices of smallest nonzero rank are the rank-one matrices. Every matrix of rank one has the simple form $A = \b{u}\b{v}^T$ for some column vectors $\b{u}$ and $\b{v}$.
Invertible
A square matrix $A$ is said to be invertible if there exists a matrix $B$ such that $AB = BA = I$. There is at most one such $B$; it is denoted by $A^{-1}$.
- If $A, B$ are square matrices, then $AB = I \Rightarrow BA = I$ [proof]
- A product of invertible matrices is invertible and $(AB)^{-1} = B^{-1}A^{-1}$
- If $\det(A) \neq 0$, $A$ is invertible. ($\det(A)$ is the determinant of $A$)
- A matrix $A$ is called an orthogonal matrix if $A^{-1} = A^T$. So the column (row) vectors of an orthogonal matrix form an orthonormal basis.
Matrix-vector Multiplication
The effect of matrix-vector multiplication (here the matrix is square) is to rotate and/or scale the vector. For example, $\begin{pmatrix} 3 & 2 \\ 2 & 0 \end{pmatrix} \times \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 5 \\ 2 \end{pmatrix}$
Sometimes matrix-vector multiplication only scales the vector without changing the direction (changing to opposite direction is allowed), and this leads to the important concept of eigenvector and eigenvalue. For example, $\begin{pmatrix} 3 & 2 \\ 2 & 0 \end{pmatrix} \times \begin{pmatrix} 2 \\ 1 \end{pmatrix} = \begin{pmatrix} 8 \\ 4 \end{pmatrix} = 4 \times \begin{pmatrix} 2 \\ 1 \end{pmatrix}$
Eigenvalues and Eigenvectors
Let $A \in \mathbb{R}^{n\times n}$. The eigenvalues and eigenvectors of $A$ are the real or complex scalars $\lambda$ and n-dimensional vectors $\b{x}$ such that $$A\b{x} = \lambda \b{x} \;\; \b{x} \neq \b{0}$$
- $A\b{x} = \lambda \b{x} \Rightarrow (A - \lambda I)\b{x} = \b{0}$ with $\b{x} \neq \b{0}$, which implies that $A - \lambda I$ is not full-rank, so $\det(A - \lambda I) = 0$. This is called the characteristic equation of $A$. It is a polynomial equation of degree n in $\lambda$, so it has n roots (counted with multiplicity), and these roots are the eigenvalues of $A$. Not all of the roots need be real; some of them may be complex.
- If an eigenvalue $\lambda$ equals 0, then $A\b{x} = 0\b{x} = \b{0}$: the eigenvectors with eigenvalue 0 make up the nullspace of $A$. Hence $A$ is singular iff $\lambda = 0$ is an eigenvalue of $A$.
- $\det(A) = \prod_{i = 1}^{n} \lambda_i$, so the matrix $A$ is invertible if and only if all the eigenvalues $\lambda_i$ are nonzero.
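Using the $2\times 2$ matrix from the matrix-vector multiplication example above, these facts can be verified numerically (a numpy sketch, not part of the original notes):

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 0.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                            # eigenvalues 4 and -1 (order may vary)

# det(A) equals the product of the eigenvalues
print(np.isclose(np.linalg.det(A), np.prod(eigvals)))   # True

# A x = lambda x holds for each eigenpair (eigenvectors are the columns of eigvecs)
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))    # prints True for each eigenpair
```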
Symmetric Matrix
Let $A \in \mathbb{R}^{n\times n}$. The matrix $A$ is said to be symmetric if $A^T = A$.
Let $A \in \mathbb{R}^{n\times n}$ be symmetric, then
- $A$ has n real eigenvalues $\lambda_1, \lambda_2, \cdots, \lambda_n$ [proof]
- eigenvectors of different eigenvalues are orthogonal [proof]
- If $\lambda_i$ is a repeated root with multiplicity $m \geq 2$, then there exist $m$ orthonormal eigenvectors corresponding to $\lambda_i$ [proof]
- A corresponding set of eigenvectors $\{ \b{x}_1, \b{x}_2, \cdots, \b{x}_n \}$ can therefore be chosen to be orthonormal. Let $S = (\b{x}_1, \b{x}_2, \cdots, \b{x}_n)$; then $S$ is an orthogonal matrix and $A = S \Lambda S^T$, where $\Lambda = \text{diag}(\lambda_1, \cdots, \lambda_n)$. In other words, a symmetric matrix is always diagonalizable.
Note that a matrix being diagonalizable is not the same as it being invertible; for example, $\bigl(\begin{smallmatrix} 0 & 0 \\\\ 0 & 0 \end{smallmatrix}\bigr)$ is diagonalizable (it is already diagonal), but it is clearly not invertible.
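A small numerical illustration of the spectral decomposition of a symmetric matrix (a numpy sketch; the example matrix is my own choice, and `numpy.linalg.eigh` is the routine for symmetric matrices):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # symmetric

lam, S = np.linalg.eigh(A)                 # real eigenvalues, orthonormal eigenvectors
Lambda = np.diag(lam)

print(np.allclose(S.T @ S, np.eye(2)))     # True: S is an orthogonal matrix
print(np.allclose(S @ Lambda @ S.T, A))    # True: A = S Lambda S^T
```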
Quadratic Form
Let $A \in \mathbb{R}^{n\times n}$ be a symmetric matrix and consider $f(\b{x}) = \b{x}^T A \b{x}$, a pure quadratic form. (Note that $\b{x}^T A \b{x} = \sum_i \sum_j x_i A_{ij} x_j$, so every pure quadratic function can be expressed as $\b{x}^T A \b{x}$ with a symmetric $A$.)
| $A$ is said to be | if |
|---|---|
| positive definite | $\b{x}^T A \b{x} \gt 0$ for every nonzero $\b{x} \in \mathbb{R}^n$ |
| positive semi-definite | $\b{x}^T A \b{x} \geq 0$ for every $\b{x} \in \mathbb{R}^n$ |
| negative definite | $\b{x}^T A \b{x} \lt 0$ for every nonzero $\b{x} \in \mathbb{R}^n$ |
| negative semi-definite | $\b{x}^T A \b{x} \leq 0$ for every $\b{x} \in \mathbb{R}^n$ |
| indefinite | $\b{x}^T A \b{x}$ takes both positive and negative values |
- $A$ is positive definite iff all its eigenvalues are positive [proof]
- $A$ is indefinite iff it has both positive and negative eigenvalues
- For a symmetric positive definite matrix $A$, we can define its square root as $A^{1/2} = S \Lambda^{1/2} S^T$, in which $S$ and $\Lambda$ come from the spectral decomposition $A = S \Lambda S^T$ above and $\Lambda^{1/2} = \text{diag}(\sqrt{\lambda_1}, \cdots, \sqrt{\lambda_n})$. Then $A = A^{1/2}A^{1/2}$, which is easy to prove.
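The square-root construction can be checked numerically as well (a numpy sketch; the symmetric positive definite example matrix is my own choice):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # symmetric positive definite (eigenvalues 3 and 1)

lam, S = np.linalg.eigh(A)
A_half = S @ np.diag(np.sqrt(lam)) @ S.T   # A^{1/2} = S Lambda^{1/2} S^T

print(np.allclose(A_half @ A_half, A))     # True: A^{1/2} A^{1/2} = A
```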
Solution of Ax=b
Let $A \in \mathbb{R}^{n\times n}$ be symmetric and positive definite.
- The solution of $A\b{x}=\b{b}$ is $\b{x}^{*} = A^{-1} \b{b}$: since $A$ is symmetric and positive definite, all of its eigenvalues are positive, so $A$ is invertible.
- However, explicitly computing the inverse is not numerically stable.
- Instead, use the Cholesky decomposition $A = L L^T$:
  - The given system of equations becomes $L L^T \b{x} = \b{b}$.
  - Solve the triangular system $L\b{y} = \b{b}$ using forward substitution to get $\b{y}$.
  - Solve the triangular system $L^T \b{x} = \b{y}$ using backward substitution to get $\b{x}^{*}$.
- The Cholesky decomposition is numerically stable.
- Every Hermitian positive-definite matrix (and thus also every real symmetric positive-definite matrix) has a unique Cholesky decomposition, in which $L$ is a lower triangular matrix with real and positive diagonal entries.
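A minimal sketch of this solve-by-Cholesky scheme (using numpy for the factorization and `scipy.linalg.solve_triangular` for the two triangular solves; the example system is my own choice):

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])                 # symmetric positive definite
b = np.array([2.0, 5.0])

L = np.linalg.cholesky(A)                  # A = L L^T, L lower triangular
y = solve_triangular(L, b, lower=True)     # forward substitution:  L y = b
x = solve_triangular(L.T, y, lower=False)  # backward substitution: L^T x = y

print(np.allclose(A @ x, b))               # True
```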
Interval
Let $a, b \in \mathbb{R}$. The closed interval $[a, b]$ denotes the set, $\{ x \in \mathbb{R}: a \leq x \leq b \}$. The set $\{ x \in \mathbb{R}: a \lt x \lt b\}$ represents the open interval $(a, b)$.
Norm ball
Let $\b{x}_0 \in \mathbb{R}^n$. A norm ball of radius $r > 0$ and centre $\b{x}_0$ is given by $\{\b{x} \in \mathbb{R}^n: \Vert \b{x} - \b{x}_0 \Vert \leq r \}$ and will be denoted by $B[\b{x}_0, r]$.
We will use $B(\b{x}_0, r)$ to denote the open ball $\{\b{x} \in \mathbb{R}^n: \Vert \b{x} - \b{x}_0 \Vert \lt r \}$.
Open/Closed Set
Let $\b{x} \in S \subseteq \mathbb{R}^n$. $\b{x}$ is called an interior point of $S$ if there exists $r \gt 0$ such that $B[\b{x}, r] \subseteq S$. The set of all such points interior to $S$ is called the interior of $S$ and is denoted by $int(S)$.
Let $S \subset \mathbb{R}^n$. $\b{x} \in \mathbb{R}^n$ belongs to the closure of $S$, $cl(S)$ if for each $\varepsilon \gt 0$, $S \cap B[\b{x}, \varepsilon] \neq \emptyset$
Example:
Let $S = (1, 2] \cup [3, 4)$. Then $cl(S) = [1, 2] \cup [3, 4]$ and $int(S) = (1, 2) \cup (3, 4)$.
A set $S \subseteq \mathbb{R}^n$ is said to be an open set if $S = int(S)$.
A set $S \subseteq \mathbb{R}^n$ is said to be a closed set if $S = cl(S)$.
Example:
- open set: $B(0, r)$, $(1, 2) \cup (3, 4)$
- closed set: $[1, 2] \cup [3, 4]$
Bounded and compact
The boundary of a set S is defined as $bd(S) = cl(S)\setminus int(S)$.
A set $S \subset \mathbb{R}^n$ is said to be bounded if there exist $R \; (0 \lt R \lt \infty)$ and $\b{x} \in \mathbb{R}^n$, such that $S \subset B(\b{x}, R)$.
Example:
- $(1, 2] \cup [3, 100)$ is a bounded set
- $(0, \infty)$ is not a bounded set
A set $S \subset \mathbb{R}^n$ is said to be compact if it's closed and bounded.
Example:
- $[0, 100] \cup [1000, 10000]$ is a compact set
Sequences
Let $S \subseteq \mathbb{R}^n$ and $\{ \b{x}^k \}$ be a sequence of points belonging to $S$.
A sequence $\{ \b{x}^k \}$ converges to $\b{x}^{*}$, if for any given $\varepsilon \gt 0$, there is a positive integer $K$ such that $$ \Vert \b{x}^k - \b{x}^* \Vert \leq \varepsilon \;\; \forall k \geq K$$ We write this as $\b{x}^k \rightarrow \b{x}^*$ or $\lim_{k \rightarrow \infty} \b{x}^k = \b{x}^*$.
Continuous Function
Let $S \subseteq \mathbb{R}^n$. A function $f: S\rightarrow \mathbb{R}$ is said to be continuous at $\bar{\b{x}} \in S$ if for any given $\varepsilon \gt 0$ there exists a $\delta \gt 0$ such that $\b{x} \in S$ and $\Vert \b{x} - \bar{\b{x}} \Vert \lt \delta$ implies that $|f(\b{x}) - f(\bar{\b{x}})| \lt \varepsilon$.
- $f$ is said to be continuous on $A \subset \mathbb{R}^n$ if it’s continuous at each point of $A$
- When we say $f$ is continuous, we mean that it’s continuous on its domain
- $\mathcal{C}$: Class of all continuous functions
Lipschitz Continuous
Let $S \subseteq \mathbb{R}^n$. A function $f: S \rightarrow \mathbb{R}^m$ is said to be Lipschitz continuous if $$ \Vert f(\b{x}_1) - f(\b{x}_2) \Vert \leq L \Vert \b{x}_1 - \b{x}_2 \Vert \;\; \forall \b{x}_1, \b{x}_2 \in S$$ where $L$ is a positive constant.
So Lipschitz continuity puts an upper bound on how much the function values at any two points can differ, proportional to the distance between those points.
Gradient
A continuous function, $f: \mathbb{R}^n \rightarrow \mathbb{R}$, is said to be continuously differentiable at $\b{x} \in \mathbb{R}^n$, if $\frac{\p f}{\p x_i}(\b{x})$ exists and is continuous, $i = 1, 2, \cdots, n$.
- $\mathcal{C}^p$: Class of functions whose first $p$ derivatives are continuous
- Assumption: $f \in \mathcal{C}^1$
We define the gradient of $f$ at $\b{x}$ to be the vector $$g(\b{x}) = \nabla f(\b{x}) = (\frac{\p f(\b{x})}{\p x_1}, \frac{\p f(\b{x})}{\p x_2}, \cdots, \frac{\p f(\b{x})}{\p x_n})^T$$
- The gradient is perpendicular to the level surface (contour line) [proof]
Directional Derivatives
Let $S \subset \mathbb{R}^n$ be an open set and $f: S \rightarrow \mathbb{R}$, $f$ continuously differentiable in $S$. Then, for any $\b{x} \in S$ and any nonzero $\b{d} \in \mathbb{R}^n$, the directional derivative of $f$ at $\b{x}$ in the direction of $\b{d}$, defined by $$\frac{\p f}{\p \b{d}}(\b{x}) = \lim_{\varepsilon \rightarrow 0} \frac{f(\b{x} + \varepsilon \b{d}) - f(\b{x})}{\varepsilon}$$ exists and equals $\nabla f(\b{x})^T \b{d}$.
For any unit direction $\b{d}$, the rate of change is $\nabla f(\b{x})^T \b{d} = \Vert \nabla f(\b{x})\Vert \Vert \b{d} \Vert \cos\theta = \Vert \nabla f(\b{x}) \Vert \cos\theta$, where $\theta$ is the angle between $\b{d}$ and $\nabla f(\b{x})$. When $\theta = 0$ the rate of change is maximized, which is why the gradient is the direction of steepest ascent (and $-\nabla f(\b{x})$ the direction of steepest descent).
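A small numerical check of the directional derivative formula (a sketch, not from the notes; the test function $f(\b{x}) = x_1^2 + 3 x_1 x_2$, the point, and the direction are my own choices):

```python
import numpy as np

def f(x):
    return x[0]**2 + 3.0 * x[0] * x[1]

def grad_f(x):
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

x = np.array([1.0, 2.0])
d = np.array([1.0, 1.0]) / np.sqrt(2.0)      # unit direction

eps = 1e-6
fd = (f(x + eps * d) - f(x)) / eps           # finite-difference directional derivative
print(fd, grad_f(x) @ d)                     # both approximately 11 / sqrt(2)
```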
Hessian
A continuously differentiable function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is said to be twice continuously differentiable at $\b{x} \in \mathbb{R}^n$, if $\frac{\p^2 f}{\p x_i \p x_j} (\b{x})$ exists and is continuous, $i, j = 1, 2, \cdots, n$.
- $\mathcal{C}^2$: Class of twice continuously differentiable functions
- saying “hessian is continuous” is equivalent to $f \in \mathcal{C}^2$
Let $f \in \mathcal{C}^2$. We denote the Hessian of $f$ at $\b{x}$ to be the matrix $$ H(x) = \nabla^2 f(\b{x}) = \begin{pmatrix} \frac{\p^2 f}{\p x_1^2} & \frac{\p^2 f}{\p x_1 \p x_2} & \cdots & \frac{\p^2 f}{\p x_1 \p x_n} \\\\ \frac{\p^2 f}{\p x_2 \p x_1} & \frac{\p^2 f}{\p x_2^2} & \cdots & \frac{\p^2 f}{\p x_2 \p x_n} \\\\ \vdots & \vdots & \ddots & \vdots \\\\ \frac{\p^2 f}{\p x_n \p x_1} & \frac{\p^2 f}{\p x_n \p x_2} & \cdots & \frac{\p^2 f}{\p x_n^2} \end{pmatrix} $$
- Hessian matrix is symmetric
Let $S \subset \mathbb{R}^n$ be an open set and $f: S \rightarrow \mathbb{R}$, $f$ twice continuously differentiable in $S$. Then, for any $\b{x} \in S$ and any nonzero $\b{d} \in \mathbb{R}^n$, the second directional derivative of $f$ at $\b{x}$ in the direction of $\b{d}$ equals $\b{d}^T \nabla^2 f(\b{x}) \b{d}$.
Taylor Series
$\mathcal{C}^\infty$: Class of all functions for which the derivative of any order is continuous.
Let $f: \mathbb{R} \rightarrow \mathbb{R}. f \in \mathcal{C}^\infty.$
Let $x^0$ be the point about which we write the Taylor series.
Suppose we use only $f'(x^0)$. Then $f(x)$ near $x^0$ can be approximated by the linear approximation $$f(x) \approx f(x^0) + f'(x^0)(x - x^0)$$
Similarly, using $f'(x^0)$ and $f''(x^0)$, the quadratic approximation of $f$ at $x^0$ is $$f(x) \approx f(x^0) + f'(x^0)(x - x^0) + \frac{1}{2}f''(x^0)(x - x^0)^2$$
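As a concrete illustration (my own choice of $f(x) = e^x$ about $x^0 = 0$, not from the notes), the linear and quadratic approximations compare as follows:

```python
import numpy as np

f  = np.exp          # f(x) = e^x, so f'(x) = f''(x) = e^x
x0 = 0.0
x  = 0.5

linear    = f(x0) + f(x0) * (x - x0)                    # 1 + x
quadratic = linear + 0.5 * f(x0) * (x - x0) ** 2        # 1 + x + x^2/2

print(f(x), linear, quadratic)   # 1.6487..., 1.5, 1.625 -- the quadratic is closer
```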
Truncated Taylor Series
- first order: Let $f: \mathbb{R}^n \rightarrow \mathbb{R}, f \in \mathcal{C}^1, \b{x}^0 \in \mathbb{R}^n$. Then, for every $\b{x} \in \mathbb{R}^n$, $$f(\b{x}) = f(\b{x}^0) + \nabla f(\bar{\b{x}})^T (\b{x} - \b{x}^0)$$ where $\bar{\b{x}}$ is some point that lies on the line segment joining $\b{x}$ and $\b{x}^0$; $\bar{\b{x}}$ depends on $\b{x}$, $\b{x}^0$ and $f$.
- second order: Let $f: \mathbb{R}^n \rightarrow \mathbb{R}, f \in \mathcal{C}^2, \b{x}^0 \in \mathbb{R}^n$. Then, for every $\b{x} \in \mathbb{R}^n$, $$f(\b{x}) = f(\b{x}^0) + \nabla f(\b{x}^0)^T (\b{x} - \b{x}^0) + \frac{1}{2} (\b{x} - \b{x}^0)^T \nabla^2 f(\bar{\b{x}}) (\b{x} - \b{x}^0)$$ where $\bar{\b{x}}$ is some point that lies on the line segment joining $\b{x}$ and $\b{x}^0$; $\bar{\b{x}}$ depends on $\b{x}$, $\b{x}^0$ and $f$.
The last term in each of the two expressions above is called the Lagrange remainder. Because we are working with multiple variables, the expressions above are written using matrix multiplication, but higher-order approximations can no longer be written conveniently in matrix form. Taking the third order as an example, the corresponding term has to be written out element-wise: $$\frac{1}{3!} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \frac{\p^3 f}{\p x_i \p x_j \p x_k}(\bar{\b{x}}) \, (x_i - x_i^0)(x_j - x_j^0)(x_k - x_k^0)$$
Matrix Derivative
Below, a lowercase letter $x$ denotes a scalar, a bold lowercase letter $\b{x}$ denotes a vector, and an uppercase letter $X$ denotes a matrix.
The derivative with respect to a matrix or vector can be laid out in different ways. For example, the derivative of a vector $\b{y} \in \mathbb{R}^m$ with respect to another vector $\b{x} \in \mathbb{R}^n$ can be written either as an $m\times n$ matrix or as an $n\times m$ matrix; the former is called Numerator-layout and the latter Denominator-layout. The table below shows the main differences between the two layouts.
| | Numerator-layout | Denominator-layout |
|---|---|---|
| $\frac{\p y}{\p \b{x}}$ (scalar by vector) | $\begin{pmatrix} \frac{\p y}{\p x_1} & \frac{\p y}{\p x_2} & \cdots & \frac{\p y}{\p x_n} \end{pmatrix}$ | $\begin{pmatrix} \frac{\p y}{\p x_1} \\\\[-1em] \frac{\p y}{\p x_2} \\\\[-1em] \vdots \\\\[-1em] \frac{\p y}{\p x_n} \end{pmatrix}$ |
| $\frac{\p \b{y}}{\p x}$ (vector by scalar) | $\begin{pmatrix} \frac{\p y_1}{\p x} \\\\[-1em] \frac{\p y_2}{\p x} \\\\[-1em] \vdots \\\\[-1em] \frac{\p y_m}{\p x} \end{pmatrix}$ | $\begin{pmatrix} \frac{\p y_1}{\p x} & \frac{\p y_2}{\p x} & \cdots & \frac{\p y_m}{\p x} \end{pmatrix}$ |
| $\frac{\p \b{y}}{\p \b{x}}$ (vector by vector) | $\begin{pmatrix} \frac{\p y_1}{\p x_1} & \frac{\p y_1}{\p x_2} & \cdots & \frac{\p y_1}{\p x_n} \\\\[-1em] \frac{\p y_2}{\p x_1} & \frac{\p y_2}{\p x_2} & \cdots & \frac{\p y_2}{\p x_n} \\\\[-1em] \vdots & \vdots & \ddots & \vdots \\\\[-1em] \frac{\p y_m}{\p x_1} & \frac{\p y_m}{\p x_2} & \cdots & \frac{\p y_m}{\p x_n} \end{pmatrix}$ | $\begin{pmatrix} \frac{\p y_1}{\p x_1} & \frac{\p y_2}{\p x_1} & \cdots & \frac{\p y_m}{\p x_1} \\\\[-1em] \frac{\p y_1}{\p x_2} & \frac{\p y_2}{\p x_2} & \cdots & \frac{\p y_m}{\p x_2} \\\\[-1em] \vdots & \vdots & \ddots & \vdots \\\\[-1em] \frac{\p y_1}{\p x_n} & \frac{\p y_2}{\p x_n} & \cdots & \frac{\p y_m}{\p x_n} \end{pmatrix}$ |
So the difference between the two is that in Numerator-layout the row index follows the numerator, whereas in Denominator-layout the row index follows the denominator.
The Denominator-layout is the one I encounter most often, so the matrix derivatives below are written in Denominator-layout.
The columns of a matrix $A \in \mathbb{R}^{m\times n}$ are $a_1$ through $a_n$, while the rows are $\tilde{a}_1^T$ through $\tilde{a}_m^T$.
- Derivative of $A\b{x}$, where $A \in \mathbb{R}^{m\times n}$ and $\b{x} \in \mathbb{R}^n$: in Denominator-layout the $(i, j)$ entry is $\frac{\p (A\b{x})_j}{\p x_i} = A_{ji}$, so $$\frac{\p A\b{x}}{\p \b{x}} = A^T$$
- Derivative of a linearly transformed input to a function: consider a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. Suppose we have a matrix $A \in \mathbb{R}^{n\times m}$ and a vector $\b{x} \in \mathbb{R}^m$. We wish to compute $\nabla_{\b{x}} f(A\b{x})$.
Note that $f(A\b{x})$ is a composition of multivariate functions, so we can apply the following chain rule to compute the derivative.
Suppose $u = f(x_1, x_2, \cdots, x_n)$ is a differentiable function of the variables $x_1, x_2, \cdots, x_n$, and all functions $g_j(t_1, t_2, \cdots, t_m), j = 1, \cdots, n$ are also differentiable. Then their composition $u = f(g_1, g_2, \cdots, g_n)$, as a function of $t_1, t_2, \cdots, t_m$, is also differentiable, and $$ \frac{\p u}{\p t_j} = \sum_{k=1}^{n} \frac{\p f}{\p x_k} \frac{\p g_k}{\p t_j} \;\;\; j = 1, \cdots, m$$
By the chain rule, writing $\b{z} = A\b{x}$ (so $z_k = \tilde{a}_k^T \b{x}$, where $\tilde{a}_k^T$ is the k-th row of $A$), we have $$\frac{\p f(A\b{x})}{\p x_i} = \sum_{k=1}^{n} \frac{\p f}{\p z_k}(A\b{x}) \frac{\p z_k}{\p x_i} = \sum_{k=1}^{n} \frac{\p f}{\p z_k}(A\b{x}) A_{ki}$$ so $$\nabla_{\b{x}} f(A\b{x}) = A^T \nabla f(A\b{x})$$
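This identity can be verified against finite differences (a numpy sketch, not from the notes; the choices of $f$ and $A$ are mine):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 1.0]])                      # A in R^{3x2}, so f: R^3 -> R and x in R^2

def f(z):
    return np.sum(z ** 2)                       # f(z) = ||z||^2, so grad f(z) = 2z

def grad_f(z):
    return 2.0 * z

x = np.array([1.0, -1.0])
analytic = A.T @ grad_f(A @ x)                  # A^T grad f(Ax)

# central finite differences with respect to x
eps, numeric = 1e-6, np.zeros_like(x)
for i in range(x.size):
    e = np.zeros_like(x)
    e[i] = eps
    numeric[i] = (f(A @ (x + e)) - f(A @ (x - e))) / (2.0 * eps)

print(np.allclose(analytic, numeric))           # True
```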
Reference: berkeley.edu, okstate.edu