02 - Mathematical Background

This note collects the mathematical concepts and results needed for studying Numerical Optimization. Proofs of some of the important results are given separately in Mathematical Background in Detail, and a link to the proof is placed after each such result.


Set

Difference between two sets A and B: $A \setminus B = \{x: x \in A \text{ and } x \notin B\}$


Supremum

The smallest possible real number $y$ satisfying $x \leq y$ for every $x \in A$. Denoted as $\sup \{x : x \in A\}$.

Infimum

The greatest possible real number $y$ satisfying $x \geq y$ for every $x \in A$. Denoted as $\inf \{x : x \in A\}$.

For example:

$A = \{x: 1 \leq x \lt 3\}$. Here $\sup A = 3$ (not attained, so $A$ has no maximum) and $\inf A = 1$ (attained, so it is also the minimum of $A$).


Vector Space:

A nonempty set $S$ is called a vector space if
  • For any $\b{x}, \b{y} \in S$, $\b{x} + \b{y}$ is defined and is in $S$. Further,
  • $$\b{x} + \b{y} = \b{y} + \b{x} \;\; \text{(commutativity)}$$ $$\b{x} + (\b{y} + \b{z}) = (\b{x} + \b{y}) + \b{z} \;\; \text{(associativity)}$$
  • There exists an element in $S$, $\b{0}$, such that $\b{x} + \b{0} = \b{0} + \b{x} = \b{x}$ for all $\b{x}$.
  • For any $\b{x} \in S$, there exists $\b{y} \in S$, such that $\b{x} + \b{y} = \b{0}$.
  • For any $\b{x} \in S$ and $\alpha \in \mathbb{R}$, $\alpha \b{x}$ is defined and is in $S$. Further,
  • $$1 \b{x} = \b{x} \;\; \forall \b{x}$$
  • For any $\b{x}, \b{y} \in S$ and $\alpha, \beta \in \mathbb{R}$,
  • $$\alpha(\b{x} + \b{y}) = \alpha \b{x} + \alpha \b{y}$$ $$(\alpha + \beta)\b{x} = \alpha \b{x} + \beta \b{x}$$ $$\alpha(\beta \b{x}) = \alpha \beta \b{x}$$

Elements of $S$ are called vectors.


Notations:

  • $\mathbb{R}$: vector space of real numbers
  • $\mathbb{R}^n$: vector space of real $n \times 1$ vectors
  • n-vector $\b{x}$ is an array of n scalars, $x_1, x_2, ..., x_n$
  • $\b{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$
  • $\b{x}^T = (x_1, x_2, ..., x_n)$
  • $\b{0}^T = (0, 0, ..., 0)$
  • $\b{1}^T = (1, 1, ..., 1)$ (We also use $\b{e}$ to denote this vector)

  • Vector Subspace:

    If $S$ and $T$ are vector spaces so that $S \subseteq T$, then $S$ is called a subspace of $T$.

    All possible subspaces of $\mathbb{R}^2$ are $\mathbb{R}^2$, $\{(0, 0)\}$, and $\{(x_1, x_2): a x_1 + b x_2 = 0, a \in \mathbb{R}, b \in \mathbb{R}, a^2 + b^2 \neq 0\}$ (i.e., all lines through the origin).

    $\{(x_1, x_2): a x_1 + b x_2 = c, a \in \mathbb{R}, b \in \mathbb{R}, a^2 + b^2 \neq 0, c \neq 0\}$ is called an affine space (it is not a subspace, since it does not contain the origin).


    Spanning Set:

    A set of vectors $\b{x}_1, \b{x}_2, ..., \b{x}_k$ is said to span the vector space $S$ if any vector $\b{x} \in S$ can be represented as: $$\b{x} = \sum_{i=1}^{k}\alpha_i \b{x}_i \;\; \alpha_i \in \mathbb{R}$$

    Linear Independence:

    A set of vectors $\b{x}_1, \b{x}_2, ..., \b{x}_k$ is said to be linearly independent if $$\sum_{i=1}^{k} \alpha_i \b{x}_i = \b{0} \Rightarrow \alpha_i = 0 \;\; \forall i$$ Otherwise, they are linearly dependent, and one of them is a linear combination of the others.

    For example, $(1, 0)$ and $(1, 1)$ are linearly independent, while $(1, 0)$ and $(-1, 0)$ are not.


    Basis

    A set of vectors is said to be a basis for the vector space $S$ if it is linearly independent and spans $S$.

    For example, $\{(1, 0), (1, 1)\}$ is a basis for $\mathbb{R}^2$: the two vectors are linearly independent and span $\mathbb{R}^2$.

    Properties of basis:

    1. A vector space does not have a unique basis.
    2. If $\b{x}_1, \b{x}_2, \cdots, \b{x}_k$ is a basis for $S$, then any $\b{x} \in S$ can be uniquely represented using $\b{x}_1, \b{x}_2, \cdots, \b{x}_k$.
    3. Any two bases of a vector space have the same cardinality.
    4. Let $\b{e}_i$ denote an n-dimensional vector whose i-th element is 1 and the remaining elements are 0's. Then, the set $\b{e}_1, \b{e}_2, \cdots, \b{e}_n$ forms the standard basis for $\mathbb{R}^n$.
    5. A basis for the vector space $S$ is a maximal set of linearly independent vectors in $S$.
    6. A basis for the vector space $S$ is a minimal set of vectors that spans $S$.

    Property 2 can be proved by contradiction: if some $\b{x}$ had two different representations, subtracting them would give a nontrivial linear combination of $\b{x}_1, \cdots, \b{x}_k$ equal to $\b{0}$, contradicting linear independence.

    Dimension of Vector Space

    The dimension of vector space $S$ is the cardinality of a basis of $S$.

    The dimension of $\mathbb{R}^n$ is $n$.
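
    To make these definitions concrete, here is a small numerical sketch (NumPy, with an illustrative pair of vectors): the rank of the matrix whose columns are the given vectors equals the dimension of their span, so comparing it with the number of vectors checks linear independence, and the unique coefficients from Property 2 are obtained by solving a linear system.

    ```python
    import numpy as np

    # Columns are the candidate basis vectors (an illustrative choice).
    B = np.array([[1.0, 1.0],
                  [0.0, 1.0]])   # {(1, 0), (1, 1)} written as columns

    # Linearly independent iff the rank equals the number of vectors.
    print(np.linalg.matrix_rank(B))        # 2 -> independent, so a basis of R^2

    # Unique representation of x in this basis: solve B @ alpha = x.
    x = np.array([3.0, 2.0])
    alpha = np.linalg.solve(B, x)
    print(alpha)                           # coefficients alpha_1, alpha_2
    print(np.allclose(B @ alpha, x))       # True
    ```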


    Function

    A function $f$ from a set $A$ to a set $B$ is a rule that assigns to each $x$ in $A$ a unique element $f(x)$ in $B$. This function can be represented by $$f:A\rightarrow B$$

    Note: the set $A$ is called the domain of $f$ and $B$ the codomain of $f$.


    Norm

    A norm on $\mathbb{R}^n$ is a real-valued function $\Vert \cdot \Vert: \mathbb{R}^n \rightarrow \mathbb{R}$ which obeys
    • $\Vert \b{x} \Vert \geq 0$ for every $\b{x} \in \mathbb{R}^n$, and $\Vert \b{x} \Vert = 0$ iff $\b{x} = 0$.
    • $\Vert \alpha \b{x} \Vert = |\alpha|\Vert \b{x} \Vert$ for every $\b{x} \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$.
    • $\Vert \b{x} + \b{y}\Vert \leq \Vert \b{x} \Vert + \Vert \b{y} \Vert$ for every $\b{x} \in \mathbb{R}^n$ and $\b{y} \in \mathbb{R}^n$.

    Let $\b{x} \in \mathbb{R}^n$. Some popular norms:
    • $L_1$ norm: $\Vert \b{x} \Vert_1 = \sum_{i=1}^{n} |x_i|$
    • $L_2$ (Euclidean) norm: $\Vert \b{x} \Vert_2 = (\sum_{i=1}^{n} x_i^2)^{\frac{1}{2}}$
    • $L_\infty$ norm: $\Vert \b{x} \Vert_\infty = \max_i |x_i|$

    In general, the $L_p$ norm ($p \geq 1$) is defined as $\Vert \b{x} \Vert_{p} = (\sum_{i=1}^{n} |x_i|^p)^{\frac{1}{p}}$.
    If $\Vert \cdot \Vert_p$ and $\Vert \cdot \Vert_q$ are any two norms on $\mathbb{R}^n$, then there exist positive constants $\alpha$ and $\beta$ such that $$\alpha \Vert \b{x} \Vert_p \leq \Vert \b{x} \Vert_q \leq \beta \Vert \b{x} \Vert_p$$ for any $\b{x} \in \mathbb{R}^n$

    So whether an optimization algorithm converges does not depend on which norm is used in its stopping criterion.
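
    A quick numerical illustration (a NumPy sketch with arbitrary random vectors) of the common norms and of norm equivalence; for the pair $\Vert \cdot \Vert_1$, $\Vert \cdot \Vert_2$ the standard constants give $\Vert \b{x} \Vert_2 \leq \Vert \b{x} \Vert_1 \leq \sqrt{n} \Vert \b{x} \Vert_2$:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    x = rng.standard_normal(n)

    l1   = np.linalg.norm(x, 1)       # sum of absolute values
    l2   = np.linalg.norm(x, 2)       # Euclidean norm
    linf = np.linalg.norm(x, np.inf)  # max absolute value
    print(l1, l2, linf)

    # Norm equivalence for the pair (L1, L2):
    for _ in range(1000):
        x = rng.standard_normal(n)
        assert np.linalg.norm(x, 2) <= np.linalg.norm(x, 1) <= np.sqrt(n) * np.linalg.norm(x, 2)
    ```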


    Inner Product

    Let $\b{x}, \b{y} \in \mathbb{R}^n$ and $\b{x} \neq \b{0} \neq \b{y}$. The inner or dot product is defined as $$\b{x} \cdot \b{y} \equiv \b{x}^T \b{y} = \sum_{i=1}^{n} x_i \cdot y_i = \Vert \b{x} \Vert \cdot \Vert \b{y} \Vert \cos \theta$$ where $\theta$ is the angle between $\b{x}$ and $\b{y}$.

    Note: $\b{x}^T \b{x} = \Vert \b{x} \Vert_2^2$, and since $|\cos \theta| \leq 1$, the Cauchy-Schwarz inequality $|\b{x}^T \b{y}| \leq \Vert \b{x} \Vert \Vert \b{y} \Vert$ follows.
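
    A small sketch (NumPy) that computes the dot product of two illustrative vectors and recovers the angle $\theta$ from the identity above:

    ```python
    import numpy as np

    x = np.array([2.0, 1.0])
    y = np.array([1.0, 1.0])

    dot = x @ y                                        # x^T y
    cos_theta = dot / (np.linalg.norm(x) * np.linalg.norm(y))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))   # angle in radians
    print(dot, np.degrees(theta))                      # 3.0, about 18.43 degrees
    ```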


    Orthogonality

    Let $\b{x} \in \mathbb{R}^n$ and $\b{y} \in \mathbb{R}^n$. $\b{x}$ and $\b{y}$ are said to be perpendicular or orthogonal to each other if $\b{x}^T \b{y} = 0$. Two subspaces $S$ and $T$ of the same vector space $\mathbb{R}^n$ are orthogonal if every vector $\b{x} \in S$ is orthogonal to every vector $\b{y} \in T$, i.e. $\b{x}^T \b{y} = 0 \;\; \forall \b{x} \in S, \b{y} \in T$.

    Mutual Orthogonality

    Vectors $\b{x}_1, \b{x}_2, \cdots, \b{x}_k \in \mathbb{R}^n$ are said to be mutually orthogonal if $\b{x}_i \cdot \b{x}_j = 0$ for all $i \neq j$. If, in addition, $\Vert \b{x}_i \Vert = 1$ for every $i$, the set $\{ \b{x}_1, \cdots, \b{x}_k\}$ is said to be orthonormal.

    It's easy to show that if $\b{x}_1, \cdots, \b{x}_k$ are mutually orthogonal nonzero vectors, then they are linearly independent: taking the inner product of $\sum_i \alpha_i \b{x}_i = \b{0}$ with $\b{x}_j$ gives $\alpha_j \Vert \b{x}_j \Vert^2 = 0$, hence $\alpha_j = 0$ for every $j$.

    For a square matrix $Q$ whose columns are orthonormal, $Q^T Q = Q Q^T = I$, i.e. $Q^{-1} = Q^T$; such a matrix is called an orthogonal matrix. [proof]


    Gram-Schmidt Procedure

    To produce an orthonormal basis with a given basis $\b{x}_1, \b{x}_2, \cdots, \b{x}_k \in \mathbb{R}^n$, we use Gram-Schmidt Procedure.

    Given a basis $\b{x}_1, \b{x}_2, \b{x}_3 \in \mathbb{R}^3$, an orthonormal basis $\b{y}_1, \b{y}_2, \b{y}_3$ is produced as follows: $$\b{y}_1 = \frac{\b{x}_1}{\Vert \b{x}_1 \Vert}$$ $$\b{z}_2 = \b{x}_2 - (\b{x}_2^T \b{y}_1)\b{y}_1, \;\; \b{y}_2 = \frac{\b{z}_2}{\Vert \b{z}_2 \Vert}$$ $$\b{z}_3 = \b{x}_3 - (\b{x}_3^T \b{y}_1)\b{y}_1 - (\b{x}_3^T \b{y}_2)\b{y}_2, \;\; \b{y}_3 = \frac{\b{z}_3}{\Vert \b{z}_3 \Vert}$$

    It’s easy to extend this procedure to $\mathbb{R}^n$.
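
    Below is a minimal NumPy sketch of the procedure (classical Gram-Schmidt, assuming the input vectors are linearly independent); the function name `gram_schmidt` is chosen here just for illustration:

    ```python
    import numpy as np

    def gram_schmidt(X):
        """Orthonormalize the columns of X (assumed linearly independent)."""
        n, k = X.shape
        Y = np.zeros((n, k))
        for i in range(k):
            z = X[:, i].copy()
            for j in range(i):
                z -= (X[:, i] @ Y[:, j]) * Y[:, j]   # remove component along y_j
            Y[:, i] = z / np.linalg.norm(z)          # normalize
        return Y

    X = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
    Y = gram_schmidt(X)
    print(np.allclose(Y.T @ Y, np.eye(3)))           # True: columns are orthonormal
    ```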


    Matrix


    Determinant

    From wikipedia:

    Let $A$ be an $n\times n$ matrix. There are various ways to define the determinant of a square matrix. Perhaps the most natural way is expressed in terms of the columns of the matrix. If we write an $n\times n$ matrix in terms of its column vectors $$A = [\b{a}_1, \b{a}_2, \cdots, \b{a}_n]$$ where the $\b{a}_i$ are vectors of size $n$, then the determinant of $A$ is defined so that $$\det[\b{a}_1, \cdots, b \b{a}_j + c \b{v}, \cdots, \b{a}_n] = b \det(A) + c \det[\b{a}_1, \cdots, \b{v}, \cdots, \b{a}_n]$$ $$\det[\b{a}_1, \cdots, \b{a}_j, \b{a}_{j+1}, \cdots, \b{a}_n] = -\det[\b{a}_1, \cdots, \b{a}_{j+1}, \b{a}_j, \cdots, \b{a}_n]$$ $$\det(I) = 1$$
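
    These three defining properties can be checked numerically on an arbitrary test matrix (a NumPy sketch; the scalars and the column index below are arbitrary choices):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((3, 3))
    v = rng.standard_normal(3)
    b, c, j = 2.0, -0.5, 1

    # Linearity in column j: det[..., b*a_j + c*v, ...] = b*det(A) + c*det[..., v, ...]
    A1 = A.copy(); A1[:, j] = b * A[:, j] + c * v
    A2 = A.copy(); A2[:, j] = v
    print(np.isclose(np.linalg.det(A1), b * np.linalg.det(A) + c * np.linalg.det(A2)))  # True

    # Swapping two columns flips the sign.
    A3 = A.copy(); A3[:, [1, 2]] = A3[:, [2, 1]]
    print(np.isclose(np.linalg.det(A3), -np.linalg.det(A)))  # True

    # det(I) = 1
    print(np.isclose(np.linalg.det(np.eye(3)), 1.0))  # True
    ```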

    Rank

    Let $A \in \mathbb{R}^{m\times n}$.

    The subspace of $\mathbb{R}^m$ spanned by the column vectors of $A$ is called the column space of $A$. The subspace of $\mathbb{R}^n$ spanned by the row vectors of $A$ is called the row space of $A$.
    Column Rank: The dimension of the column space.
    Row Rank: The dimension of the row space.
    The column rank always equals the row rank; this common value is called the rank of $A$, and it satisfies $rank(A) \leq \min(m, n)$.

    Invertible

    A square matrix $A$ is said to be invertible if there exists a matrix $B$ such that $AB = BA = I$. There is at most one such $B$; when it exists, it is denoted by $A^{-1}$.

    Matrix-vector Multiplication

    The effect of matrix-vector multiplication (here, matrix is square matrix) is rotating and/or scaling the vector. For example, $\begin{pmatrix} 3 & 2 \\ 2 & 0 \end{pmatrix} \times \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 5 \\ 2 \end{pmatrix}$

    Sometimes matrix-vector multiplication only scales the vector without changing the direction (changing to opposite direction is allowed), and this leads to the important concept of eigenvector and eigenvalue. For example, $\begin{pmatrix} 3 & 2 \\ 2 & 0 \end{pmatrix} \times \begin{pmatrix} 2 \\ 1 \end{pmatrix} = \begin{pmatrix} 8 \\ 4 \end{pmatrix} = 4 \times \begin{pmatrix} 2 \\ 1 \end{pmatrix}$


    Eigenvalues and Eigenvectors

    Let $A \in \mathbb{R}^{n\times n}$. The eigenvalues and eigenvectors of $A$ are the real or complex scalars $\lambda$ and n-dimensional vectors $\b{x}$ such that $$A\b{x} = \lambda \b{x} \;\; \b{x} \neq \b{0}$$
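
    The example from the previous section can be verified with NumPy (a sketch; note that `np.linalg.eig` may return the eigenvalue/eigenvector pairs in any order):

    ```python
    import numpy as np

    A = np.array([[3.0, 2.0],
                  [2.0, 0.0]])

    eigvals, eigvecs = np.linalg.eig(A)      # columns of eigvecs are eigenvectors
    print(eigvals)                           # 4 and -1 (in some order)

    # Verify A x = lambda x for each pair.
    for lam, x in zip(eigvals, eigvecs.T):
        print(np.allclose(A @ x, lam * x))   # True, True
    ```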

    Symmetric Matrix

    Let $A \in \mathbb{R}^{n\times n}$. The matrix $A$ is said to be symmetric if $A^T = A$.

    Let $A \in \mathbb{R}^{n\times n}$ be symmetric, then all eigenvalues of $A$ are real, eigenvectors corresponding to distinct eigenvalues are orthogonal, and $A$ can be diagonalized as $A = Q \Lambda Q^T$, where $Q$ is an orthogonal matrix whose columns are eigenvectors of $A$ and $\Lambda$ is the diagonal matrix of the corresponding eigenvalues (the spectral theorem).

    Note that being diagonalizable is not the same as being invertible. For example, $\bigl(\begin{smallmatrix} 0 & 0 \\\\ 0 & 0 \end{smallmatrix}\bigr)$ is diagonalizable (it is already diagonal), but it is clearly not invertible.

    Quadratic Form

    Let $A \in \mathbb{R}^{n\times n}$ be a symmetric matrix and consider $f(\b{x}) = \b{x}^T A \b{x}$, a pure quadratic form. (Note that $\b{x}^T A \b{x} = \sum_i \sum_j A_{ij} x_i x_j$, so every pure quadratic form can be expressed as $\b{x}^T A \b{x}$ with $A$ symmetric.)

    | $A$ is said to be | if |
    | --- | --- |
    | positive definite | $\b{x}^T A \b{x} \gt 0$ for every nonzero $\b{x} \in \mathbb{R}^n$ |
    | positive semi-definite | $\b{x}^T A \b{x} \geq 0$ for every $\b{x} \in \mathbb{R}^n$ |
    | negative definite | $\b{x}^T A \b{x} \lt 0$ for every nonzero $\b{x} \in \mathbb{R}^n$ |
    | negative semi-definite | $\b{x}^T A \b{x} \leq 0$ for every $\b{x} \in \mathbb{R}^n$ |
    | indefinite | $\b{x}^T A \b{x}$ takes both positive and negative values |
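
    For a symmetric matrix, definiteness can be read off from the signs of its eigenvalues (all positive: positive definite; all nonnegative: positive semi-definite; and so on). A minimal sketch in NumPy, with an illustrative helper name `definiteness`:

    ```python
    import numpy as np

    def definiteness(A, tol=1e-10):
        """Classify a symmetric matrix by the signs of its eigenvalues."""
        w = np.linalg.eigvalsh(A)          # eigenvalues of a symmetric matrix (all real)
        if np.all(w > tol):   return "positive definite"
        if np.all(w >= -tol): return "positive semi-definite"
        if np.all(w < -tol):  return "negative definite"
        if np.all(w <= tol):  return "negative semi-definite"
        return "indefinite"

    print(definiteness(np.array([[2.0, 0.0], [0.0, 3.0]])))   # positive definite
    print(definiteness(np.array([[1.0, 0.0], [0.0, 0.0]])))   # positive semi-definite
    print(definiteness(np.array([[3.0, 2.0], [2.0, 0.0]])))   # indefinite (eigenvalues 4 and -1)
    ```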

    Solution of Ax=b

    Let $A \in \mathbb{R}^{n\times n}$ be symmetric and positive definite. Then all eigenvalues of $A$ are positive, so $A$ is invertible and the system $A\b{x} = \b{b}$ has a unique solution $\b{x} = A^{-1}\b{b}$ for every $\b{b} \in \mathbb{R}^n$.
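
    In practice such a system is usually solved through a factorization rather than by forming $A^{-1}$ explicitly; one common choice for a symmetric positive definite $A$ is the Cholesky factorization $A = LL^T$. A minimal sketch (NumPy/SciPy, with illustrative data):

    ```python
    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    A = np.array([[4.0, 1.0],
                  [1.0, 3.0]])     # symmetric positive definite
    b = np.array([1.0, 2.0])

    L = cholesky(A, lower=True)               # A = L @ L.T
    y = solve_triangular(L, b, lower=True)    # forward substitution: L y = b
    x = solve_triangular(L.T, y, lower=False) # back substitution:    L^T x = y

    print(x)
    print(np.allclose(A @ x, b))              # True
    ```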


    Interval

    Let $a, b \in \mathbb{R}$. The closed interval $[a, b]$ denotes the set, $\{ x \in \mathbb{R}: a \leq x \leq b \}$. The set $\{ x \in \mathbb{R}: a \lt x \lt b\}$ represents the open interval $(a, b)$.

    Norm ball

    Let $\b{x}_0 \in \mathbb{R}^n$. A norm ball of radius $r > 0$ and centre $\b{x}_0$ is given by $\{\b{x} \in \mathbb{R}^n: \Vert \b{x} - \b{x}_0 \Vert \leq r \}$ and will be denoted by $B[\b{x}_0, r]$.

    We will use $B(\b{x}_0, r)$ to denote $\{\b{x} \in \mathbb{R}^n: \Vert \b{x} - \b{x}_0 \Vert \lt r \}$.


    Open/Closed set

    Let $\b{x} \in S \subseteq \mathbb{R}^n$. $\b{x}$ is called an interior point of $S$ if there exists $r \gt 0$ such that $B[\b{x}, r] \subseteq S$. The set of all such points interior to $S$ is called the interior of $S$ and is denoted by $int(S)$.
    Let $S \subset \mathbb{R}^n$. $\b{x} \in \mathbb{R}^n$ belongs to the closure of $S$, $cl(S)$, if for each $\varepsilon \gt 0$, $S \cap B[\b{x}, \varepsilon] \neq \emptyset$.

    Example:

    Let $S = (1, 2] \cup [3, 4)$. Then $cl(S) = [1, 2] \cup [3, 4]$ and $int(S) = (1, 2) \cup (3, 4)$.

    A set $S \subseteq \mathbb{R}^n$ is said to be an open set if $S = int(S)$.
    A set $S \subseteq \mathbb{R}^n$ is said to be a closed set if $S = cl(S)$.

    Example:

    $(1, 2)$, $B(\b{x}_0, r)$, and $\mathbb{R}^n$ are open; $[1, 2]$, $B[\b{x}_0, r]$, and $\mathbb{R}^n$ are closed (so $\mathbb{R}^n$ is both open and closed); $(1, 2]$ is neither open nor closed.


    Bounded and compact

    The boundary of a set S is defined as $bd(S) = cl(S)\setminus int(S)$.
    A set $S \subset \mathbb{R}^n$ is said to be bounded if there exist $R \; (0 \lt R \lt \infty)$ and $\b{x} \in \mathbb{R}^n$, such that $S \subset B(\b{x}, R)$.

    Example:

    $(0, 1)$ and $B[\b{x}_0, r]$ are bounded; $\mathbb{R}^n$ and $\{(x_1, x_2): x_1 \geq 0\}$ are not bounded.

    A set $S \subset \mathbb{R}^n$ is said to be compact if it's closed and bounded.

    Example:

    $[1, 2]$ and $B[\b{x}_0, r]$ are compact; $(1, 2]$ is not compact (not closed) and $[0, \infty)$ is not compact (not bounded).


    Sequences

    Let $S \subseteq \mathbb{R}^n$ and $\{ \b{x}^k \}$ be a sequence of points belonging to $S$.

    A sequence $\{ \b{x}^k \}$ converges to $\b{x}^{*}$, if for any given $\varepsilon \gt 0$, there is a positive integer $K$ such that $$ \Vert \b{x}^k - \b{x}^* \Vert \leq \varepsilon \;\; \forall k \geq K$$ We write this as $\b{x}^k \rightarrow \b{x}^*$ or $\lim_{k \rightarrow \infty} \b{x}^k = \b{x}^*$.

    Continuous Function

    Let $S \subseteq \mathbb{R}^n$. A function $f: S\rightarrow \mathbb{R}$ is said to be continuous at $\bar{\b{x}} \in S$ if for any given $\varepsilon \gt 0$ there exists a $\delta \gt 0$ such that $\b{x} \in S$ and $\Vert \b{x} - \bar{\b{x}} \Vert \lt \delta$ implies that $|f(\b{x}) - f(\bar{\b{x}})| \lt \varepsilon$.

    Example:

    $f(\b{x}) = \Vert \b{x} \Vert$ is continuous at every $\b{x} \in \mathbb{R}^n$ (this follows from the triangle inequality), while the step function $f(x) = 0$ for $x \lt 0$, $f(x) = 1$ for $x \geq 0$ is not continuous at $x = 0$.


    Lipschitz Continuous

    Let $S \subseteq \mathbb{R}^n$. A function $f: S \rightarrow \mathbb{R}^m$ is said to be Lipschitz continuous if $$ \Vert f(\b{x}_1) - f(\b{x}_2) \Vert \leq L \Vert \b{x}_1 - \b{x}_2 \Vert \;\; \forall \b{x}_1, \b{x}_2 \in S$$ where $L$ is a positive constant (the Lipschitz constant).

    So Lipschitz continuity bounds how much the function values at any two points can differ: the difference is at most $L$ times the distance between the points.


    Gradient

    A continuous function, $f: \mathbb{R}^n \rightarrow \mathbb{R}$, is said to be continuously differentiable at $\b{x} \in \mathbb{R}^n$, if $\frac{\p f}{\p x_i}(\b{x})$ exists and is continuous, $i = 1, 2, \cdots, n$.
    We define the gradient of $f$ at $\b{x}$ to be the vector $$g(\b{x}) = \nabla f(\b{x}) = (\frac{\p f(\b{x})}{\p x_1}, \frac{\p f(\b{x})}{\p x_2}, \cdots, \frac{\p f(\b{x})}{\p x_n})^T$$

    Directional Derivatives

    Let $S \subset \mathbb{R}^n$ be an open set and $f: S \rightarrow \mathbb{R}$, $f$ continuously differentiable in $S$. Then, for any $\b{x} \in S$ and any nonzero $\b{d} \in \mathbb{R}^n$, the directional derivative of $f$ at $\b{x}$ in the direction of $\b{d}$, defined by $$\frac{\p f}{\p \b{d}}(\b{x}) = \lim_{\varepsilon \rightarrow 0} \frac{f(\b{x} + \varepsilon \b{d}) - f(\b{x})}{\varepsilon}$$ exists and equals $\nabla f(\b{x})^T \b{d}$.

    For any unit direction $\b{d}$, the rate of change is $\nabla f(\b{x})^T \b{d} = \Vert \nabla f(\b{x})\Vert \Vert \b{d} \Vert \cos\theta = \Vert \nabla f(\b{x}) \Vert \cos\theta$, where $\theta$ is the angle between $\b{d}$ and $\nabla f(\b{x})$. The rate of change is maximized when $\theta = 0$ and minimized when $\theta = \pi$, so $\nabla f(\b{x})$ is the direction of steepest ascent and $-\nabla f(\b{x})$ is the direction of steepest descent.
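
    A numerical sketch (NumPy) using the illustrative function $f(\b{x}) = x_1^2 + 3 x_1 x_2$: the finite-difference quotient approaches $\nabla f(\b{x})^T \b{d}$ as $\varepsilon \rightarrow 0$.

    ```python
    import numpy as np

    def f(x):
        return x[0]**2 + 3.0 * x[0] * x[1]

    def grad_f(x):
        # Analytic gradient of the illustrative f above.
        return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

    x = np.array([1.0, 2.0])
    d = np.array([1.0, 1.0]) / np.sqrt(2.0)      # unit direction

    eps = 1e-6
    fd = (f(x + eps * d) - f(x)) / eps           # finite-difference directional derivative
    print(fd, grad_f(x) @ d)                     # both close to 11 / sqrt(2)
    ```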


    Hessian

    A continuously differentiable function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is said to be twice continuously differentiable at $\b{x} \in \mathbb{R}^n$, if $\frac{\p^2 f}{\p x_i \p x_j} (\b{x})$ exists and is continuous for all $i, j = 1, 2, \cdots, n$.
    Let $f \in \mathcal{C}^2$. We denote the Hessian of $f$ at $\b{x}$ to be the matrix $$ H(x) = \nabla^2 f(\b{x}) = \begin{pmatrix} \frac{\p^2 f}{\p x_1^2} & \frac{\p^2 f}{\p x_1 \p x_2} & \cdots & \frac{\p^2 f}{\p x_1 \p x_n} \\\\ \frac{\p^2 f}{\p x_2 \p x_1} & \frac{\p^2 f}{\p x_2^2} & \cdots & \frac{\p^2 f}{\p x_2 \p x_n} \\\\ \vdots & \vdots & \ddots & \vdots \\\\ \frac{\p^2 f}{\p x_n \p x_1} & \frac{\p^2 f}{\p x_n \p x_2} & \cdots & \frac{\p^2 f}{\p x_n^2} \end{pmatrix} $$
    Let $S \subset \mathbb{R}^n$ be an open set and $f: S \rightarrow \mathbb{R}$, $f$ twice continuously differentiable in $S$. Then, for any $\b{x} \in S$ and any nonzero $\b{d} \in \mathbb{R}^n$, the second directional derivative of $f$ at $\b{x}$ in the direction of $\b{d}$ equals $\b{d}^T \nabla^2 f(\b{x}) \b{d}$.
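
    Continuing with the same illustrative $f(\b{x}) = x_1^2 + 3 x_1 x_2$ used above, a sketch that checks $\b{d}^T \nabla^2 f(\b{x}) \b{d}$ against a second-order central difference along $\b{d}$:

    ```python
    import numpy as np

    def f(x):
        return x[0]**2 + 3.0 * x[0] * x[1]

    # Analytic Hessian of the illustrative f: constant, since f is quadratic.
    H = np.array([[2.0, 3.0],
                  [3.0, 0.0]])

    x = np.array([1.0, 2.0])
    d = np.array([1.0, 1.0]) / np.sqrt(2.0)

    eps = 1e-4
    # Central second difference along d approximates d^T H d.
    fd2 = (f(x + eps * d) - 2.0 * f(x) + f(x - eps * d)) / eps**2
    print(fd2, d @ H @ d)          # both close to 4
    ```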

    Taylor Series

    $\mathcal{C}^\infty$: Class of all functions for which the derivative of any order is continuous.

    Let $f: \mathbb{R} \rightarrow \mathbb{R}$, $f \in \mathcal{C}^\infty$.

    Let $x^0$ be the point about which we write the Taylor series.

    Suppose we use only $f'(x^0)$. Then $f(x)$ near $x^0$ can be approximated by the linear (first-order) approximation $$f(x) \approx f(x^0) + f'(x^0)(x - x^0)$$

    Similarly, using $f'(x^0)$ and $f''(x^0)$, the quadratic (second-order) approximation of $f$ at $x^0$ is $$f(x) \approx f(x^0) + f'(x^0)(x - x^0) + \frac{1}{2}f''(x^0)(x - x^0)^2$$
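
    A small sketch (NumPy) comparing the linear and quadratic approximations of the illustrative function $f(x) = e^x$ about $x^0 = 0$:

    ```python
    import numpy as np

    x0 = 0.0
    f   = np.exp          # f(x)   = e^x
    df  = np.exp(x0)      # f'(x0)  = 1
    d2f = np.exp(x0)      # f''(x0) = 1

    for x in [0.1, 0.5, 1.0]:
        linear    = f(x0) + df * (x - x0)
        quadratic = linear + 0.5 * d2f * (x - x0)**2
        print(x, f(x), linear, quadratic)   # the quadratic model is closer to f(x)
    ```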


    Truncated Taylor Series

    In optimization we usually truncate the Taylor series after the first- or second-order term. If $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is continuously differentiable, then for any $\b{x}, \b{d} \in \mathbb{R}^n$ $$f(\b{x} + \b{d}) = f(\b{x}) + \nabla f(\b{x} + t\b{d})^T \b{d}$$ for some $t \in (0, 1)$. If $f$ is twice continuously differentiable, then $$f(\b{x} + \b{d}) = f(\b{x}) + \nabla f(\b{x})^T \b{d} + \frac{1}{2} \b{d}^T \nabla^2 f(\b{x} + t\b{d}) \b{d}$$ for some $t \in (0, 1)$.

    Matrix Derivative

    Below, a lowercase letter $x$ denotes a scalar, a bold lowercase letter $\b{x}$ denotes a vector, and an uppercase letter $X$ denotes a matrix.

    A derivative taken with respect to a matrix or vector can be laid out in different ways. For example, the derivative of a vector $\b{y} \in \mathbb{R}^m$ with respect to another vector $\b{x} \in \mathbb{R}^n$ can be written either as an $m\times n$ matrix or as an $n\times m$ matrix; the former convention is called numerator layout and the latter denominator layout. The main differences between the two layouts are shown below.

    | | Numerator-layout | Denominator-layout |
    | --- | --- | --- |
    | $\frac{\p y}{\p \b{x}}$ (scalar by vector) | $\begin{pmatrix} \frac{\p y}{\p x_1} & \frac{\p y}{\p x_2} & \cdots & \frac{\p y}{\p x_n} \end{pmatrix}$ | $\begin{pmatrix} \frac{\p y}{\p x_1} \\\\[-1em] \frac{\p y}{\p x_2} \\\\[-1em] \vdots \\\\[-1em] \frac{\p y}{\p x_n} \end{pmatrix}$ |
    | $\frac{\p \b{y}}{\p x}$ (vector by scalar) | $\begin{pmatrix} \frac{\p y_1}{\p x} \\\\[-1em] \frac{\p y_2}{\p x} \\\\[-1em] \vdots \\\\[-1em] \frac{\p y_m}{\p x} \end{pmatrix}$ | $\begin{pmatrix} \frac{\p y_1}{\p x} & \frac{\p y_2}{\p x} & \cdots & \frac{\p y_m}{\p x} \end{pmatrix}$ |
    | $\frac{\p \b{y}}{\p \b{x}}$ (vector by vector) | $\begin{pmatrix} \frac{\p y_1}{\p x_1} & \frac{\p y_1}{\p x_2} & \cdots & \frac{\p y_1}{\p x_n} \\\\[-1em] \frac{\p y_2}{\p x_1} & \frac{\p y_2}{\p x_2} & \cdots & \frac{\p y_2}{\p x_n} \\\\[-1em] \vdots & \vdots & \ddots & \vdots \\\\[-1em] \frac{\p y_m}{\p x_1} & \frac{\p y_m}{\p x_2} & \cdots & \frac{\p y_m}{\p x_n} \end{pmatrix}$ | $\begin{pmatrix} \frac{\p y_1}{\p x_1} & \frac{\p y_2}{\p x_1} & \cdots & \frac{\p y_m}{\p x_1} \\\\[-1em] \frac{\p y_1}{\p x_2} & \frac{\p y_2}{\p x_2} & \cdots & \frac{\p y_m}{\p x_2} \\\\[-1em] \vdots & \vdots & \ddots & \vdots \\\\[-1em] \frac{\p y_1}{\p x_n} & \frac{\p y_2}{\p x_n} & \cdots & \frac{\p y_m}{\p x_n} \end{pmatrix}$ |

    So the difference between the two layouts is that in numerator layout the rows are indexed by the numerator, while in denominator layout the rows are indexed by the denominator.

    The layout I encounter most often is the denominator layout, so matrix derivatives below are written in denominator layout.
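
    As a concrete check of the two conventions, consider the linear map $\b{y} = A\b{x}$ with $A \in \mathbb{R}^{m\times n}$: in numerator layout $\frac{\p \b{y}}{\p \b{x}} = A$ (an $m\times n$ matrix), while in denominator layout it is $A^T$ (an $n\times m$ matrix). A NumPy sketch with a finite-difference Jacobian (illustrative data):

    ```python
    import numpy as np

    m, n = 2, 3
    A = np.arange(1.0, 1.0 + m * n).reshape(m, n)   # illustrative 2x3 matrix
    y = lambda x: A @ x                              # y = A x

    x0 = np.zeros(n)
    eps = 1e-6

    # Numerator-layout Jacobian: row i holds the partials of y_i w.r.t. x_1..x_n.
    J_num = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        J_num[:, j] = (y(x0 + e) - y(x0)) / eps

    print(np.allclose(J_num, A))       # True: numerator layout gives A (m x n)
    print(np.allclose(J_num.T, A.T))   # denominator layout is the transpose, A^T (n x m)
    ```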

    The columns of a matrix $A \in \mathbb{R}^{m\times n}$ are denoted $a_1$ through $a_n$, while the rows are $\tilde{a}_1^T$ through $\tilde{a}_m^T$.

    Reference: berkeley.edu, okstate.edu