Hessian Matrix of Convex Functions
Introduction
The Hessian matrix is useful for determining whether a function is convex. Specifically, a twice differentiable function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if and only if its Hessian matrix $\nabla^2 f(x)$ is positive semi-definite for all $x \in \mathbb{R}^n$. Conversely, if we can find an $x \in \mathbb{R}^n$ such that $\nabla^2 f(x)$ is not positive semi-definite, then $f$ is not convex.
In this blog post, I would like to present a self-contained discussion and proof of this property.
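To make the test concrete before diving into the proof, here is a minimal numerical sketch in Python with NumPy. It estimates the Hessian of an example function (the well-known convex log-sum-exp function, chosen purely for illustration) by central finite differences and inspects its smallest eigenvalue at a few sampled points; `numerical_hessian` and `log_sum_exp` are ad hoc helpers written for this sketch, not library APIs.
```python
# A minimal numerical sketch: estimate the Hessian of a scalar function by central
# finite differences and check whether its smallest eigenvalue is nonnegative at a
# few sampled points. `numerical_hessian` and `log_sum_exp` are illustrative helpers.
import numpy as np


def numerical_hessian(f, x, eps=1e-4):
    """Central-difference estimate of the Hessian of f at x."""
    n = x.size
    hessian = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_j = np.zeros(n)
            e_i[i] = eps
            e_j[j] = eps
            hessian[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                             - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return hessian


def log_sum_exp(x):
    # A well-known convex function, used here purely as an example.
    return np.log(np.sum(np.exp(x)))


rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=3)
    min_eigenvalue = np.linalg.eigvalsh(numerical_hessian(log_sum_exp, x)).min()
    # The smallest eigenvalue stays nonnegative (up to finite-difference error),
    # which is consistent with log-sum-exp being convex.
    print(f"x = {np.round(x, 3)}, smallest Hessian eigenvalue = {min_eigenvalue:.2e}")
```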
Convex Function Definitions
Here are the definitions of a function being convex, strictly convex, and strongly convex. Strong convexity implies strict convexity, and strict convexity implies convexity.
Convex Function
A function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if its domain $X$ is a convex set and for any $x_1, x_2 \in X$, for all $\lambda \in [0, 1]$, we have
$$
f(\lambda x_1 + (1-\lambda) x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2)
$$
Any local minimum of $f$ is also a global minimum of $f$.
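As a quick numerical sanity check of this definition, the following sketch evaluates both sides of the inequality for the example function $f(x) = \lVert x \rVert^2$ (an illustrative choice) at random points.
```python
# A small numerical sanity check of the convexity inequality for the example
# function f(x) = ||x||^2 at randomly sampled points.
import numpy as np


def f(x):
    return np.dot(x, x)  # squared Euclidean norm, a convex function


rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=4), rng.normal(size=4)
for lam in np.linspace(0.0, 1.0, 6):
    lhs = f(lam * x1 + (1.0 - lam) * x2)
    rhs = lam * f(x1) + (1.0 - lam) * f(x2)
    assert lhs <= rhs + 1e-12, "convexity inequality violated"
    print(f"lambda = {lam:.1f}: {lhs:.4f} <= {rhs:.4f}")
```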
Strictly Convex Function
A function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is strictly convex if its domain $X$ is a convex set and for any $x_1, x_2 \in X$ and $x_1 \neq x_2$, for all $\lambda \in (0, 1)$, we have
$$
f(\lambda x_1 + (1-\lambda) x_2) < \lambda f(x_1) + (1-\lambda) f(x_2)
$$
Any local minimum of $f$ is also a global minimum of $f$, and there is at most one local minimum.
Strongly Convex Function
A function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is strongly convex if there exists $\alpha > 0$ such that $f(x) - \alpha \lVert x \rVert^2$ is convex.
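For example, $f(x) = \lVert x \rVert^2$ is strongly convex, because with $\alpha = \frac{1}{2}$ the function $f(x) - \alpha \lVert x \rVert^2 = \frac{1}{2} \lVert x \rVert^2$ is convex. In contrast, a linear function $f(x) = a^{\top} x + b$ is convex but not strongly convex, because $f(x) - \alpha \lVert x \rVert^2$ is not convex for any $\alpha > 0$.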
Prerequisites
Directional Derivative
Given the function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, the directional derivative at point $x$ in direction $h$ is defined as
$$
\nabla_{h} f(x) = \lim_{t \rightarrow 0} \frac{f(x + th) - f(x)}{t}
$$
In addition, if the function is differentiable at $x$, we will also define the gradient at $x$, $\nabla f(x)$, such that
$$
\begin{align}
\nabla_{h} f(x)
&= \nabla f(x)^{\top} h\\
\end{align}
$$
There are multiple definitions of the directional derivative. For example, some people define the directional derivative with respect to an arbitrary nonzero vector $h$ after normalization, so that it is independent of the magnitude of $h$ and depends only on its direction.
$$
\nabla_{h} f(x) = \lim_{t \rightarrow 0} \frac{f(x + th) - f(x)}{t \lVert h \rVert}
$$
In this case, the defining relation between the directional derivative and the gradient at $x$ becomes the following.
$$
\begin{align}
\nabla_{h} f(x)
&= \nabla f(x)^{\top} \frac{h}{\lVert h \rVert} \\
\end{align}
$$
The values of the directional derivative at the same point under the two definitions are, of course, different: they differ by a factor of the norm of $h$. However, even though the directional derivative values differ, the gradient $\nabla f(x)$ at the same point is the same under both definitions.
In this article, we will use the first set of definitions for directional derivative and gradient.
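One way to see the relation $\nabla_{h} f(x) = \nabla f(x)^{\top} h$ numerically is the following sketch, which compares the difference quotient with the gradient-based expression for a hand-coded example function and a non-unit direction $h$ (both are arbitrary illustrative choices).
```python
# The difference quotient (f(x + t * h) - f(x)) / t approaches grad_f(x)^T h as t
# shrinks. The example function and its gradient below are hand-coded illustrations.
import numpy as np


def f(x):
    return np.sum(x**2) + np.sin(x[0])


def grad_f(x):
    g = 2.0 * x
    g[0] += np.cos(x[0])
    return g


x = np.array([0.3, -1.2, 0.7])
h = np.array([1.0, 2.0, -0.5])  # note that h is not a unit vector
for t in (1e-1, 1e-3, 1e-5):
    quotient = (f(x + t * h) - f(x)) / t
    print(f"t = {t:.0e}: difference quotient = {quotient:.6f}, "
          f"grad_f(x)^T h = {grad_f(x) @ h:.6f}")
```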
Convex Function Additions
If functions $f, g: \mathbb{R}^n \rightarrow \mathbb{R}$ are both convex, then $f+g$ is also convex.
This can be easily proved using the definition of convex functions.
Proof
Because $f$ and $g$ are both convex functions, for any $x_1, x_2$ in their domain and any $\lambda \in [0, 1]$, we have
$$
\begin{align}
(f+g)(\lambda x_1 + (1-\lambda) x_2) &= f(\lambda x_1 + (1-\lambda) x_2) + g(\lambda x_1 + (1-\lambda) x_2) \\
&\leq \lambda f(x_1) + (1-\lambda) f(x_2) + \lambda g(x_1) + (1-\lambda) g(x_2) \\
&= \lambda (f+g)(x_1) + (1-\lambda) (f+g)(x_2) \\
\end{align}
$$
Therefore,
$$
(f+g)(\lambda x_1 + (1-\lambda) x_2) \leq \lambda (f+g)(x_1) + (1-\lambda) (f+g)(x_2)
$$
$f+g$ must be convex.
This concludes the proof.
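As a quick numerical illustration of this property, the following sketch checks the convexity inequality for the sum of two example convex functions, $f(x) = \lVert x \rVert^2$ and $g(x) = \lVert x \rVert_1$ (illustrative choices).
```python
# The convexity inequality also holds for the sum of two example convex functions.
import numpy as np

f = lambda x: np.dot(x, x)      # ||x||^2, convex
g = lambda x: np.abs(x).sum()   # ||x||_1, convex
f_plus_g = lambda x: f(x) + g(x)

rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=3), rng.normal(size=3)
for lam in np.linspace(0.0, 1.0, 5):
    lhs = f_plus_g(lam * x1 + (1.0 - lam) * x2)
    rhs = lam * f_plus_g(x1) + (1.0 - lam) * f_plus_g(x2)
    assert lhs <= rhs + 1e-12
print("The convexity inequality holds for f + g at all sampled points.")
```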
Linear Functions are Convex
If the function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is linear (affine), then $f$ is convex.
Proof
Suppose $f(x) = a^{\top} x + b$ for some $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$.
$$
\begin{align}
f(\lambda x_1 + (1-\lambda) x_2) &= a^{\top} (\lambda x_1 + (1-\lambda) x_2) + b \\
&= a^{\top} \lambda x_1 + \lambda b + a^{\top} (1-\lambda) x_2 + b -\lambda b\\
&= \lambda (a^{\top} x_1 + b) + (1-\lambda) (a^{\top} x_2 + b) \\
&= \lambda f(x_1) + (1-\lambda) f(x_2) \\
\end{align}
$$
Therefore,
$$
f(\lambda x_1 + (1-\lambda) x_2) = \lambda f(x_1) + (1-\lambda) f(x_2)
$$
$f$ must be convex.
This concludes the proof.
Convex Function Sufficient and Necessary Condition
Assume that the function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is differentiable. Then $f$ is convex if and only if for every $x, y \in \mathbb{R}^n$, the inequality
$$
f(y) \geq f(x) + \nabla f(x)^{\top} (y - x)
$$
is satisfied.
This can be proved using the definition of convex functions.
Proof
Assume $f$ is convex and we have $x, y \in \mathbb{R}^n$ with $x \neq y$ (the case $x = y$ is trivial). By the definition of convex functions, we have
$$
f\Big( \frac{1}{2}x + \frac{1}{2}y \Big) \leq \frac{1}{2} f(x) + \frac{1}{2} f(y)
$$
We denote $h = y - x$. Then the inequality becomes
$$
f\Big( x + \frac{1}{2}h \Big) \leq \frac{1}{2} f(x) + \frac{1}{2} f(x + h)
$$
We can subtract $f(x)$ from both sides of the inequality without breaking it.
$$
f\Big( x + \frac{1}{2}h \Big) - f(x) \leq \frac{1}{2} f(x + h) - \frac{1}{2} f(x)
$$
and thus
$$
f(x + h) - f(x) \geq \frac{f(x + 2^{-1}h) - f(x)}{2^{-1}}
$$
Because this inequality holds for an arbitrary $h$, we can replace $h$ with $2^{1-k}h$ for any positive integer $k$ and equivalently obtain
$$
f(x + 2^{1-k}h) - f(x) \geq \frac{f(x + 2^{-k}h) - f(x)}{2^{-1}}
$$
Therefore,
$$
\begin{align}
f(x + h) - f(x) &\geq \frac{f(x + 2^{-1}h) - f(x)}{2^{-1}} \\
&\geq \frac{f(x + 2^{-2}h) - f(x)}{2^{-2}} \\
&\vdots \\
&\geq \frac{f(x + 2^{-k}h) - f(x)}{2^{-k}} \\
\end{align}
$$
Note that according to the definition of directional derivative,
$$
\lim_{k \rightarrow \infty} \frac{f(x + 2^{-k}h) - f(x)}{2^{-k}} = \nabla_{h} f(x) = \nabla f(x)^{\top} h
$$
Therefore,
$$
\begin{align}
f(x + h) - f(x) &\geq \nabla f(x)^{\top} h
\end{align}
$$
We replace $h$ with $y - x$ and get
$$
\begin{align}
f(y) - f(x) &\geq \nabla f(x)^{\top} (y-x)
\end{align}
$$
Therefore,
$$
\begin{align}
f(y) &\geq f(x) + \nabla f(x)^{\top} (y-x)
\end{align}
$$
Now assume that for every $x, y \in \mathbb{R}^n$, we have
$$
f(y) \geq f(x) + \nabla f(x)^{\top} (y - x)
$$
For any arbitrary $w, v \in \mathbb{R}^n$, and $0 \leq \lambda \leq 1$, we denote
$$
x = \lambda w + (1 - \lambda) v
$$
Then according to the inequality assumption, we have
$$
\begin{align}
f(w) &\geq f(x) + \nabla f(x)^{\top} (w - x) \\
f(v) &\geq f(x) + \nabla f(x)^{\top} (v - x) \\
\end{align}
$$
Note that
$$
\begin{gather}
v - x = \lambda(v - w) \\
w - x = (1 - \lambda) (w - v) \\
\end{gather}
$$
So we have the following inequality:
$$
\begin{align}
\lambda f(w) + (1 - \lambda) f(v) &\geq \lambda \Big( f(x) + \nabla f(x)^{\top} (w - x) \Big) + (1 - \lambda) \Big( f(x) + \nabla f(x)^{\top} (v - x) \Big) \\
&= f(x) + \nabla f(x)^{\top} \Big( \lambda (w - x) + (1 - \lambda)(v - x) \Big) \\
&= f(x) + \nabla f(x)^{\top} \Big( \lambda (1 - \lambda) (w - v) + (1 - \lambda)\lambda(v - w) \Big) \\
&= f(x) \\
&= f(\lambda w + (1 - \lambda) v) \\
\end{align}
$$
Because $w$, $v$, and $\lambda$ are arbitrary, by definition, $f$ is a convex function.
This concludes the proof.
Similarly, for a strictly convex function, the inequality becomes, for every $x \neq y$,
$$
f(y) > f(x) + \nabla f(x)^{\top} (y - x)
$$
and it can be proved in a similar fashion as for convex functions.
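The first-order characterization is also easy to verify numerically. The following sketch checks $f(y) \geq f(x) + \nabla f(x)^{\top} (y - x)$ at random point pairs for the convex log-sum-exp example, whose gradient is the softmax function.
```python
# Check f(y) >= f(x) + grad_f(x)^T (y - x) at random point pairs for the convex
# log-sum-exp example function, whose gradient is the softmax function.
import numpy as np


def f(x):
    return np.log(np.sum(np.exp(x)))


def grad_f(x):
    e = np.exp(x)
    return e / e.sum()


rng = np.random.default_rng(3)
for _ in range(5):
    x, y = rng.normal(size=4), rng.normal(size=4)
    lhs = f(y)
    rhs = f(x) + grad_f(x) @ (y - x)
    assert lhs >= rhs - 1e-12
    print(f"f(y) = {lhs:.4f} >= f(x) + grad_f(x)^T (y - x) = {rhs:.4f}")
```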
Taylor Theorem
The univariate and multivariate Taylor theorem has been discussed and proved in detail in my previous blog post “Taylor Theorem”.
For the quadratic multivariate Taylor theorem, let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be twice continuously differentiable. Using the Lagrange form of the remainder, for any $a, x \in \mathbb{R}^n$ there exists a $t \in (0, 1)$ such that
$$
\begin{align}
f(x) &= f(a) + \nabla f(a)^{\top} (x-a) + \frac{1}{2} (x-a)^{\top} \nabla^2 f(a + t(x - a)) (x-a) \\
\end{align}
$$
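For a quadratic function the Hessian is constant, so the expansion above holds exactly regardless of $t$. The following sketch verifies this numerically for an arbitrarily chosen symmetric matrix $A$ and vector $b$.
```python
# For a quadratic example f(x) = 0.5 * x^T A x + b^T x, the Hessian is the constant
# matrix A, so the second-order expansion with the remainder term is exact.
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])  # an arbitrary symmetric matrix for the example
b = np.array([1.0, -2.0])


def f(x):
    return 0.5 * x @ A @ x + b @ x


def grad_f(x):
    return A @ x + b


a = np.array([0.3, -0.7])
x = np.array([1.5, 2.0])
expansion = f(a) + grad_f(a) @ (x - a) + 0.5 * (x - a) @ A @ (x - a)
print(f"f(x) = {f(x):.6f}, second-order expansion = {expansion:.6f}")  # identical
```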
First-Order Necessary Conditions for Local Minimizer
If the function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is continuously differentiable and $x^{\ast}$ is a local minimizer of $f$, then $\nabla f(x^{\ast}) = 0$.
Proof
Given an arbitrary constant vector $s \in \mathbb{R}^n$, we construct a new univariate function $g: \mathbb{R} \rightarrow \mathbb{R}$
$$
g(\alpha) = f(x^{\ast} + \alpha s)
$$
Because $x^{\ast}$ is a local minimizer of $f$, it is easy to see that $0$ is a local minimizer of $g$.
According to Taylor's theorem,
$$
\begin{align}
g(\alpha) &= g(0) + g^{\prime}(0) (\alpha - 0) + h(\alpha) (\alpha - 0) \\
&= g(0) + g^{\prime}(0) \alpha + h(\alpha) \alpha \\
\end{align}
$$
where
$$
\lim_{\alpha \rightarrow 0} h(\alpha) = 0
$$
Suppose $g^{\prime}(0) \neq 0$. Because $0$ is a local minimizer of $g$ and
$$
\lim_{\alpha \rightarrow 0} h(\alpha) = 0
$$
there must exist an $\varepsilon > 0$ such that for all $0 < \lvert \alpha \rvert < \varepsilon$
$$
g(\alpha) \geq g(0)
$$
$$
\lvert h(\alpha) \rvert < \big\lvert g^{\prime}(0) \big\rvert
$$
Consequently, for all $0 < \lvert \alpha \rvert < \varepsilon$
$$
\begin{align}
g(\alpha) - g(0) &= g^{\prime}(0) \alpha + h(\alpha) \alpha \\
&= \big( g^{\prime}(0) + h(\alpha) \big) \alpha \\
&\geq 0
\end{align}
$$
If $g^{\prime}(0) < 0$, then $g^{\prime}(0) + h(\alpha) < 0$, and $\big( g^{\prime}(0) + h(\alpha) \big) \alpha < 0$ when $0 < \alpha < \varepsilon$. This raises a contradiction.
If $g^{\prime}(0) > 0$, then $g^{\prime}(0) + h(\alpha) > 0$, and $\big( g^{\prime}(0) + h(\alpha) \big) \alpha < 0$ when $-\varepsilon < \alpha < 0$. This raises a contradiction.
Therefore,
$$
g^{\prime}(0) = 0
$$
Let’s derive what $g^{\prime}(\alpha)$ is.
$$
\begin{align}
g^{\prime}(\alpha) &= \frac{d g(\alpha)}{ d \alpha } \\
&= \frac{d f(x^{\ast} + \alpha s)}{ d \alpha } \\
\end{align}
$$
By applying the chain rule with $u = x^{\ast} + \alpha s$, we have
$$
\begin{align}
g^{\prime}(\alpha)
&= \frac{d f(x^{\ast} + \alpha s)}{ d \alpha } \\
&= \frac{d f(u)}{ d \alpha } \\
&= \frac{\nabla f(u)^{\top} d u}{ d \alpha } \\
&= \frac{\nabla f(u)^{\top} s d \alpha}{ d \alpha } \\
&= \nabla f(u)^{\top} s \\
&= \nabla f(x^{\ast} + \alpha s)^{\top} s \\
\end{align}
$$
where $\nabla f^{\top}$ is a row vector and $s$ is a column vector. Specifically,
$$
\nabla f^{\top} = \bigg[ \frac{\partial f}{\partial u_1}, \frac{\partial f}{\partial u_2}, \cdots, \frac{\partial f}{\partial u_n} \bigg]
$$
Therefore,
$$
\begin{align}
g^{\prime}(0) &= \nabla f(x^{\ast})^{\top} s \\
&= 0 \\
\end{align}
$$
Because $s$ is arbitrary,
$$
\nabla f(x^{\ast}) = 0
$$
This concludes the proof.
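As a numerical illustration, consider the convex least-squares objective $f(x) = \lVert Ax - b \rVert^2$ (a convenient example): the gradient $2 A^{\top} (Ax - b)$ indeed vanishes at the minimizer, up to floating-point round-off.
```python
# For f(x) = ||A x - b||^2, the least-squares solution is a minimizer, and the
# gradient 2 A^T (A x - b) vanishes there, as the first-order condition requires.
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(10, 3))
b = rng.normal(size=10)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # least-squares minimizer
grad_at_minimizer = 2.0 * A.T @ (A @ x_star - b)
print(f"||grad f(x*)|| = {np.linalg.norm(grad_at_minimizer):.2e}")  # ~ 0
```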
Global Minimizer
If function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex and differentiable, then $x^{\ast}$ is a global minimizer of $f$ if and only if $\nabla f(x^{\ast}) = 0$.
Proof
Assume $x^{\ast}$ is a global minimizer, then $x^{\ast}$ is also a local minimizer. According to the first-order necessary conditions for local minimizer, $\nabla f(x^{\ast}) = 0$.
Now assume $\nabla f(x^{\ast}) = 0$. According to the sufficient and necessary condition for convex functions, for every $y \in \mathbb{R}^n$ we must have
$$
\begin{align}
f(y) &\geq f(x^{\ast}) + \nabla f(x^{\ast})^{\top} (y - x^{\ast}) \\
&\geq f(x^{\ast}) \\
\end{align}
$$
Because $y$ is arbitrary, $x^{\ast}$ is a global minimizer.
This concludes the proof.
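Continuing the least-squares example (a convex and differentiable function), the point where the gradient vanishes is indeed a global minimizer: randomly sampled candidate points never achieve a smaller objective value.
```python
# For the convex function f(x) = ||A x - b||^2, the point with zero gradient (the
# least-squares solution) attains an objective value no larger than that of any
# randomly sampled candidate point.
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(10, 3))
b = rng.normal(size=10)

f = lambda x: np.sum((A @ x - b) ** 2)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # the gradient of f vanishes here

candidates = rng.normal(size=(1000, 3))
assert all(f(y) >= f(x_star) - 1e-12 for y in candidates)
print(f"f(x*) = {f(x_star):.4f} <= best random candidate = "
      f"{min(f(y) for y in candidates):.4f}")
```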
Second-Order Necessary Conditions for Local Minimizer
If function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is $2$ times continuously differentiable, suppose $x^{\ast}$ is a local minimizer of $f$, then the Hessian matrix $H(x^{\ast})$, i.e., $\nabla^2 f(x^{\ast})$, must be positive semi-definite.
Proof
Given an arbitrary constant vector $s \in \mathbb{R}^n$, we construct a new univariate function $g: \mathbb{R} \rightarrow \mathbb{R}$
$$
g(\alpha) = f(x^{\ast} + \alpha s)
$$
According to Taylor's theorem,
$$
\begin{align}
g(\alpha) &= g(0) + g^{\prime}(0) (\alpha - 0) + \frac{1}{2} g^{\prime\prime}(0) (\alpha - 0)^2 + h(\alpha) (\alpha - 0)^2 \\
&= g(0) + g^{\prime}(0) \alpha + \frac{1}{2} g^{\prime\prime}(0) \alpha^2 + h(\alpha) \alpha^2 \\
\end{align}
$$
where
$$
\lim_{\alpha \rightarrow 0} h(\alpha) = 0
$$
Because $x^{\ast}$ is a local minimizer of $f$, it is easy to see that $0$ is a local minimizer of $g$. By the first-order necessary condition for local minimum, we must have
$$
g^{\prime}(0) = 0
$$
We further claim that
$$
g^{\prime\prime}(0) \geq 0
$$
and we will prove it by contradiction.
Suppose $g^{\prime\prime}(0) < 0$. Because $0$ is a local minimizer of $g$ and
$$
\lim_{\alpha \rightarrow 0} h(\alpha) = 0
$$
there must exist an $\varepsilon > 0$ such that for all $0 < \lvert \alpha \rvert < \varepsilon$
$$
g(\alpha) \geq g(0)
$$
$$
\lvert h(\alpha) \rvert < \bigg\lvert \frac{1}{2} g^{\prime\prime}(0) \bigg\rvert
$$
Multiplying both sides of the last inequality by the positive $\alpha^2$ gives
$$
\lvert h(\alpha) \rvert \alpha^2 < \bigg\lvert \frac{1}{2} g^{\prime\prime}(0) \bigg\rvert \alpha^2 = -\frac{1}{2} g^{\prime\prime}(0) \alpha^2
$$
However, because $g^{\prime}(0) = 0$ and $h(\alpha) \alpha^2 \leq \lvert h(\alpha) \rvert \alpha^2$, for all $0 < \lvert \alpha \rvert < \varepsilon$
$$
\begin{align}
g(\alpha) - g(0) &= g^{\prime}(0) \alpha + \frac{1}{2} g^{\prime\prime}(0) \alpha^2 + h(\alpha) \alpha^2 \\
&= \frac{1}{2} g^{\prime\prime}(0) \alpha^2 + h(\alpha) \alpha^2 \\
&< 0
\end{align}
$$
which contradicts $g(\alpha) \geq g(0)$.
Therefore,
$$
g^{\prime\prime}(0) \geq 0
$$
We have shown that
$$
\begin{align}
g^{\prime}(\alpha)
&= \nabla f(u)^{\top} s \\
\end{align}
$$
where $u = x^{\ast} + \alpha s$, $\nabla f^{\top}$ is a row vector and $s$ is a column vector. Specifically,
$$
\nabla f^{\top} = \bigg[ \frac{\partial f}{\partial u_1}, \frac{\partial f}{\partial u_2}, \cdots, \frac{\partial f}{\partial u_n} \bigg]
$$
If we expand the dot product, we have
$$
\begin{align}
g^{\prime}(\alpha)
&= \sum_{i=1}^{n} \frac{\partial f(u)}{\partial u_i} s_i \\
\end{align}
$$
Similarly,
$$
\begin{align}
g^{\prime\prime}(\alpha)
&= \sum_{i=1}^{n} \frac{d \frac{\partial f(u)}{\partial u_i}}{d \alpha} s_i \\
&= \sum_{i=1}^{n} \frac{ \nabla \frac{\partial f(u)}{\partial u_i}^{\top} du}{d \alpha} s_i \\
&= \sum_{i=1}^{n} \nabla \frac{\partial f(u)}{\partial u_i}^{\top} \frac{ du}{d \alpha} s_i \\
&= \sum_{i=1}^{n} \nabla \frac{\partial f(u)}{\partial u_i}^{\top} s s_i \\
\end{align}
$$
where
$$
\nabla \frac{\partial f}{\partial u_i}^{\top} = \bigg[ \frac{\partial^2 f}{\partial u_i \partial u_1}, \frac{\partial^2 f}{\partial u_i \partial u_2}, \cdots, \frac{\partial^2 f}{\partial u_i \partial u_n} \bigg]
$$
If we expand the dot product, we have
$$
\begin{align}
g^{\prime\prime}(\alpha)
&= \sum_{i=1}^{n} \nabla \frac{\partial f(u)}{\partial u_i}^{\top} s s_i \\
&= \sum_{i=1}^{n} \bigg( \sum_{j=1}^{n} \frac{\partial^2 f(u)}{\partial u_i \partial u_j} s_j \bigg) s_i \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial^2 f(u)}{\partial u_i \partial u_j} s_i s_j \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial^2 f(x^{\ast} + \alpha s)}{\partial u_i \partial u_j} s_i s_j \\
&= s^{\top} \nabla^2 f(u) s \\
&= s^{\top} \nabla^2 f(x^{\ast} + \alpha s) s \\
\end{align}
$$
Therefore,
$$
\begin{align}
g^{\prime\prime}(0)
&= s^{\top} \nabla^2 f(x^{\ast}) s \\
&\geq 0 \\
\end{align}
$$
Because $s$ is arbitrary, $\nabla^2 f(x^{\ast})$ is positive semi-definite.
This concludes the proof.
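As a numerical illustration of this condition, for the least-squares example the Hessian is the constant matrix $2 A^{\top} A$, and $s^{\top} \nabla^2 f(x^{\ast}) s = 2 \lVert A s \rVert^2 \geq 0$ for every sampled direction $s$.
```python
# For f(x) = ||A x - b||^2, the Hessian is the constant matrix 2 A^T A, and
# s^T (2 A^T A) s = 2 ||A s||^2 >= 0 for every direction s.
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(10, 3))
H = 2.0 * A.T @ A  # Hessian of ||A x - b||^2, independent of x

for _ in range(5):
    s = rng.normal(size=3)
    print(f"s^T H s = {s @ H @ s:.4f} (>= 0)")
print(f"Hessian eigenvalues: {np.round(np.linalg.eigvalsh(H), 4)}")
```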
Hessian Matrix
A twice differentiable function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if and only if its Hessian matrix $\nabla^2 f(x)$ is positive semi-definite for all $x \in \mathbb{R}^n$.
Proof
Assume $f$ is convex and let $a \in \mathbb{R}^n$ be arbitrary. We define a new function $g: \mathbb{R}^n \rightarrow \mathbb{R}$.
$$
g(y) = f(y) - \nabla f(a)^{\top} (y - a)
$$
Because $- \nabla f(a)^{\top} (y - a)$ is a linear function of $y$, it is convex. Because $g(y)$ is the sum of two convex functions, $f(y)$ and $- \nabla f(a)^{\top} (y - a)$, it is also convex.
The gradient and Hessian of $g$ are
$$
\nabla g(y) = \nabla f(y) - \nabla f(a)
$$
$$
\nabla^2 g(y) = \nabla^2 f(y)
$$
for all $y \in \mathbb{R}^n$.
We also note that
$$
\begin{align}
\nabla g(a) = \nabla f(a) - \nabla f(a) = 0
\end{align}
$$
From the property of the global minimizer of a convex and differentiable function, we know that $a$ is a global minimizer of $g$. Based on the second-order necessary condition for a local minimizer (global minimizer is also a local minimizer), $\nabla^2 g(a)$ is positive semi-definite. Because $\nabla^2 g(a) = \nabla^2 f(a)$, $\nabla^2 f(a)$ is also positive semi-definite. Because $a \in \mathbb{R}^n$ is arbitrary, the Hessian matrix $\nabla^2 f(x)$ is positive semi-definite for all $x \in \mathbb{R}^n$.
Now assume the Hessian matrix $\nabla^2 f(x)$ is positive semi-definite for all $x \in \mathbb{R}^n$. According to Taylor's theorem, for any $x, y \in \mathbb{R}^n$,
$$
f(y) = f(x) + \nabla f(x)^{\top} (y-x) + \frac{1}{2} (y-x)^{\top} \nabla^2 f(x + t(y-x)) (y-x)
$$
for some $t \in (0, 1)$.
Because $\nabla^2 f(x)$ is positive semi-definite for all $x \in \mathbb{R}^n$, $(y-x)^{\top} \nabla^2 f(x + t(y-x)) (y-x) \geq 0$. Therefore,
$$
f(y) \geq f(x) + \nabla f(x)^{\top} (y-x)
$$
By the sufficient and necessary condition for convex functions, this proves that $f$ is convex.
This concludes the proof.
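Finally, the theorem can be used in the direction mentioned in the introduction: exhibiting a single point where the Hessian is not positive semi-definite certifies that a function is not convex. The sketch below does this for the example $f(x) = x_1^3 + x_2^2$, whose Hessian is $\operatorname{diag}(6 x_1, 2)$.
```python
# For f(x) = x_1^3 + x_2^2, the Hessian diag(6 * x_1, 2) has a negative eigenvalue
# whenever x_1 < 0, so f is not convex.
import numpy as np


def hessian(x):
    return np.array([[6.0 * x[0], 0.0],
                     [0.0, 2.0]])


x = np.array([-1.0, 0.0])
eigenvalues = np.linalg.eigvalsh(hessian(x))
print(f"Hessian eigenvalues at x = {x}: {eigenvalues}")  # [-6.  2.] -> not PSD
print("f is not convex because its Hessian is not positive semi-definite at this x.")
```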