### Lei Mao

Machine Learning, Artificial Intelligence, Computer Science.

# Hessian Matrix of Convex Functions

### Introduction

The Hessian matrix is useful for determining whether a function is convex. Specifically, a twice differentiable function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if and only if its Hessian matrix $\nabla^2 f(x)$ is positive semi-definite for all $x \in \mathbb{R}^n$. Conversely, if we can find an $x \in \mathbb{R}^n$ such that $\nabla^2 f(x)$ is not positive semi-definite, then $f$ is not convex.
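As a quick numerical illustration (the two example functions and the `numpy`-based check below are my own additions, not part of the original statement), we can test positive semi-definiteness of a Hessian by checking the signs of its eigenvalues:

```python
import numpy as np

# f(x) = x1^4 + x2^2 is convex; its Hessian diag(12 * x1^2, 2) is
# positive semi-definite everywhere.
def hessian_f(x):
    return np.diag([12.0 * x[0] ** 2, 2.0])

# g(x) = x1^2 - x2^2 is not convex; its Hessian diag(2, -2) has a
# negative eigenvalue at every point.
def hessian_g(x):
    return np.diag([2.0, -2.0])

def is_psd(H, tol=1e-12):
    # A symmetric matrix is positive semi-definite iff all of its
    # eigenvalues are non-negative.
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

x = np.array([0.5, -1.0])
print(is_psd(hessian_f(x)))  # True
print(is_psd(hessian_g(x)))  # False, so g is not convex
```

Because the Hessian of $g$ has a negative eigenvalue at some point, $g$ cannot be convex; the Hessian of $f$ passes the check at every point, which is consistent with its convexity.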

In this blog post, I would like to present a self-contained discussion and proof of this property.

### Convex Function Definitions

Here are the definitions of a function being convex, strictly convex, and strongly convex. Strong convexity implies strict convexity, and strict convexity implies convexity.

#### Convex Function

A function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if its domain $X$ is a convex set and for any $x_1, x_2 \in X$, for all $\lambda \in [0, 1]$, we have

$f(\lambda x_1 + (1-\lambda) x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2)$

Any local minimum of $f$ is also a global minimum of $f$.
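As a sanity check of the definition (using $f(x) = \lVert x \rVert^2$, my own example, which is known to be convex), we can sample random points and verify the defining inequality:

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = ||x||^2 is convex; check the defining inequality
# f(l * x1 + (1 - l) * x2) <= l * f(x1) + (1 - l) * f(x2)
# at randomly sampled points and interpolation weights.
def f(x):
    return float(np.dot(x, x))

for _ in range(1000):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform()
    lhs = f(lam * x1 + (1.0 - lam) * x2)
    rhs = lam * f(x1) + (1.0 - lam) * f(x2)
    assert lhs <= rhs + 1e-12
print("the convexity inequality holds at all sampled points")
```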

#### Strictly Convex Function

A function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is strictly convex if its domain $X$ is a convex set and for any $x_1, x_2 \in X$ and $x_1 \neq x_2$, for all $\lambda \in (0, 1)$, we have

$f(\lambda x_1 + (1-\lambda) x_2) < \lambda f(x_1) + (1-\lambda) f(x_2)$

Any local minimum of $f$ is also a global minimum of $f$, and there is at most one local minimum.

#### Strongly Convex Function

A function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is strongly convex if there exists $\alpha > 0$ such that $f(x) - \alpha \lVert x \rVert^2$ is convex.
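To see that strict convexity does not imply strong convexity, consider the univariate function $f(x) = x^4$ (my own example): it is strictly convex, but $x^4 - \alpha x^2$ fails to be convex near $0$ for any $\alpha > 0$. A minimal numerical sketch:

```python
# In one dimension, f(x) = x^4 is strictly convex but not strongly
# convex: for any alpha > 0, g(x) = x^4 - alpha * x^2 satisfies
# g''(0) = -2 * alpha < 0, so g is not convex near 0.
alpha = 0.1

def g(x):
    return x ** 4 - alpha * x ** 2

# Exhibit a violation of the midpoint convexity inequality around 0.
x1, x2 = -0.1, 0.1
mid = g(0.5 * (x1 + x2))           # g(0) = 0
chord = 0.5 * g(x1) + 0.5 * g(x2)  # = 1e-4 - 1e-3 < 0
print(mid > chord)  # True: the convexity inequality fails for g
```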

### Prerequisites

#### Directional Derivative

Given the function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, the directional derivative is defined as

$\nabla_{h} f(x) = \lim_{t \rightarrow 0} \frac{f(x + th) - f(x)}{t}$

If the function is differentiable at $x$, then by the definition of differentiability, $f(x + th) = f(x) + t \nabla f(x)^{\top} h + o(t)$, where $o(t)$ denotes a term satisfying $\lim_{t \rightarrow 0} o(t) / t = 0$. Therefore,

\begin{align} \nabla_{h} f(x) &= \lim_{t \rightarrow 0} \frac{f(x + th) - f(x)}{t} \\ &= \lim_{t \rightarrow 0} \frac{t \nabla f(x)^{\top} h + o(t)}{t} \\ &= \nabla f(x)^{\top} h + \lim_{t \rightarrow 0} \frac{o(t)}{t} \\ &= \nabla f(x)^{\top} h \\ \end{align}

Namely,

\begin{align} \nabla_{h} f(x) &= \nabla f(x)^{\top} h\\ \end{align}
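We can verify this identity numerically with a finite difference (the example function $f(x) = \sin(x_1) + x_2^2$ is my own choice):

```python
import numpy as np

# Verify nabla_h f(x) = grad f(x)^T h by finite differences for
# f(x) = sin(x1) + x2^2.
def f(x):
    return np.sin(x[0]) + x[1] ** 2

def grad_f(x):
    return np.array([np.cos(x[0]), 2.0 * x[1]])

x = np.array([0.3, -0.7])
h = np.array([1.0, 2.0])

t = 1e-6
finite_diff = (f(x + t * h) - f(x)) / t  # definition of the derivative
analytic = grad_f(x) @ h                 # gradient dotted with h

print(abs(finite_diff - analytic) < 1e-4)  # True
```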

#### Sum of Convex Functions is Convex

If functions $f, g: \mathbb{R}^n \rightarrow \mathbb{R}$ are both convex, then $f+g$ is also convex.

This can be easily proved using the definition of convex functions.

Proof

Because $f$ and $g$ are both convex functions, we have

\begin{align} (f+g)(\lambda x_1 + (1-\lambda) x_2) &= f(\lambda x_1 + (1-\lambda) x_2) + g(\lambda x_1 + (1-\lambda) x_2) \\ &\leq \lambda f(x_1) + (1-\lambda) f(x_2) + \lambda g(x_1) + (1-\lambda) g(x_2) \\ &= \lambda (f+g)(x_1) + (1-\lambda) (f+g)(x_2) \\ \end{align}

Therefore,

$(f+g)(\lambda x_1 + (1-\lambda) x_2) \leq \lambda (f+g)(x_1) + (1-\lambda) (f+g)(x_2)$

$f+g$ must be convex.

This concludes the proof.

#### Linear Functions are Convex

If function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is linear (more precisely, affine), then $f$ is convex.

Proof

Suppose $f(x) = a^{\top} x + b$ for some $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$.

\begin{align} f(\lambda x_1 + (1-\lambda) x_2) &= a^{\top} (\lambda x_1 + (1-\lambda) x_2) + b \\ &= a^{\top} \lambda x_1 + \lambda b + a^{\top} (1-\lambda) x_2 + b -\lambda b\\ &= \lambda (a^{\top} x_1 + b) + (1-\lambda) (a^{\top} x_2 + b) \\ &= \lambda f(x_1) + (1-\lambda) f(x_2) \\ \end{align}

Therefore,

$f(\lambda x_1 + (1-\lambda) x_2) = \lambda f(x_1) + (1-\lambda) f(x_2)$

$f$ must be convex.

This concludes the proof.

#### Convex Function Sufficient and Necessary Condition

Assume that function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is differentiable, then $f$ is convex if and only if for every $x, y \in \mathbb{R}^n$, the inequality

$f(y) \geq f(x) + \nabla f(x)^{\top} (y - x)$

is satisfied.

This can be proved using the definition of convex functions.

Proof

Assume $f$ is convex and we have $x, y \in \mathbb{R}^n$ and $x \neq y$. By the definition of convex functions, we have

$f\Big( \frac{1}{2}x + \frac{1}{2}y \Big) \leq \frac{1}{2} f(x) + \frac{1}{2} f(y)$

We denote $h = y - x$. Then the inequality becomes

$f\Big( x + \frac{1}{2}h \Big) \leq \frac{1}{2} f(x) + \frac{1}{2} f(x + h)$

We can subtract $f(x)$ from both sides of the inequality without breaking it.

$f\Big( x + \frac{1}{2}h \Big) - f(x) \leq \frac{1}{2} f(x + h) - \frac{1}{2} f(x)$

and thus

$f(x + h) - f(x) \geq \frac{f(x + 2^{-1}h) - f(x)}{2^{-1}}$

Applying the same argument with $h$ replaced by $2^{1-k}h$ for $k \geq 1$, we equivalently have

$f(x + 2^{1-k}h) - f(x) \geq \frac{f(x + 2^{-k}h) - f(x)}{2^{-1}}$

Therefore,

\begin{align} f(x + h) - f(x) &\geq \frac{f(x + 2^{-1}h) - f(x)}{2^{-1}} \\ &\geq \frac{f(x + 2^{-2}h) - f(x)}{2^{-2}} \\ &\vdots \\ &\geq \frac{f(x + 2^{-k}h) - f(x)}{2^{-k}} \\ \end{align}

Note that according to the definition of directional derivative,

$\lim_{k \rightarrow \infty} \frac{f(x + 2^{-k}h) - f(x)}{2^{-k}} = \nabla_{h} f(x) = \nabla f(x)^{\top} h$

Therefore,

\begin{align} f(x + h) - f(x) &\geq \nabla f(x)^{\top} h \end{align}

We replace $h$ with $y - x$ to get

\begin{align} f(y) - f(x) &\geq \nabla f(x)^{\top} (y-x) \end{align}

Therefore,

\begin{align} f(y) &\geq f(x) + \nabla f(x)^{\top} (y-x) \end{align}

Now assume for every $x, y \in \mathbb{R}^n$, we have

$f(y) \geq f(x) + \nabla f(x)^{\top} (y - x)$

For any arbitrary $w, v \in \mathbb{R}^n$, and $0 \leq \lambda \leq 1$, we denote

$x = \lambda w + (1 - \lambda) v$

Then according to the inequality assumption, we have

\begin{align} f(w) &\geq f(x) + \nabla f(x)^{\top} (w - x) \\ f(v) &\geq f(x) + \nabla f(x)^{\top} (v - x) \\ \end{align}

Note that

$\begin{gather} v - x = \lambda(v - w) \\ w - x = (1 - \lambda) (w - v) \\ \end{gather}$

So we could have the following inequality,

\begin{align} \lambda f(w) + (1 - \lambda) f(v) &\geq \lambda \Big( f(x) + \nabla f(x)^{\top} (w - x) \Big) + (1 - \lambda) \Big( f(x) + \nabla f(x)^{\top} (v - x) \Big) \\ &= f(x) + \nabla f(x)^{\top} \Big( \lambda (w - x) + (1 - \lambda)(v - x) \Big) \\ &= f(x) + \nabla f(x)^{\top} \Big( \lambda (1 - \lambda) (w - v) + (1 - \lambda)\lambda(v - w) \Big) \\ &= f(x) \\ &= f(\lambda w + (1 - \lambda) v) \\ \end{align}

Because $w$ and $v$ are arbitrary, therefore, by definition, $f$ is a convex function.

This concludes the proof.

Similarly, for a strictly convex function, for every $x, y \in \mathbb{R}^n$ with $x \neq y$, the inequality becomes

$f(y) > f(x) + \nabla f(x)^{\top} (y - x)$

and it can be proved in a fashion similar to the one we used for convex functions.
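As a numerical sanity check of the first-order condition (using the convex function $f(x) = \lVert x \rVert^2$ with $\nabla f(x) = 2x$, my own example), we can sample random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(2)

# For the convex function f(x) = ||x||^2 with grad f(x) = 2x, check
# f(y) >= f(x) + grad f(x)^T (y - x) at randomly sampled pairs.
def f(x):
    return float(np.dot(x, x))

def grad_f(x):
    return 2.0 * x

for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-12
print("the first-order condition holds at all sampled pairs")
```

For this particular $f$, the gap $f(y) - f(x) - \nabla f(x)^{\top}(y - x)$ equals $\lVert y - x \rVert^2$, so the inequality holds exactly.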

#### Taylor Theorem

The univariate and multivariate Taylor theorems have been discussed and proved in detail in my previous blog post “Taylor Theorem”.

For the quadratic multivariate Taylor theorem, let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be a twice differentiable function, and let $a, x \in \mathbb{R}^n$. Using the Lagrange form of the remainder, there exists $t \in (0, 1)$ such that

\begin{align} f(x) &= f(a) + \nabla f(a)^{\top} (x-a) + \frac{1}{2} (x-a)^{\top} H(a + t(x - a)) (x-a) \\ \end{align}

where $H = \nabla^2 f$ is the Hessian matrix of $f$.
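In the univariate special case, the Lagrange form of the remainder can even be solved for $t$ in closed form. For example (my own choice), with $f(x) = e^x$, $a = 0$, and $x = 1$, the theorem asserts $e = f(0) + f^{\prime}(0) + \frac{1}{2} f^{\prime\prime}(t)$ for some $t \in (0, 1)$:

```python
import numpy as np

# For f(x) = exp(x), a = 0, x = 1, the quadratic Taylor theorem with
# the Lagrange remainder asserts e = 1 + 1 + exp(t) / 2 for some
# t in (0, 1).  Solving gives t = log(2 * (e - 2)).
t = np.log(2.0 * (np.e - 2.0))

print(0.0 < t < 1.0)                            # True
print(np.isclose(2.0 + np.exp(t) / 2.0, np.e))  # True
```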

#### First-Order Necessary Conditions for Local Minimizer

If function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is continuously differentiable and $x^{\ast}$ is a local minimizer of $f$, then $\nabla f(x^{\ast}) = 0$.

Proof

Given an arbitrary constant $s \in \mathbb{R}^n$, we construct a new univariate function $g: \mathbb{R} \rightarrow \mathbb{R}$

$g(\alpha) = f(x^{\ast} + \alpha s)$

Because $x^{\ast}$ is a local minimizer of $f$, it is easy to see that $0$ is a local minimizer of $g$.

According to Taylor theorem,

\begin{align} g(\alpha) &= g(0) + g^{\prime}(0) (\alpha - 0) + h(\alpha) (\alpha - 0) \\ &= g(0) + g^{\prime}(0) \alpha + h(\alpha) \alpha \\ \end{align}

where

$\lim_{\alpha \rightarrow 0} h(\alpha) = 0$

Suppose $g^{\prime}(0) \neq 0$. Because

$\lim_{\alpha \rightarrow 0} h(\alpha) = 0$

there must exist an $\varepsilon > 0$ such that for all $0 < \lvert \alpha \rvert < \varepsilon$,

$\lvert h(\alpha) \rvert < \big\lvert g^{\prime}(0) \big\rvert$

This means that $g^{\prime}(0) + h(\alpha)$ has the same sign as $g^{\prime}(0)$ for all $0 < \lvert \alpha \rvert < \varepsilon$. Note that

\begin{align} g(\alpha) - g(0) &= g^{\prime}(0) \alpha + h(\alpha) \alpha \\ &= \big( g^{\prime}(0) + h(\alpha) \big) \alpha \\ \end{align}

If $g^{\prime}(0) < 0$, then $g^{\prime}(0) + h(\alpha) < 0$, and $\big( g^{\prime}(0) + h(\alpha) \big) \alpha < 0$ when $0 < \alpha < \varepsilon$, i.e., $g(\alpha) < g(0)$. This contradicts the fact that $0$ is a local minimizer of $g$.

If $g^{\prime}(0) > 0$, then $g^{\prime}(0) + h(\alpha) > 0$, and $\big( g^{\prime}(0) + h(\alpha) \big) \alpha < 0$ when $-\varepsilon < \alpha < 0$, i.e., $g(\alpha) < g(0)$. This also contradicts the fact that $0$ is a local minimizer of $g$.

Therefore,

$g^{\prime}(0) = 0$

Let’s derive what $g^{\prime}(\alpha)$ is.

\begin{align} g^{\prime}(\alpha) &= \frac{d g(\alpha)}{ d \alpha } \\ &= \frac{d f(x^{\ast} + \alpha s)}{ d \alpha } \\ \end{align}

Applying the chain rule in differential form, with $u = x^{\ast} + \alpha s$, we have

\begin{align} g^{\prime}(\alpha) &= \frac{d f(x^{\ast} + \alpha s)}{ d \alpha } \\ &= \frac{d f(u)}{ d \alpha } \\ &= \frac{\nabla f(u)^{\top} d u}{ d \alpha } \\ &= \frac{\nabla f(u)^{\top} s d \alpha}{ d \alpha } \\ &= \nabla f(u)^{\top} s \\ &= \nabla f(x^{\ast} + \alpha s)^{\top} s \\ \end{align}

where $\nabla f^{\top}$ is a row vector and $s$ is a column vector. Specifically,

$\nabla f^{\top} = \bigg[ \frac{\partial f}{\partial u_1}, \frac{\partial f}{\partial u_2}, \cdots, \frac{\partial f}{\partial u_n} \bigg]$

Therefore,

\begin{align} g^{\prime}(0) &= \nabla f(x^{\ast})^{\top} s \\ &= 0 \\ \end{align}

Because $s$ is arbitrary, therefore,

$\nabla f(x^{\ast}) = 0$

This concludes the proof.
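For a convex quadratic $f(x) = \frac{1}{2} x^{\top} A x - b^{\top} x$ with $A$ symmetric positive definite (my own example), the minimizer can be computed in closed form and the first-order condition checked directly:

```python
import numpy as np

# Convex quadratic f(x) = (1/2) x^T A x - b^T x with A symmetric
# positive definite.  The gradient is grad f(x) = A x - b, so the
# minimizer solves the linear system A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # eigenvalues are positive
b = np.array([1.0, -1.0])

x_star = np.linalg.solve(A, b)  # the unique (global) minimizer
grad = A @ x_star - b           # gradient evaluated at the minimizer

print(np.allclose(grad, 0.0))  # True: first-order condition holds
```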

#### Global Minimizer

If function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex and differentiable, then $x^{\ast}$ is a global minimizer of $f$ if and only if $\nabla f(x^{\ast}) = 0$.

Proof

Assume $x^{\ast}$ is a global minimizer, then $x^{\ast}$ is also a local minimizer. According to the first-order necessary conditions for local minimizer, $\nabla f(x^{\ast}) = 0$.

Now assume $\nabla f(x^{\ast}) = 0$, according to the sufficient and necessary condition for convex functions, we must have

\begin{align} f(y) &\geq f(x^{\ast}) + \nabla f(x^{\ast})^{\top} (y - x^{\ast}) \\ &= f(x^{\ast}) \\ \end{align}

Because $y$ is arbitrary, therefore $x^{\ast}$ is a global minimizer.

This concludes the proof.

#### Second-Order Necessary Conditions for Local Minimizer

If function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is twice continuously differentiable and $x^{\ast}$ is a local minimizer of $f$, then the Hessian matrix $H(x^{\ast})$, i.e., $\nabla^2 f(x^{\ast})$, must be positive semi-definite.

Proof

Given an arbitrary constant $s \in \mathbb{R}^n$, we construct a new univariate function $g: \mathbb{R} \rightarrow \mathbb{R}$

$g(\alpha) = f(x^{\ast} + \alpha s)$

According to Taylor theorem,

\begin{align} g(\alpha) &= g(0) + g^{\prime}(0) (\alpha - 0) + \frac{1}{2} g^{\prime\prime}(0) (\alpha - 0)^2 + h(\alpha) (\alpha - 0)^2 \\ &= g(0) + g^{\prime}(0) \alpha + \frac{1}{2} g^{\prime\prime}(0) \alpha^2 + h(\alpha) \alpha^2 \\ \end{align}

where

$\lim_{\alpha \rightarrow 0} h(\alpha) = 0$

Because $x^{\ast}$ is a local minimizer of $f$, it is easy to see that $0$ is a local minimizer of $g$. By the first-order necessary condition for local minimum, we must have

$g^{\prime}(0) = 0$

We further claim that

$g^{\prime\prime}(0) \geq 0$

and we will prove it by contradiction.

Suppose $g^{\prime\prime}(0) < 0$. Because

$\lim_{\alpha \rightarrow 0} h(\alpha) = 0$

there must exist an $\varepsilon > 0$ such that for all $0 < \lvert \alpha \rvert < \varepsilon$,

$\lvert h(\alpha) \rvert < \bigg\lvert \frac{1}{2} g^{\prime\prime}(0) \bigg\rvert = -\frac{1}{2} g^{\prime\prime}(0)$

We further multiply the positive $\alpha^2$ on both sides of the inequality.

$\lvert h(\alpha) \rvert \alpha^2 < -\frac{1}{2} g^{\prime\prime}(0) \alpha^2$

Then for all $0 < \lvert \alpha \rvert < \varepsilon$, because $g^{\prime}(0) = 0$,

\begin{align} g(\alpha) - g(0) &= g^{\prime}(0) \alpha + \frac{1}{2} g^{\prime\prime}(0) \alpha^2 + h(\alpha) \alpha^2 \\ &\leq \frac{1}{2} g^{\prime\prime}(0) \alpha^2 + \lvert h(\alpha) \rvert \alpha^2 \\ &< \frac{1}{2} g^{\prime\prime}(0) \alpha^2 - \frac{1}{2} g^{\prime\prime}(0) \alpha^2 \\ &= 0 \end{align}

so $g(\alpha) < g(0)$. This contradicts the fact that $0$ is a local minimizer of $g$.

Therefore,

$g^{\prime\prime}(0) \geq 0$

We have shown that

\begin{align} g^{\prime}(\alpha) &= \nabla f(u)^{\top} s \\ \end{align}

where $u = x^{\ast} + \alpha s$, $\nabla f^{\top}$ is a row vector and $s$ is a column vector. Specifically,

$\nabla f^{\top} = \bigg[ \frac{\partial f}{\partial u_1}, \frac{\partial f}{\partial u_2}, \cdots, \frac{\partial f}{\partial u_n} \bigg]$

If we expand the dot product, we have

\begin{align} g^{\prime}(\alpha) &= \sum_{i=1}^{n} \frac{\partial f(u)}{\partial u_i} s_i \\ \end{align}

Similarly,

\begin{align} g^{\prime\prime}(\alpha) &= \sum_{i=1}^{n} \frac{d \frac{\partial f(u)}{\partial u_i}}{d \alpha} s_i \\ &= \sum_{i=1}^{n} \frac{ \nabla \frac{\partial f(u)}{\partial u_i}^{\top} du}{d \alpha} s_i \\ &= \sum_{i=1}^{n} \nabla \frac{\partial f(u)}{\partial u_i}^{\top} \frac{ du}{d \alpha} s_i \\ &= \sum_{i=1}^{n} \nabla \frac{\partial f(u)}{\partial u_i}^{\top} s s_i \\ \end{align}

where

$\nabla \frac{\partial f}{\partial u_i}^{\top} = \bigg[ \frac{\partial^2 f}{\partial u_i \partial u_1}, \frac{\partial^2 f}{\partial u_i \partial u_2}, \cdots, \frac{\partial^2 f}{\partial u_i \partial u_n} \bigg]$

If we expand the dot product, we have

\begin{align} g^{\prime\prime}(\alpha) &= \sum_{i=1}^{n} \nabla \frac{\partial f(u)}{\partial u_i}^{\top} s s_i \\ &= \sum_{i=1}^{n} \bigg( \sum_{j=1}^{n} \frac{\partial^2 f(u)}{\partial u_i \partial u_j} s_j \bigg) s_i \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial^2 f(u)}{\partial u_i \partial u_j} s_i s_j \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial^2 f(x^{\ast} + \alpha s)}{\partial u_i \partial u_j} s_i s_j \\ &= s^{\top} \nabla^2 f(u) s \\ &= s^{\top} \nabla^2 f(x^{\ast} + \alpha s) s \\ \end{align}

Therefore,

\begin{align} g^{\prime\prime}(0) &= s^{\top} \nabla^2 f(x^{\ast}) s \\ &\geq 0 \\ \end{align}

Because $s$ is arbitrary, therefore, $\nabla^2 f(x^{\ast})$ is positive semi-definite.

This concludes the proof.
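The identity $g^{\prime\prime}(0) = s^{\top} \nabla^2 f(x^{\ast}) s$ can be checked with a central finite difference (the function $f(x) = x_1^2 + 3 x_2^2$ and the direction $s$ below are my own choices):

```python
import numpy as np

# f(x) = x1^2 + 3 * x2^2 has minimizer x* = 0 and constant Hessian
# diag(2, 6).  Check g''(0) = s^T H s with a central difference of
# g(alpha) = f(x* + alpha * s).
def f(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2

H = np.diag([2.0, 6.0])
x_star = np.zeros(2)
s = np.array([1.0, -2.0])

def g(a):
    return f(x_star + a * s)

eps = 1e-4
g2 = (g(eps) - 2.0 * g(0.0) + g(-eps)) / eps ** 2  # approximates g''(0)

print(np.isclose(g2, s @ H @ s))  # True; s^T H s = 26 >= 0
```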

### Hessian Matrix

A twice differentiable function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if and only if its Hessian matrix $\nabla^2 f(x)$ is positive semi-definite for all $x \in \mathbb{R}^n$.

Proof

Assume $f$ is convex and let $a \in \mathbb{R}^n$ be an arbitrary point. We define a new function $g: \mathbb{R}^n \rightarrow \mathbb{R}$.

$g(y) = f(y) - \nabla f(a)^{\top} (y - a)$

Because $- \nabla f(a)^{\top} (y - a)$ is a linear function of $y$, it is convex. Because $g(y)$ is the sum of two convex functions, $f(y)$ and $- \nabla f(a)^{\top} (y - a)$, it is also convex.

The gradient and Hessian of $g$ are

$\nabla g(y) = \nabla f(y) - \nabla f(a)$

$\nabla^2 g(y) = \nabla^2 f(y)$

for all $y \in \mathbb{R}^n$.

We also note that

\begin{align} \nabla g(a) = \nabla f(a) - \nabla f(a) = 0 \end{align}

From the property of the global minimizer of a convex and differentiable function, we know that $a$ is a global minimizer of $g$. Based on the second-order necessary condition for a local minimizer (a global minimizer is also a local minimizer), $\nabla^2 g(a)$ is positive semi-definite. Because $\nabla^2 g(a) = \nabla^2 f(a)$, $\nabla^2 f(a)$ is also positive semi-definite. Because $a \in \mathbb{R}^n$ is arbitrary, the Hessian matrix $\nabla^2 f(x)$ is positive semi-definite for all $x \in \mathbb{R}^n$.

Assume the Hessian matrix $\nabla^2 f(x)$ is positive semi-definite for all $x \in \mathbb{R}^n$, according to Taylor’s theorem,

$f(y) = f(x) + \nabla f(x)^{\top} (y-x) + \frac{1}{2} (y-x)^{\top} \nabla^2 f(x + t(y-x)) (y-x)$

for some $t \in (0, 1)$.

Because $\nabla^2 f(x)$ is positive semi-definite for all $x \in \mathbb{R}^n$, $(y-x)^{\top} \nabla^2 f(x + t(y-x)) (y-x) \geq 0$. Therefore,

$f(y) \geq f(x) + \nabla f(x)^{\top} (y-x)$

By the sufficient and necessary condition for convex functions, this proves that $f$ is convex.

This concludes the proof.