# Floating Point Value Binary Representation

## Introduction

The fixed-bitwidth floating point value binary representation is a mysterious for most of the programmers and few people cared about what it is and how the floating point arithmetic works.

In this blog post, I would like to discuss the floating point formats, its general representations, and how the basic floating point arithmetic is performed on the processors compared to the integer arithmetic.

## Floating Point Formats

### Double (FP64)

The binary representation of Double (FP64), the default value representation in modern C/C++, consists of 64 bits.

• Sign: 1 bit
• Exponent: 11 bits
• Fraction: 52 bits

The floating point value it represents is

$$(-1)^{b_{63}}(1.b_{51} b_{50} \cdots b_{0})_{2} \times 2^{(b_{62} b_{61} \cdots b_{52})_{2}-2^{10}+1}$$

where

\begin{align} (1.b_{51} b_{50} \cdots b_{0})_{2} = 1 + \sum_{i=1}^{52} b_{52 - i} 2^{-i} \end{align}

\begin{align} (b_{62} b_{61} \cdots b_{52})_{2} = \sum_{i=0}^{10} b_{52 + i} 2^{i} \in [0, 2^{11}- 1] \end{align}

Because $2^{2^{11}- 1 - 2^{10} + 1} = 2^{1024} \approx 1.798 \times 10^{308}$, numbers between $-10^{308}$ and $10^{308}$ can be represented by FP64.

Because $2^{-52} \approx 2.22 \times 10^{-16}$, FP64 should give at least 16 significant digits.

In addition, because the bitwidth is finite and there are infinite number of numbers $-10^{308}$ and $10^{308}$, not every number could be represented exactly by FP64. Numbers that cannot be represented by FP64 exactly will be rounded to the closest representable values.

### General Representation

More generally, a floating point value could be represented using $m + n + 1$ number of bits.

• Sign: $1$ bit
• Exponent: $m$ bits
• Fraction: $n$ bits

The floating point value it represents is

$$(-1)^{b_{m+n}}(1.b_{n-1} b_{n-2} \cdots b_{0})_{2} \times 2^{(b_{m+n-1} b_{m+n-2} \cdots b_{n})_{2}-2^{m-1}+1}$$

where

\begin{align} (1.b_{n-1} b_{n-2} \cdots b_{0})_{2} = 1 + \sum_{i=1}^{n} b_{n - i} 2^{-i} \end{align}

\begin{align} (b_{m+n-1} b_{m+n-2} \cdots b_{n})_{2} = \sum_{i=0}^{m-1} b_{n + i} 2^{i} \end{align}

In contrast to the integer binary representation which represents values uniformly on a linear scale, the exponent of the floating point binary representation represents values uniformly on a log2 scale and the fraction of the floating point binary representation represents values uniformly on a linear scale.

Specifically, the spacing between $2^{i-1}$ and $2^{i}$ on a log2 scale is $\lvert \log_2{2^{i}} - \log_2{2^{i-1}} \rvert = 1$. There are $2^{n}$ values between $2^{i-1}$ and $2^{i}$. The spacing between any two adjacent values between $2^{i-1}$ and $2^{i}$ is $2^{-n} \cdot 2^{i-1}$ on a linear scale.

Therefore, he more number of bits in exponent, the larger dynamic range of the value the type type could represent. The more number of bits in fraction, the higher precision the value of the type could achieve.

### Special Values

In fact, in the IEEE standard for floating point numbers, not all the floating point values are calculated from the binary representations using the recipe above. The floating point values that are not calculated from the binary representations using the recipe above are called special values.

For example, it’s immediately obvious that using the recipe above, the binary representation could not represent the value of 0. So the IEEE standard defines that the binary representation for 0 has all the bits 0 for the exponent and fraction. This also leads to the two representations for 0, +0 and -0.

Using the recipe above, the smallest number the binary representation could present is $2^{-2^{m-1} + 1}$. However, the IEEE standard defines subnormal numbers which are smaller than $2^{-2^{m-1} + 1}$ when all the bits for the exponent are 0 and not all the bits for the fraction are 0. In this case, the floating point value it represents becomes

$$(-1)^{b_{m+n}}(0.b_{n-1} b_{n-2} \cdots b_{0})_{2} \times 2^{1-2^{m-1}+1}$$

Notice that the implicit leading bit has become $0$ instead of $1$ and the exponent has become $2^{1-2^{m-1}+1}$ instead of $2^{-2^{m-1}+1}$.

There are other special values, such as infinity, NaN, largest and smallest positive numbers, defined by the IEEE standard, which we will not elaborate here.

### Floating Point Variants

There are other floating point variants, some of which are specialized for machine learning purposes.

Type Sign Exponent Fraction Total
FP16 1 5 10 16
FP32 1 8 23 32
FP64 1 11 52 64
TF32 1 8 10 19
BF16 1 8 7 16
FP8 1 5 2 8

## Floating Point Arithmetic on Processor

The floating point binary arithmetic is actually based on the integer binary arithmetic. Here we assume the arithmetic does not involve special values defined by the IEEE standard.

Suppose we have two floating point values $v_1$ and $v_2$, and the resulting floating point value $v_3$.

\begin{align} v_1 &= (-1)^{b_{m+n}}(1.b_{n-1} b_{n-2} \cdots b_{0})_{2} \times 2^{(b_{m+n-1} b_{m+n-2} \cdots b_{n})_{2}-2^{m-1}+1} \\ v_2 &= (-1)^{b_{m+n}^{\prime}}(1.b_{n-1}^{\prime} b_{n-2}^{\prime} \cdots b_{0}^{\prime})_{2} \times 2^{(b_{m+n-1}^{\prime} b_{m+n-2}^{\prime} \cdots b_{n}^{\prime})_{2}-2^{m-1}+1} \\ v_3 &= (-1)^{b_{m+n}^{\prime\prime}}(1.b_{n-1}^{\prime\prime} b_{n-2}^{\prime\prime} \cdots b_{0}^{\prime\prime})_{2} \times 2^{(b_{m+n-1}^{\prime\prime} b_{m+n-2}^{\prime\prime} \cdots b_{n}^{\prime\prime})_{2}-2^{m-1}+1} \\ \end{align}

Without loss of generality, suppose $b_{m+n} = b_{m+n}^{\prime}$ and $(b_{m+n-1} b_{m+n-2} \cdots b_{n})_{2} \geq (b_{m+n-1}^{\prime} b_{m+n-2}^{\prime} \cdots b_{n}^{\prime})_{2}$ and

$$(k)_{10} = (b_{m+n-1} b_{m+n-2} \cdots b_{n})_{2} - (b_{m+n-1}^{\prime} b_{m+n-2}^{\prime} \cdots b_{n}^{\prime})_{2}$$

Then $v_2$ could be converted a new intermediate representation, possibly with losing precisions,

\begin{align} v_2^{\prime} &= (-1)^{b_{m+n}^{\prime}}(\underbrace{0.00 \cdots 0}_{k}1 b_{n-1}^{\prime} b_{n-2}^{\prime} \cdots b_{k}^{\prime})_{2} \times 2^k \times 2^{(b_{m+n-1}^{\prime} b_{m+n-2}^{\prime} \cdots b_{n}^{\prime})_{2}-2^{m-1}+1} \\ &= (-1)^{b_{m+n}^{\prime}}(\underbrace{0.00 \cdots 0}_{k}1 b_{n-1}^{\prime} b_{n-2}^{\prime} \cdots b_{k}^{\prime})_{2} \times 2^{(b_{m+n-1} b_{m+n-2} \cdots b_{n})_{2}-2^{m-1}+1} \\ &\approx v_2 \\ \end{align}

$$v_3 = v_1 + v_2^{\prime}$$

Then $v_1$ and $v_2$ could perform bitwise addition just like integer bitwise addition for $(1.b_{n-1} b_{n-2} \cdots b_{0})_{2}$ and $(\underbrace{0.00 \cdots 0}_{k}1 b_{n-1}^{\prime} b_{n-2}^{\prime} \cdots b_{k}^{\prime})_{2}$.

$$\begin{gather} (1.b_{n-1}^{\prime\prime} b_{n-2}^{\prime\prime} \cdots b_{0}^{\prime\prime})_{2} = (1.b_{n-1} b_{n-2} \cdots b_{0})_{2} + (\underbrace{0.00 \cdots 0}_{k}1 b_{n-1}^{\prime} b_{n-2}^{\prime} \cdots b_{k}^{\prime})_{2} \\ (b_{m+n-1}^{\prime\prime} b_{m+n-2}^{\prime\prime} \cdots b_{n}^{\prime\prime})_{2} = (b_{m+n-1} b_{m+n-2} \cdots b_{n})_{2} \\ \end{gather}$$

If $(1.b_{n-1} b_{n-2} \cdots b_{0})_{2} + (\underbrace{0.00 \cdots 0}_{k}1 b_{n-1}^{\prime} b_{n-2}^{\prime} \cdots b_{k}^{\prime})_{2} \geq (1)_{2} = (2)_{10}$, the exponent will have to add $(1)_{2}$, i.e.,

$$(b_{m+n-1}^{\prime\prime} b_{m+n-2}^{\prime\prime} \cdots b_{n}^{\prime\prime})_{2} \leftarrow (b_{m+n-1}^{\prime\prime} b_{m+n-2}^{\prime\prime} \cdots b_{n}^{\prime\prime})_{2} + (1)_{2}$$

### Multiplication

\begin{align} v_1^{\prime} &= (-1)^{b_{m+n}}(1b_{n-1} b_{n-2} \cdots b_{0})_{2} \times 2^{-n} \times 2^{(b_{m+n-1} b_{m+n-2} \cdots b_{n})_{2}-2^{m-1}+1} \\ v_2^{\prime} &= (-1)^{b_{m+n}^{\prime}}(1b_{n-1}^{\prime} b_{n-2}^{\prime} \cdots b_{0}^{\prime})_{2} \times 2^{-n} \times 2^{(b_{m+n-1}^{\prime} b_{m+n-2}^{\prime} \cdots b_{n}^{\prime})_{2}-2^{m-1}+1} \\ v_3 &= (-1)^{b_{m+n}^{\prime\prime}}(1b_{n-1}^{\prime\prime} b_{n-2}^{\prime\prime} \cdots b_{0}^{\prime\prime})_{2} \times 2^{-n} \times 2^{(b_{m+n-1}^{\prime\prime} b_{m+n-2}^{\prime\prime} \cdots b_{n}^{\prime\prime})_{2}-2^{m-1}+1} \\ &= (-1)^{b_{m+n}^{\prime\prime}}(\underbrace{00\cdots0}_{n-k} 1b_{n-1}^{\prime\prime} b_{n-2}^{\prime\prime} \cdots b_{0}^{\prime\prime} \underbrace{xx \cdots x}_{k})_{2} \times 2^{-k} \times 2^{-n} \times 2^{(b_{m+n-1}^{\prime\prime} b_{m+n-2}^{\prime\prime} \cdots b_{n}^{\prime\prime})_{2}-2^{m-1}+1 } \\ \end{align}

$$v_3 = v_1^{\prime} \times v_2^{\prime}$$

where $x$ is some unimportant value.

Then $v_1$ and $v_2$ could perform bitwise multiplication just like integer bitwise multiplication for $(1b_{n-1} b_{n-2} \cdots b_{0})_{2}$ and $(1b_{n-1}^{\prime} b_{n-2}^{\prime} \cdots b_{0}^{\prime})_{2}$.

$$(\underbrace{00\cdots0}_{n-k} 1b_{n-1}^{\prime\prime} b_{n-2}^{\prime\prime} \cdots b_{0}^{\prime\prime} \underbrace{xx \cdots x}_{k})_{2} \times 2^{-k} \times 2^{-n} \approx (1b_{n-1} b_{n-2} \cdots b_{0})_{2} \times 2^{-n} \times (1b_{n-1}^{\prime} b_{n-2}^{\prime} \cdots b_{0}^{\prime})_{2} \times 2^{-n}$$

$$(b_{m+n-1}^{\prime\prime} b_{m+n-2}^{\prime\prime} \cdots b_{n}^{\prime\prime})_{2}-2^{m-1}+1 - k - n = (b_{m+n-1} b_{m+n-2} \cdots b_{n})_{2}-2^{m-1}+1 -n + (b_{m+n-1}^{\prime} b_{m+n-2}^{\prime} \cdots b_{n}^{\prime})_{2}-2^{m-1}+1 - n$$

If $(1b_{n-1} b_{n-2} \cdots b_{0})_{2} \times (1b_{n-1}^{\prime} b_{n-2}^{\prime} \cdots b_{0}^{\prime})_{2} \geq 2^{2n+1}$, the exponent will have to add $1$, i.e.,

$$k \leftarrow k + 1$$

### Subtraction and Division

Knowing the addition and multiplication, it is not difficult to derive the arithmetic for subtraction and division.

## References

Floating Point Value Binary Representation

https://leimao.github.io/blog/Floating-Point-Representation/

Lei Mao

03-30-2022

11-28-2022