Read and Write Arbitrary Bits in C++
Introduction
Suppose we have $4$ unsigned values, $3$, $8$, $15$, and $1$, and we want to save these values as binary values to a file. In a modern computer, the minimal IO bit-width is $8$ bits, i.e., $1$ byte, $1$ byte could represent an unsigned values from $0$ to $255$, and the $4$ unsigned values we want to save are just within the range, so we will need $8 \times 4 = 32$ bits, i.e., $4$ bytes, to store the values to file.
It should be noted that the $4$ unsigned values, $3$, $8$, $15$, and $1$, are in a range from $0$ to $15$, which could be represented by $4$ bits. So this means that it is possible to store the $4$ unsigned values using $4 \times 4 = 16$ bits, i.e., $2$ bytes.
If $4$ unsigned values are in a range from $0$ to $7$, which could be represented by $3$ bits. Ideally, it is possible to store the $4$ unsigned values using $3 \times 4 = 12$ bits, i.e., $1.5$ bytes. However, because the minimal IO bit-width is $1$ byte, we would still need $2$ bytes, instead of $1.5$ bytes, to store the $4$ values.
In this blog post, I would like to discuss how to manipulate arbitrary bits in C++ so that a minimum storage could be achieved given some prior knowledge about the data.
Algorithm
The algorithm used for saving arbitrary bits is almost exactly the same as what we have described in the example above. We concatenate all the bit representations of all the values, fill the end of the concatenates bit representations with $0$s so that the size of the bit representation is divisible by 8. To write the values to hard drive, we save the filled bit representations and the size of valid bit representations as a binary file. To read the values from the binary file, the process is just a reverse of the writing process.
Implementation
With the encode_and_write_values
function, we could save a vector of unsigned integers in a small range as a small binary file. With the encode_and_write_values
function, we could recover the vector of unsigned integers from the binary file.
1 |
|
To build the program, please run the following command in the terminal.
1 | $ g++ -std=c++11 bits.cpp -o bits |
From the printouts, we could see the for a vector of unsigned integers in a range of $[0, 7]$, we could save and load the values correctly using $3$ bits per value.
1 | $ ./bits |
Also note that all the scenarios where the unsigned integers are in a range of $[0, 2^{n}-1]$, where $n = 1, 2, \cdots, 31$, have been tested.
The number of bytes required for saving all the values are $\lceil4 + 3 \times 100 / 8\rceil = 42$. Note that $4$ is a uint32_t
value indicating the number of valid bits in the binary file. If we save all the unsigned integers as uint8_t
values, each value will take $1$ byte, i.e., $8$ bits, and we don’t need an additional value to indicate the number of valid bits. In this case, we will use $1$ byte per value, and $100 \times 1 = 100$ bytes to save all the values.
1 | $ ls -lh values.bin |
Conclusion
Therefore, using the algorithm, for the unsigned integers in a range of $[0, 2^{n}-1]$, we could only use $n$ bits, instead of the multiples of bytes, to save one value. The resulting size of the binary file will be smaller.
Caveats
Since we only care about the size of the binary file that contains the information of the values, the implementation that I created for encoding and decoding values might not be the most efficient ones.
Final Remarks
We don’t need useless bits to encode the values in a binary file.
References
Read and Write Arbitrary Bits in C++
https://leimao.github.io/blog/CPP-Read-Write-Arbitrary-Bits/