Lei Mao bio photo

Lei Mao

Machine Learning, Artificial Intelligence, Computer Science.

Twitter Facebook LinkedIn GitHub   G. Scholar E-Mail RSS

Introduction

Suppose we have $4$ unsigned values, $3$, $8$, $15$, and $1$, and we want to save these values as binary values to a file. In a modern computer, the minimal IO bit-width is $8$ bits, i.e., $1$ byte, $1$ byte could represent an unsigned values from $0$ to $255$, and the $4$ unsigned values we want to save are just within the range, so we will need $8 \times 4 = 32$ bits, i.e., $4$ bytes, to store the values to file.


It should be noted that the $4$ unsigned values, $3$, $8$, $15$, and $1$, are in a range from $0$ to $15$, which could be represented by $4$ bits. So this means that it is possible to store the $4$ unsigned values using $4 \times 4 = 16$ bits, i.e., $2$ bytes.


If $4$ unsigned values are in a range from $0$ to $7$, which could be represented by $3$ bits. Ideally, it is possible to store the $4$ unsigned values using $3 \times 4 = 12$ bits, i.e., $1.5$ bytes. However, because the minimal IO bit-width is $1$ byte, we would still need $2$ bytes, instead of $1.5$ bytes, to store the $4$ values.


In this blog post, I would like to discuss how to manipulate arbitrary bits in C++ so that a minimum storage could be achieved given some prior knowledge about the data.

Algorithm

The algorithm used for saving arbitrary bits is almost exactly the same as what we have described in the example above. We concatenate all the bit representations of all the values, fill the end of the concatenates bit representations with $0$s so that the size of the bit representation is divisible by 8. To write the values to hard drive, we save the filled bit representations and the size of valid bit representations as a binary file. To read the values from the binary file, the process is just a reverse of the writing process.

Implementation

With the encode_and_write_values function, we could save a vector of unsigned integers in a small range as a small binary file. With the encode_and_write_values function, we could recover the vector of unsigned integers from theS binary file.

// bits.cpp

#include <algorithm>
#include <bitset>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

/**
 * @brief Operator overloading for printing vectors
 * @tparam T
 * @param os
 * @param v
 * @return std::ostream&
 */
template <typename T>
std::ostream& operator<<(std::ostream& os, const std::vector<T>& v)
{
    os << "[";
    for (int i = 0; i < v.size(); ++i)
    {
        os << v[i];
        if (i != v.size() - 1)
        {
            os << ", ";
        }
    }
    os << "]";
    return os;
}

/**
 * @brief Get the bit representation of a value.
 * We used an O(logN) algorithm here.
 * @param value
 * @param num_bits
 * @param offset offset to the first bit.
 * @return std::vector<bool>
 */
template <typename T>
std::vector<bool> get_bits(T value, unsigned int num_bits, unsigned int offset)
{
    std::vector<bool> bits;
    size_t value_bits = sizeof(value) * 8;
    assert(
        ("Getting bits outside the value.", num_bits + offset <= value_bits));
    for (int i = 0; i < num_bits; i++)
    {
        bool bit = (value >> (value_bits - 1 - offset - i)) & 1U;
        bits.push_back(bit);
    }
    return bits;
}

/**
 * @brief Get the value from a bit representation.
 *
 * @tparam T
 * @param bits
 * @param num_bits
 * @param offset
 * @return T
 */
template <typename T>
T get_value(const std::vector<bool>& bits, unsigned int num_bits,
            unsigned int offset)
{
    T value = 0;
    for (int i = 0; i < num_bits; i++)
    {
        bool bit = bits.at(i);
        unsigned int shift = sizeof(T) - 1 - offset - i;
        if (bit == true)
        {
            value |= (1U << shift);
        }
    }
    return value;
}

/**
 * @brief Convert values to bitsets and append to the existing bitsets.
 *
 * @param bitsets
 * @param values
 * @param bitset_size
 */
template <typename T>
void append_bitset(std::vector<bool>& bitsets, const std::vector<T>& values,
                   size_t bitset_size)
{
    for (int i = 0; i < values.size(); i++)
    {
        T integer = values.at(i);
        std::vector<bool> bits =
            get_bits(integer, bitset_size, sizeof(integer) * 8 - bitset_size);
        bitsets.insert(bitsets.end(), bits.begin(), bits.end());
    }
}

/**
 * @brief Convert bits to values and append to the existing values.
 *
 * @tparam T
 * @param values
 * @param bits
 * @param bitset_size
 */
template <typename T>
void append_values(std::vector<T>& values, const std::vector<bool>& bits,
                   size_t bitset_size)
{
    size_t num_values = bits.size() / bitset_size;
    unsigned int offset = sizeof(T) - bitset_size;
    for (int i = 0; i < num_values; i++)
    {
        std::vector<bool> value_bits;
        for (int j = 0; j < bitset_size; j++)
        {
            int k = i * bitset_size + j;
            value_bits.push_back(bits.at(k));
        }

        T value = get_value<T>(value_bits, bitset_size, offset);
        values.push_back(value);
    }
}

/**
 * @brief Fill the end of the bitsets with zeros so that the size is divisible
 * by 8.
 *
 * @param bitsets
 */
void fill_bitset(std::vector<bool>& bitsets)
{
    // If the size of bitsets is not divisible by 8,
    // Append 0s to the end to make the size of bitsets divisible by 8.
    size_t original_size = bitsets.size();
    for (int i = 0; i < (8 - original_size % 8) % 8; i++)
    {
        bitsets.push_back(0);
    }
}

/**
 * @brief Write bitsets to a binary file.
 *
 * @param bitsets
 * @param filepath
 */
void write_bitsets(const std::vector<bool>& bitsets, uint32_t num_valid_bits,
                   const std::string& filepath)
{
    assert(("bitsets size should be divisible by 8.", bitsets.size() % 8 == 0));
    std::fstream fhand;
    // trunc will clear the file
    fhand.open(filepath, fhand.binary | fhand.trunc | fhand.out);
    if (!fhand.is_open())
    {
        std::cerr << "Failed to open " << filepath << std::endl;
    }
    else
    {
        fhand.write(reinterpret_cast<char*>(&num_valid_bits),
                    sizeof(num_valid_bits));
        size_t num_bytes = bitsets.size() / 8;
        for (int i = 0; i < num_bytes; i++)
        {
            // ch: 00000000
            char ch = 0;
            for (int j = 0; j < 8; j++)
            {
                int k = i * 8 + j;
                ch |= (bitsets.at(k) << (8 - j - 1));
            }
            fhand.write(&ch, sizeof(ch));
        }
    }
    fhand.close();
}

/**
 * @brief Read bitsets from a binary file.
 *
 * @param bitsets
 * @param filepath
 */
const std::vector<bool> read_bitsets(const std::string& filepath)
{
    std::fstream fhand;
    std::vector<bool> bitsets;
    uint32_t num_valid_bits = 0;
    fhand.open(filepath, fhand.binary | fhand.in);
    if (!fhand.is_open())
    {
        std::cerr << "Failed to open " << filepath << std::endl;
    }
    else
    {
        fhand.read(reinterpret_cast<char*>(&num_valid_bits),
                   sizeof(num_valid_bits));
        while (fhand.peek() != EOF)
        {
            char ch;
            fhand.read(&ch, sizeof(ch));
            std::vector<bool> bits = get_bits(ch, sizeof(ch) * 8, 0);
            bitsets.insert(bitsets.end(), bits.begin(), bits.end());
        }
    }
    std::vector<bool> valid_bitsets{bitsets.begin(),
                                    bitsets.begin() + num_valid_bits};
    return valid_bitsets;
}

/**
 * @brief Get the size of a file.
 *
 * @param filepath
 * @return const long
 */
const long get_filesize(const std::string& filepath)
{
    long filesize = 0;
    std::fstream fhand;
    fhand.open(filepath, fhand.binary | fhand.in);
    if (!fhand.is_open())
    {
        std::cerr << "Failed to open " << filepath << std::endl;
    }
    else
    {
        fhand.seekg(0, std::ios::end);
        filesize = fhand.tellg();
    }
    return filesize;
}

/**
 * @brief Generate random unsigned integers in a range of [0, 2^{bitset_size}).
 *
 * @param bitset_size
 * @param num_integers
 * @param seed
 * @return std::vector<uint32_t>
 */
std::vector<uint32_t>
generate_random_unsigned_integers(size_t bitset_size, unsigned int num_integers,
                                  unsigned seed)
{
    std::srand(seed);
    std::vector<uint32_t> unsigned_integers;
    const uint32_t maxInt = ((1 << bitset_size) - 1);
    for (int i = 0; i < num_integers; i++)
    {
        uint32_t random_integer = std::rand() % maxInt;
        unsigned_integers.push_back(random_integer);
    }
    return unsigned_integers;
}

/**
 * @brief Encode and write uint32_t values to file.
 *
 * @param unsigned_integers
 * @param bitset_size The number of bits to represent the uint32_t value.
 * @param filepath
 */
void encode_and_write_values(const std::vector<uint32_t>& unsigned_integers,
                             size_t bitset_size, const std::string& filepath)
{
    std::vector<bool> bitsets;
    append_bitset(bitsets, unsigned_integers, bitset_size);
    uint32_t num_valid_bits = bitsets.size();
    fill_bitset(bitsets);
    write_bitsets(bitsets, num_valid_bits, filepath);
}

/**
 * @brief Read and decode uint32_t values to file.
 *
 * @param bitset_size The number of bits to represent the uint32_t value.
 * @param filepath
 */
std::vector<uint32_t> read_and_decode_values(size_t bitset_size,
                                             const std::string& filepath)
{
    std::vector<bool> bitsets_recovered;
    std::vector<uint32_t> unsigned_integers_recovered;
    bitsets_recovered = read_bitsets(filepath);
    append_values(unsigned_integers_recovered, bitsets_recovered, bitset_size);

    return unsigned_integers_recovered;
}

/**
 * @brief Test the implementation.
 *
 * @param bitset_size
 * @param seed
 * @param filepath
 */
void test(size_t bitset_size, unsigned seed, const std::string& filepath)
{
    const unsigned int num_integers{100};
    std::vector<uint32_t> unsigned_integers =
        generate_random_unsigned_integers(bitset_size, num_integers, seed);
    std::vector<uint32_t> unsigned_integers_recovered;
    const long expected_filesize =
        sizeof(uint32_t) +
        std::ceil(static_cast<float>(bitset_size * num_integers) /
                  8); // Unit: byte.
    long actual_filesize = 0;

    // We could have used std::bitset which makes our life easier.
    // However, the size for std::bitset object is only known at runtime.
    // According to https://en.cppreference.com/w/cpp/utility/bitset
    // If the size of the bitset is not known at compile time, std::vector<bool>
    // or boost::dynamic_bitset may be used. We avoid using Boost library and
    // use C++ STL here.
    std::vector<bool> bitsets;
    std::vector<bool> bitsets_recovered;

    append_bitset(bitsets, unsigned_integers, bitset_size);
    uint32_t num_valid_bits = bitsets.size();
    fill_bitset(bitsets);
    write_bitsets(bitsets, num_valid_bits, filepath);

    bitsets_recovered = read_bitsets(filepath);
    append_values(unsigned_integers_recovered, bitsets_recovered, bitset_size);
    assert(
        ("Integer recovery failed.",
         (std::equal(unsigned_integers.begin(),
                     unsigned_integers.begin() + unsigned_integers.size(),
                     unsigned_integers_recovered.begin()) &&
          (unsigned_integers.size() == unsigned_integers_recovered.size()))));
    actual_filesize = get_filesize(filepath);
    assert(("Binary file size does not match expectation.",
            expected_filesize == actual_filesize));
}

int main()
{
    const unsigned int seed{0};
    const unsigned int num_integers{100};
    const std::string filepath{"./values.bin"};
    const std::string temp_filepath{"./values_temp.bin"};
    size_t bitset_size = 3;

    std::vector<uint32_t> unsigned_integers =
        generate_random_unsigned_integers(bitset_size, num_integers, seed);
    std::cout << "Unsigned integers to encode and save: " << std::endl;
    std::cout << unsigned_integers << std::endl;
    encode_and_write_values(unsigned_integers, bitset_size, filepath);
    std::vector<uint32_t> unsigned_integers_recovered =
        read_and_decode_values(bitset_size, filepath);
    std::cout << "Unsigned integers loaded and decoded: " << std::endl;
    std::cout << unsigned_integers_recovered << std::endl;

    // Test bitset_size from 1 to 31.
    for (bitset_size = 1; bitset_size <= 31; bitset_size++)
    {
        test(bitset_size, seed, temp_filepath);
    }
}

To build the program, please run the following command in the terminal.

$ g++ -std=c++11 bits.cpp -o bits

From the printouts, we could see the for a vector of unsigned integers in a range of $[0, 7]$, we could save and load the values correctly using $3$ bits per value.

$ ./bits 
Unsigned integers to encode and save: 
[1, 4, 2, 5, 1, 3, 3, 2, 1, 3, 2, 5, 6, 4, 6, 0, 3, 1, 2, 2, 3, 0, 5, 2, 2, 5, 5, 6, 1, 5, 5, 0, 3, 5, 3, 2, 1, 0, 2, 2, 1, 4, 5, 0, 6, 3, 6, 2, 2, 1, 3, 5, 0, 1, 1, 2, 4, 4, 1, 3, 2, 6, 4, 5, 3, 5, 5, 4, 3, 0, 5, 3, 2, 3, 3, 2, 4, 0, 2, 0, 0, 3, 5, 0, 4, 4, 0, 2, 1, 1, 5, 4, 5, 0, 0, 1, 4, 6, 4, 0]
Unsigned integers loaded and decoded: 
[1, 4, 2, 5, 1, 3, 3, 2, 1, 3, 2, 5, 6, 4, 6, 0, 3, 1, 2, 2, 3, 0, 5, 2, 2, 5, 5, 6, 1, 5, 5, 0, 3, 5, 3, 2, 1, 0, 2, 2, 1, 4, 5, 0, 6, 3, 6, 2, 2, 1, 3, 5, 0, 1, 1, 2, 4, 4, 1, 3, 2, 6, 4, 5, 3, 5, 5, 4, 3, 0, 5, 3, 2, 3, 3, 2, 4, 0, 2, 0, 0, 3, 5, 0, 4, 4, 0, 2, 1, 1, 5, 4, 5, 0, 0, 1, 4, 6, 4, 0]

Also note that all the scenarios where the unsigned integers are in a range of $[0, 2^{n}-1]$, where $n = 1, 2, \cdots, 31$, have been tested.


The number of bytes required for saving all the values are $\lceil4 + 3 \times 100 / 8\rceil = 42$. Note that $4$ is a uint32_t value indicating the number of valid bits in the binary file. If we save all the unsigned integers as uint8_t values, each value will take $1$ byte, i.e., $8$ bits. In this case, we saved $5$ bits per value, and $100 \times 1 = 100$ bytes to save all the values.

$ ls -lh values.bin 
-rw-rw-r-- 1 leimao leimao 42 Jan  4 08:49 values.bin

Therefore, using the algorithm, for the unsigned integers in a range of $[0, 2^{n}-1]$, we could only use $n$ bits, instead of the multiples of bytes, to save one value.

Caveats

Since we only care about the size of the binary file that contains the information of the values, the implementation that I created for encoding and decoding values might not be the most efficient ones.

Final Remarks

We don’t need useless bits to encode the values in a binary file.

References