Read and Write Arbitrary Bits in C++

Introduction

Suppose we have $4$ unsigned values, $3$, $8$, $15$, and $1$, and we want to save these values as binary values to a file. In a modern computer, the minimal IO bit-width is $8$ bits, i.e., $1$ byte, $1$ byte could represent an unsigned values from $0$ to $255$, and the $4$ unsigned values we want to save are just within the range, so we will need $8 \times 4 = 32$ bits, i.e., $4$ bytes, to store the values to file.

It should be noted that the $4$ unsigned values, $3$, $8$, $15$, and $1$, are in a range from $0$ to $15$, which could be represented by $4$ bits. So this means that it is possible to store the $4$ unsigned values using $4 \times 4 = 16$ bits, i.e., $2$ bytes.

If $4$ unsigned values are in a range from $0$ to $7$, which could be represented by $3$ bits. Ideally, it is possible to store the $4$ unsigned values using $3 \times 4 = 12$ bits, i.e., $1.5$ bytes. However, because the minimal IO bit-width is $1$ byte, we would still need $2$ bytes, instead of $1.5$ bytes, to store the $4$ values.

In this blog post, I would like to discuss how to manipulate arbitrary bits in C++ so that a minimum storage could be achieved given some prior knowledge about the data.

Algorithm

The algorithm used for saving arbitrary bits is almost exactly the same as what we have described in the example above. We concatenate all the bit representations of all the values, fill the end of the concatenates bit representations with $0$s so that the size of the bit representation is divisible by 8. To write the values to hard drive, we save the filled bit representations and the size of valid bit representations as a binary file. To read the values from the binary file, the process is just a reverse of the writing process.

Implementation

With the encode_and_write_values function, we could save a vector of unsigned integers in a small range as a small binary file. With the encode_and_write_values function, we could recover the vector of unsigned integers from the binary file.

bits.cpp
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
#include <algorithm>
#include <bitset>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

/**
* @brief Operator overloading for printing vectors
* @tparam T
* @param os
* @param v
* @return std::ostream&
*/
template <typename T>
std::ostream& operator<<(std::ostream& os, const std::vector<T>& v)
{
os << "[";
for (int i = 0; i < v.size(); ++i)
{
os << v[i];
if (i != v.size() - 1)
{
os << ", ";
}
}
os << "]";
return os;
}

/**
* @brief Get the bit representation of a value.
* We used an O(logN) algorithm here.
* @param value
* @param num_bits
* @param offset offset to the first bit.
* @return std::vector<bool>
*/
template <typename T>
std::vector<bool> get_bits(T value, unsigned int num_bits, unsigned int offset)
{
std::vector<bool> bits;
size_t value_bits = sizeof(value) * 8;
assert(
("Getting bits outside the value.", num_bits + offset <= value_bits));
for (int i = 0; i < num_bits; i++)
{
bool bit = (value >> (value_bits - 1 - offset - i)) & 1U;
bits.push_back(bit);
}
return bits;
}

/**
* @brief Get the value from a bit representation.
*
* @tparam T
* @param bits
* @param num_bits
* @param offset
* @return T
*/
template <typename T>
T get_value(const std::vector<bool>& bits, unsigned int num_bits,
unsigned int offset)
{
T value = 0;
for (int i = 0; i < num_bits; i++)
{
bool bit = bits.at(i);
unsigned int shift = sizeof(T) - 1 - offset - i;
if (bit == true)
{
value |= (1U << shift);
}
}
return value;
}

/**
* @brief Convert values to bitsets and append to the existing bitsets.
*
* @param bitsets
* @param values
* @param bitset_size
*/
template <typename T>
void append_bitset(std::vector<bool>& bitsets, const std::vector<T>& values,
size_t bitset_size)
{
for (int i = 0; i < values.size(); i++)
{
T integer = values.at(i);
std::vector<bool> bits =
get_bits(integer, bitset_size, sizeof(integer) * 8 - bitset_size);
bitsets.insert(bitsets.end(), bits.begin(), bits.end());
}
}

/**
* @brief Convert bits to values and append to the existing values.
*
* @tparam T
* @param values
* @param bits
* @param bitset_size
*/
template <typename T>
void append_values(std::vector<T>& values, const std::vector<bool>& bits,
size_t bitset_size)
{
size_t num_values = bits.size() / bitset_size;
unsigned int offset = sizeof(T) - bitset_size;
for (int i = 0; i < num_values; i++)
{
std::vector<bool> value_bits;
for (int j = 0; j < bitset_size; j++)
{
int k = i * bitset_size + j;
value_bits.push_back(bits.at(k));
}

T value = get_value<T>(value_bits, bitset_size, offset);
values.push_back(value);
}
}

/**
* @brief Fill the end of the bitsets with zeros so that the size is divisible
* by 8.
*
* @param bitsets
*/
void fill_bitset(std::vector<bool>& bitsets)
{
// If the size of bitsets is not divisible by 8,
// Append 0s to the end to make the size of bitsets divisible by 8.
size_t original_size = bitsets.size();
for (int i = 0; i < (8 - original_size % 8) % 8; i++)
{
bitsets.push_back(0);
}
}

/**
* @brief Write bitsets to a binary file.
*
* @param bitsets
* @param filepath
*/
void write_bitsets(const std::vector<bool>& bitsets, uint32_t num_valid_bits,
const std::string& filepath)
{
assert(("bitsets size should be divisible by 8.", bitsets.size() % 8 == 0));
std::fstream fhand;
// trunc will clear the file
fhand.open(filepath, fhand.binary | fhand.trunc | fhand.out);
if (!fhand.is_open())
{
std::cerr << "Failed to open " << filepath << std::endl;
}
else
{
fhand.write(reinterpret_cast<char*>(&num_valid_bits),
sizeof(num_valid_bits));
size_t num_bytes = bitsets.size() / 8;
for (int i = 0; i < num_bytes; i++)
{
// ch: 00000000
char ch = 0;
for (int j = 0; j < 8; j++)
{
int k = i * 8 + j;
ch |= (bitsets.at(k) << (8 - j - 1));
}
fhand.write(&ch, sizeof(ch));
}
}
fhand.close();
}

/**
* @brief Read bitsets from a binary file.
*
* @param bitsets
* @param filepath
*/
const std::vector<bool> read_bitsets(const std::string& filepath)
{
std::fstream fhand;
std::vector<bool> bitsets;
uint32_t num_valid_bits = 0;
fhand.open(filepath, fhand.binary | fhand.in);
if (!fhand.is_open())
{
std::cerr << "Failed to open " << filepath << std::endl;
}
else
{
fhand.read(reinterpret_cast<char*>(&num_valid_bits),
sizeof(num_valid_bits));
while (fhand.peek() != EOF)
{
char ch;
fhand.read(&ch, sizeof(ch));
std::vector<bool> bits = get_bits(ch, sizeof(ch) * 8, 0);
bitsets.insert(bitsets.end(), bits.begin(), bits.end());
}
}
std::vector<bool> valid_bitsets{bitsets.begin(),
bitsets.begin() + num_valid_bits};
return valid_bitsets;
}

/**
* @brief Get the size of a file.
*
* @param filepath
* @return const long
*/
const long get_filesize(const std::string& filepath)
{
long filesize = 0;
std::fstream fhand;
fhand.open(filepath, fhand.binary | fhand.in);
if (!fhand.is_open())
{
std::cerr << "Failed to open " << filepath << std::endl;
}
else
{
fhand.seekg(0, std::ios::end);
filesize = fhand.tellg();
}
return filesize;
}

/**
* @brief Generate random unsigned integers in a range of [0, 2^{bitset_size}).
*
* @param bitset_size
* @param num_integers
* @param seed
* @return std::vector<uint32_t>
*/
std::vector<uint32_t>
generate_random_unsigned_integers(size_t bitset_size, unsigned int num_integers,
unsigned seed)
{
std::srand(seed);
std::vector<uint32_t> unsigned_integers;
const uint32_t maxInt = ((1 << bitset_size) - 1);
for (int i = 0; i < num_integers; i++)
{
uint32_t random_integer = std::rand() % maxInt;
unsigned_integers.push_back(random_integer);
}
return unsigned_integers;
}

/**
* @brief Encode and write uint32_t values to file.
*
* @param unsigned_integers
* @param bitset_size The number of bits to represent the uint32_t value.
* @param filepath
*/
void encode_and_write_values(const std::vector<uint32_t>& unsigned_integers,
size_t bitset_size, const std::string& filepath)
{
std::vector<bool> bitsets;
append_bitset(bitsets, unsigned_integers, bitset_size);
uint32_t num_valid_bits = bitsets.size();
fill_bitset(bitsets);
write_bitsets(bitsets, num_valid_bits, filepath);
}

/**
* @brief Read and decode uint32_t values to file.
*
* @param bitset_size The number of bits to represent the uint32_t value.
* @param filepath
*/
std::vector<uint32_t> read_and_decode_values(size_t bitset_size,
const std::string& filepath)
{
std::vector<bool> bitsets_recovered;
std::vector<uint32_t> unsigned_integers_recovered;
bitsets_recovered = read_bitsets(filepath);
append_values(unsigned_integers_recovered, bitsets_recovered, bitset_size);

return unsigned_integers_recovered;
}

/**
* @brief Test the implementation.
*
* @param bitset_size
* @param seed
* @param filepath
*/
void test(size_t bitset_size, unsigned seed, const std::string& filepath)
{
const unsigned int num_integers{100};
std::vector<uint32_t> unsigned_integers =
generate_random_unsigned_integers(bitset_size, num_integers, seed);
std::vector<uint32_t> unsigned_integers_recovered;
const long expected_filesize =
sizeof(uint32_t) +
std::ceil(static_cast<float>(bitset_size * num_integers) /
8); // Unit: byte.
long actual_filesize = 0;

// We could have used std::bitset which makes our life easier.
// However, the size for std::bitset object is only known at runtime.
// According to https://en.cppreference.com/w/cpp/utility/bitset
// If the size of the bitset is not known at compile time, std::vector<bool>
// or boost::dynamic_bitset may be used. We avoid using Boost library and
// use C++ STL here.
std::vector<bool> bitsets;
std::vector<bool> bitsets_recovered;

append_bitset(bitsets, unsigned_integers, bitset_size);
uint32_t num_valid_bits = bitsets.size();
fill_bitset(bitsets);
write_bitsets(bitsets, num_valid_bits, filepath);

bitsets_recovered = read_bitsets(filepath);
append_values(unsigned_integers_recovered, bitsets_recovered, bitset_size);
assert(
("Integer recovery failed.",
(std::equal(unsigned_integers.begin(),
unsigned_integers.begin() + unsigned_integers.size(),
unsigned_integers_recovered.begin()) &&
(unsigned_integers.size() == unsigned_integers_recovered.size()))));
actual_filesize = get_filesize(filepath);
assert(("Binary file size does not match expectation.",
expected_filesize == actual_filesize));
}

int main()
{
const unsigned int seed{0};
const unsigned int num_integers{100};
const std::string filepath{"./values.bin"};
const std::string temp_filepath{"./values_temp.bin"};
size_t bitset_size = 3;

std::vector<uint32_t> unsigned_integers =
generate_random_unsigned_integers(bitset_size, num_integers, seed);
std::cout << "Unsigned integers to encode and save: " << std::endl;
std::cout << unsigned_integers << std::endl;
encode_and_write_values(unsigned_integers, bitset_size, filepath);
std::vector<uint32_t> unsigned_integers_recovered =
read_and_decode_values(bitset_size, filepath);
std::cout << "Unsigned integers loaded and decoded: " << std::endl;
std::cout << unsigned_integers_recovered << std::endl;

// Test bitset_size from 1 to 31.
for (bitset_size = 1; bitset_size <= 31; bitset_size++)
{
test(bitset_size, seed, temp_filepath);
}
}

To build the program, please run the following command in the terminal.

1
$ g++ -std=c++11 bits.cpp -o bits

From the printouts, we could see the for a vector of unsigned integers in a range of $[0, 7]$, we could save and load the values correctly using $3$ bits per value.

1
2
3
4
5
$ ./bits 
Unsigned integers to encode and save:
[1, 4, 2, 5, 1, 3, 3, 2, 1, 3, 2, 5, 6, 4, 6, 0, 3, 1, 2, 2, 3, 0, 5, 2, 2, 5, 5, 6, 1, 5, 5, 0, 3, 5, 3, 2, 1, 0, 2, 2, 1, 4, 5, 0, 6, 3, 6, 2, 2, 1, 3, 5, 0, 1, 1, 2, 4, 4, 1, 3, 2, 6, 4, 5, 3, 5, 5, 4, 3, 0, 5, 3, 2, 3, 3, 2, 4, 0, 2, 0, 0, 3, 5, 0, 4, 4, 0, 2, 1, 1, 5, 4, 5, 0, 0, 1, 4, 6, 4, 0]
Unsigned integers loaded and decoded:
[1, 4, 2, 5, 1, 3, 3, 2, 1, 3, 2, 5, 6, 4, 6, 0, 3, 1, 2, 2, 3, 0, 5, 2, 2, 5, 5, 6, 1, 5, 5, 0, 3, 5, 3, 2, 1, 0, 2, 2, 1, 4, 5, 0, 6, 3, 6, 2, 2, 1, 3, 5, 0, 1, 1, 2, 4, 4, 1, 3, 2, 6, 4, 5, 3, 5, 5, 4, 3, 0, 5, 3, 2, 3, 3, 2, 4, 0, 2, 0, 0, 3, 5, 0, 4, 4, 0, 2, 1, 1, 5, 4, 5, 0, 0, 1, 4, 6, 4, 0]

Also note that all the scenarios where the unsigned integers are in a range of $[0, 2^{n}-1]$, where $n = 1, 2, \cdots, 31$, have been tested.

The number of bytes required for saving all the values are $\lceil4 + 3 \times 100 / 8\rceil = 42$. Note that $4$ is a uint32_t value indicating the number of valid bits in the binary file. If we save all the unsigned integers as uint8_t values, each value will take $1$ byte, i.e., $8$ bits, and we don’t need an additional value to indicate the number of valid bits. In this case, we will use $1$ byte per value, and $100 \times 1 = 100$ bytes to save all the values.

1
2
$ ls -lh values.bin 
-rw-rw-r-- 1 leimao leimao 42 Jan 4 08:49 values.bin

Conclusion

Therefore, using the algorithm, for the unsigned integers in a range of $[0, 2^{n}-1]$, we could only use $n$ bits, instead of the multiples of bytes, to save one value. The resulting size of the binary file will be smaller.

Caveats

Since we only care about the size of the binary file that contains the information of the values, the implementation that I created for encoding and decoding values might not be the most efficient ones.

Final Remarks

We don’t need useless bits to encode the values in a binary file.

References

Author

Lei Mao

Posted on

01-05-2021

Updated on

01-05-2021

Licensed under


Comments