Binary VS Text Mode for File I/O Operations

Introduction

When we read or write files in a program, there are usually two modes to choose from: text mode, which is usually the default, and binary mode. In text mode, the program writes data to the file as a sequence of characters, and in binary mode, the program writes the raw bytes that represent the data in memory. While it sounds trivial to distinguish the two modes, people sometimes get confused. Since the computer only reads and writes binary data, where does text mode come from?

In this blog post, I am going to talk about the conceptual difference between text mode and binary mode, and discuss some caveats of using them.

Example

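We have a signed int -10000, an unsigned short 100, and a C string WE. Their binary representations on a typical 64-bit computer, where an int takes 4 bytes and a short takes 2 bytes, are as follows. For the two numbers, we also show the representation of the same value written as text, i.e., a std::string holding its decimal characters.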

signed int -10000:

11111111 11111111 11011000 11110000

std::string -10000:

00101101 00110001 00110000 00110000 00110000 00110000

unsigned short 100:

00000000 01100100

std::string 100:

00110001 00110000 00110000

C string WE:

01010111 01000101 00000000

Note that a C string always ends with a null terminator, a byte whose value is 0.

When we save the three values to file, the binary sequence of the file is simply a concatenation of all the values.
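The bit patterns listed above can be reproduced with a few lines of C++. The following is a minimal sketch rather than part of the original examples; the helper names bitsOfInt32 and bitsOfText are made up for illustration. It prints the bits of a value with the most significant bit first, so the result does not depend on the byte order of the machine.

```cpp
#include <bitset>
#include <cstdint>
#include <string>

// Hypothetical helper: bits of a 32-bit integer value, most significant bit first.
// This shows the two's-complement pattern of the value, not its byte order in memory.
std::string bitsOfInt32(std::int32_t value)
{
    return std::bitset<32>(static_cast<std::uint32_t>(value)).to_string();
}

// Hypothetical helper: bits of each character of a string,
// one space-separated 8-bit group per character.
std::string bitsOfText(const std::string& text)
{
    std::string out;
    for (unsigned char ch : text)
    {
        if (!out.empty())
        {
            out += ' ';
        }
        out += std::bitset<8>(ch).to_string();
    }
    return out;
}
```

Calling bitsOfInt32(-10000) gives the 32 bits of the signed int, while bitsOfText("-10000") gives six 8-bit groups, one for each character of the text form.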

Binary Mode

Binary mode is very easy to understand. Every piece of data on the computer is represented as a binary sequence in memory or on the hard drive.

Writing File

To save the data in binary mode, we simply take the exact binary sequence representing the data, and save it to the file. Nothing fancy.

Reading File

Because the saved file has no knowledge about the data structure of its content, to read the data saved in binary mode, the users would need to implement the decoding method themselves.

Expected Output of the Example Using Binary Mode

11111111 11111111 11011000 11110000 00000000 01100100 01010111 01000101 00000000

When the computer sees the binary sequence from the binary file, it has no clue how to decode it back to the original values. It is the user's responsibility to tell the computer that the first 4 bytes represent a signed int, the next 2 bytes represent an unsigned short, and the next 3 bytes represent a C string, so that the computer knows how to decode them.
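To make the idea of "telling the computer the layout" concrete, here is a minimal in-memory sketch. The Record struct and the encodeRecord and decodeRecord functions are hypothetical names introduced only for illustration. The sketch assumes the encoder and the decoder run on the same platform and agree on the field order, and it does no bounds checking.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical record type mirroring the three example values.
struct Record
{
    int a;
    unsigned short b;
    std::string c;
};

// Append the raw in-memory bytes of each field, in a fixed order.
std::vector<char> encodeRecord(const Record& r)
{
    std::vector<char> bytes;
    const char* pa = reinterpret_cast<const char*>(&r.a);
    bytes.insert(bytes.end(), pa, pa + sizeof(r.a));
    const char* pb = reinterpret_cast<const char*>(&r.b);
    bytes.insert(bytes.end(), pb, pb + sizeof(r.b));
    // Copy the characters plus the terminating 0 byte, as in the example.
    const char* pc = r.c.c_str();
    bytes.insert(bytes.end(), pc, pc + r.c.size() + 1);
    return bytes;
}

// Decode by walking the buffer with the same layout the encoder used.
// Assumption: the buffer was produced by encodeRecord on the same platform.
Record decodeRecord(const std::vector<char>& bytes)
{
    Record r{};
    std::size_t offset = 0;
    std::memcpy(&r.a, bytes.data() + offset, sizeof(r.a));
    offset += sizeof(r.a);
    std::memcpy(&r.b, bytes.data() + offset, sizeof(r.b));
    offset += sizeof(r.b);
    // The remaining bytes are a 0-terminated C string.
    r.c = std::string(bytes.data() + offset);
    return r;
}
```

Nothing in the byte buffer says where one field ends and the next begins. The offsets in decodeRecord carry that knowledge, which is exactly what the user has to supply.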

Code Example

We implemented a code example binaryIO.cpp for saving data in binary mode.

#include <bitset>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

// Print every byte of the file as a group of 8 bits.
void printBitSequenceFromFile(const std::string& filename)
{
    std::fstream fhand(filename, std::ios::binary | std::ios::in);
    // One char is exactly one byte
    char c;
    // read() fails once the end of the file is reached, which ends the loop
    while (fhand.read(&c, sizeof(c)))
    {
        std::cout << std::bitset<8>(static_cast<unsigned char>(c)) << " ";
    }
    std::cout << std::endl;
    fhand.close();
}

int main()
{
    std::string filename = "data.bin";
    signed int a = -10000;
    unsigned short b = 100;
    const char c[] = "WE";
    // sizeof counts the null terminator, so str_size is 3
    const std::size_t str_size = sizeof(c);

    std::cout << "Encoding values:" << std::endl;
    std::cout << a << std::endl;
    std::cout << b << std::endl;
    std::cout << c << std::endl;

    std::fstream fhand;
    // trunc will clear the file
    fhand.open(filename, std::ios::binary | std::ios::trunc | std::ios::out);

    // Write the raw in-memory bytes of each value
    fhand.write(reinterpret_cast<const char*>(&a), sizeof(a));
    fhand.write(reinterpret_cast<const char*>(&b), sizeof(b));
    fhand.write(c, str_size);

    fhand.close();

    std::cout << "Bit sequence in the file: " << std::endl;
    printBitSequenceFromFile(filename);

    signed int d;
    unsigned short e;
    char f[str_size];

    fhand.open(filename, std::ios::binary | std::ios::in);

    // Read the bytes back in the same order and with the same sizes
    fhand.read(reinterpret_cast<char*>(&d), sizeof(d));
    fhand.read(reinterpret_cast<char*>(&e), sizeof(e));
    fhand.read(f, str_size);
    std::cout << "Decoded values:" << std::endl;
    std::cout << d << std::endl;
    std::cout << e << std::endl;
    std::cout << f << std::endl;

    fhand.close();
}

We compiled the above code using the following command.

$ g++ binaryIO.cpp -o binaryIO

We ran the executable and got the following outputs.

$ ./binaryIO 
Encoding values:
-10000
100
WE
Bit sequence in the file:
11110000 11011000 11111111 11111111 01100100 00000000 01010111 01000101 00000000
Decoded values:
-10000
100
WE

Be aware that the bytes of each numeric value are saved in reverse order, which is an endianness artifact of my x86 platform (the x86 architecture uses the little-endian convention). Except for this, everything saved to the file matches our expectation.

The size of the saved file is exactly 9 bytes.

$ ls -lh data.bin 
-rw-r--r-- 1 leimao leimao 9 Dec 22 15:15 data.bin

In addition, because of the endianness artifact, to decode a binary file on another platform, we have to know the endianness of the encoding platform. Otherwise, the values might not be decoded correctly.
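One common way to avoid depending on the endianness of the encoding platform is to pick a byte order for the file and build the bytes explicitly with shifts. The following is a minimal sketch of that idea for a 32-bit signed integer. It is not what binaryIO.cpp does, and the function names toBigEndian and fromBigEndian are hypothetical.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helper: encode a 32-bit signed value as 4 bytes,
// most significant byte first, regardless of the host byte order.
std::array<unsigned char, 4> toBigEndian(std::int32_t value)
{
    const std::uint32_t u = static_cast<std::uint32_t>(value);
    std::array<unsigned char, 4> bytes{};
    bytes[0] = static_cast<unsigned char>((u >> 24) & 0xFF);
    bytes[1] = static_cast<unsigned char>((u >> 16) & 0xFF);
    bytes[2] = static_cast<unsigned char>((u >> 8) & 0xFF);
    bytes[3] = static_cast<unsigned char>(u & 0xFF);
    return bytes;
}

// Hypothetical helper: rebuild the value from 4 bytes stored most significant byte first.
std::int32_t fromBigEndian(const std::array<unsigned char, 4>& bytes)
{
    const std::uint32_t u = (static_cast<std::uint32_t>(bytes[0]) << 24)
                          | (static_cast<std::uint32_t>(bytes[1]) << 16)
                          | (static_cast<std::uint32_t>(bytes[2]) << 8)
                          | static_cast<std::uint32_t>(bytes[3]);
    // Copy the bit pattern back into a signed value of the same width.
    std::int32_t value = 0;
    std::memcpy(&value, &u, sizeof(value));
    return value;
}
```

With this approach, -10000 is always stored as the bytes FF FF D8 F0, on little-endian and big-endian machines alike, at the cost of a few extra shift operations per value.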

Text Mode

Text mode is nothing special: it converts the data to a string, and uses the binary representation of that string to represent the data.

Writing File

Because the encoding and decoding methods for string characters, such as ASCII and UTF-8, have already been implemented, the user does not have to implement them, and only has to let the program know which one to use. Data that can be converted to strings can be saved using text mode. However, data that cannot be converted to strings cannot be saved using text mode.

When more than one value is saved into the file, it is the user's responsibility to make sure the text can be parsed back into separate values. Usually, the user would use a special delimiter such as \n to separate different values.
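The delimiter idea can be sketched in memory with string streams, which behave like file streams for this purpose. This is only an illustration under simplifying assumptions: the names encodeAsText, decodeFromText, and ParsedValues are hypothetical, the values themselves must not contain the delimiter, and the string is written without a null terminator.

```cpp
#include <sstream>
#include <string>

// Hypothetical holder for the three example values.
struct ParsedValues
{
    int a;
    unsigned short b;
    std::string c;
};

// Write the values as text, each followed by the '\n' delimiter.
std::string encodeAsText(int a, unsigned short b, const std::string& c)
{
    std::ostringstream out;
    out << a << '\n' << b << '\n' << c << '\n';
    return out.str();
}

// Split the text on '\n' and convert each piece back to its original type.
ParsedValues decodeFromText(const std::string& text)
{
    std::istringstream in(text);
    std::string line;
    ParsedValues v{};
    std::getline(in, line);
    v.a = std::stoi(line);
    std::getline(in, line);
    v.b = static_cast<unsigned short>(std::stoul(line));
    std::getline(in, v.c);
    return v;
}
```

The decoder still needs to know the order and the types of the values. The delimiter only tells it where each value ends.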

Reading File

Similarly, because the base unit of text is the character, reading the file in text mode only requires the program to read the file byte by byte and decode the bytes to characters using the decoding method the user specified. With ASCII, each byte is exactly one character.

Expected Output of the Example Using Text Mode

00101101 00110001 00110000 00110000 00110000 00110000 00110001 00110000 00110000 01010111 01000101 00000000

When the computer sees the binary sequence from the text file, since each byte is a character, it just decodes the bytes to characters one by one.

Code Example

We implemented a code example textIO.cpp for saving data in text mode.

#include <bitset>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Print every byte of the file as a group of 8 bits.
void printBitSequenceFromFile(const std::string& filename)
{
    std::fstream fhand(filename, std::ios::binary | std::ios::in);
    // One char is exactly one byte
    char c;
    // read() fails once the end of the file is reached, which ends the loop
    while (fhand.read(&c, sizeof(c)))
    {
        std::cout << std::bitset<8>(static_cast<unsigned char>(c)) << " ";
    }
    std::cout << std::endl;
    fhand.close();
}

int main()
{
    std::string filename = "data.txt";
    char delimiter = '\n';
    signed int a = -10000;
    unsigned short b = 100;
    const char c[] = "WE";
    // sizeof counts the null terminator, so str_size is 3
    const std::size_t str_size = sizeof(c);

    std::cout << "Encoding values:" << std::endl;
    std::cout << a << std::endl;
    std::cout << b << std::endl;
    std::cout << c << std::endl;

    std::fstream fhand;
    // trunc will clear the file
    fhand.open(filename, std::ios::trunc | std::ios::out);

    // operator<< converts each number to its decimal characters
    fhand << a << delimiter;
    fhand << b << delimiter;
    // Writes 'W', 'E', and the null terminator
    fhand.write(c, str_size);
    fhand << delimiter;

    fhand.close();

    std::cout << "Bit sequence in the file: " << std::endl;
    printBitSequenceFromFile(filename);

    char d_str[255];
    signed int d;
    char e_str[255];
    unsigned short e;
    // Large enough for getline to reach the delimiter without setting failbit
    char f[255];

    fhand.open(filename, std::ios::in);

    fhand.getline(d_str, 255, delimiter);
    fhand.getline(e_str, 255, delimiter);
    fhand.getline(f, 255, delimiter);

    // Convert C string back to the original type
    d = static_cast<signed int>(std::atoi(d_str));
    e = static_cast<unsigned short>(std::atoi(e_str));

    std::cout << "Decoded values:" << std::endl;
    std::cout << d << std::endl;
    std::cout << e << std::endl;
    std::cout << f << std::endl;

    fhand.close();
}

We compiled the above code using the following command.

$ g++ textIO.cpp -o textIO

We ran the executable and got the following outputs.

$ ./textIO
Encoding values:
-10000
100
WE
Bit sequence in the file:
00101101 00110001 00110000 00110000 00110000 00110000 00001010 00110001 00110000 00110000 00001010 01010111 01000101 00000000 00001010
Decoded values:
-10000
100
WE

Note that 00001010 is the delimiter \n. Except for the three delimiters we inserted, everything else matches our expectations.

The size of the saved file is exactly 15 bytes. Even if we do not count the three delimiters inserted, the size would be 12 bytes.

$ ls -lh data.txt 
-rw-r--r-- 1 leimao leimao 15 Dec 22 15:34 data.txt

Conclusions

Writing data in binary mode usually takes less disk space or memory than writing the same data in text mode. In our example, the binary file takes 9 bytes while the text file takes 15 bytes. That's why large data storage and low-latency file transmission often use binary formats.
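The size difference depends on the values being stored. As a rough sketch, the following compares the number of bytes needed for a 32-bit integer in its raw binary form with the number of characters in its decimal text form, ignoring delimiters. The function names textSize and binarySize are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: bytes needed to store the value as decimal text,
// not counting any delimiter.
std::size_t textSize(std::int32_t value)
{
    return std::to_string(value).size();
}

// Hypothetical helper: bytes needed to store the raw value in binary form.
constexpr std::size_t binarySize()
{
    return sizeof(std::int32_t);
}
```

For -10000, the text form needs 6 bytes against 4 bytes in binary, and the gap grows to 11 bytes against 4 for the smallest 32-bit value. A single-digit number, on the other hand, needs only 1 byte as text, so the binary form is not smaller in every case.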

The shortcoming of binary mode is that you have to know the data structure and the exact method for decoding the data. Implementing a decoding method for each specific data structure is time-consuming. However, with the rise of libraries that handle binary encoding and decoding for different data structures, such as Google's Protocol Buffers, reading and writing binary files has become much easier for most common data structures.

References

Binary VS Text Mode for File I/O Operations

https://leimao.github.io/blog/File-IO-Binary-VS-Text/

Author

Lei Mao

Posted on

12-22-2019

Updated on

09-16-2022
