Binary VS Text Mode for File I/O Operations
Introduction
When a program reads or writes a file, there are usually two modes to choose from: text mode, which is typically the default, and binary mode. Intuitively, in text mode the program writes data to the file as text characters, and in binary mode it writes data to the file as raw 0/1 bits. While it sounds trivial to distinguish the two modes, people sometimes get confused. Since the computer only reads and writes binary data, where does text mode come from?
In this blog post, I am going to talk about the conceptual difference between the text mode and the binary mode, and discuss some caveats of using them.
Example
We have a signed int -10000, an unsigned short 100, and a C string WE. Their binary sequence representations on a 64-bit computer are as follows.
signed int -10000:
11111111 11111111 11011000 11110000
std::string -10000:
00101101 00110001 00110000 00110000 00110000 00110000
unsigned short 100:
00000000 01100100
std::string 100:
00110001 00110000 00110000
C string WE:
01010111 01000101 00000000
Note that a C string always ends with a null byte (0).
When we save the three values to file, the binary sequence of the file is simply a concatenation of all the values.
Binary Mode
The binary mode is very easy to understand. Every piece of data on the computer is represented as a binary sequence in memory or on the hard drive.
Writing File
To save the data in binary mode, we simply take the exact binary sequence representing the data, and save it to the file. Nothing fancy.
Reading File
Because the saved file has no knowledge about the data structure of its content, to read the data saved in binary mode, the users would need to implement the decoding method themselves.
Expected Output of the Example Using Binary Mode
11111111 11111111 11011000 11110000 00000000 01100100 01010111 01000101 00000000
When the computer sees the binary sequence from the binary file, it has no clue how to decode it back to the original values. It is the user's responsibility to tell the computer that the first 4 bytes represent a signed int, the next 2 bytes represent an unsigned short, and the next 3 bytes represent a C string, so that the computer knows how to decode them.
Code Example
We implemented a code example binaryIO.cpp for saving data in binary mode.
We compiled the above code using the following command.
$ g++ binaryIO.cpp -o binaryIO
We ran the executable and got the following outputs.
$ ./binaryIO
Be aware that the numeric values are saved with the order of bytes reversed, which is an endianness artifact of my x86 platform (x86 architecture uses the little-endian convention). Except for this, everything saved to the file matches our expectation.
The size of the saved file is exactly 9 bytes.
$ ls -lh data.bin
In addition, because of the endianness artifact, to decode a binary file on another platform, we have to know the endianness of the platform that encoded it. Otherwise, the values might not be decoded correctly.
Text Mode
The text mode is nothing special: it converts the data to a string and uses the binary representation of that string to represent the data.
Writing File
Because the encoding and decoding methods for characters, such as ASCII and UTF-8, have already been implemented, the user does not have to implement them and only has to let the program know which one to use. Data that can be converted to strings can be saved using the text mode. Data that cannot be converted to strings cannot be saved using the text mode.
When there is more than one value to be saved into the file, it is the user's responsibility to make sure the text can be parsed back into separate values. Usually, the user would use a special delimiter such as \n to separate different values.
Reading File
Similarly, because the base unit of text is the character, reading a file in text mode only requires the program to read the file byte by byte and decode the bytes into characters using the decoding method the user specified. With ASCII each byte is one character, whereas with UTF-8 a single character may span several bytes.
Expected Output of the Example Using Text Mode
00101101 00110001 00110000 00110000 00110000 00110000 00110001 00110000 00110000 01010111 01000101 00000000
When the computer sees the binary sequence from the text file, since each byte is a character, it would just decode byte to character one by one.
Code Example
We implemented a code example textIO.cpp for saving data in text mode.
We compiled the above code using the following command.
$ g++ textIO.cpp -o textIO
We ran the executable and got the following outputs.
$ ./textIO
Note that 00001010 is the delimiter \n. Except for the three delimiters we inserted, everything else matches our expectations.
The size of the saved file is exactly 15 bytes. Even if we do not count the three delimiters inserted, the size would be 12 bytes.
$ ls -lh data.txt
Conclusions
Writing data in binary mode usually takes less disk space or memory than writing the same data in text mode. In our example, the binary file takes 9 bytes while the text file takes 15 bytes. That's why large data storage and low-latency file transmission often use binary formats.
The shortcoming of the binary mode is that you have to know the data structure and the exact method for decoding the data. Implementing a decoding method for each specific data structure is time-consuming. However, with the rise of libraries that handle binary encoding and decoding for different data structures, such as Google's Protocol Buffers, writing and reading binary files has become much easier for most common data structures.