 ### Lei Mao


# Binary VS Text Mode for File I/O Operations

### Introduction

When a program reads or writes a file, it usually has two modes to choose from: text mode, which is typically the default, and binary mode. Roughly speaking, in text mode the program writes data to the file as text characters, and in binary mode the program writes the raw bytes of the data. While the distinction sounds trivial, people sometimes get confused. Since a computer only reads and writes binary data, where does text mode come from?

In this blog post, I am going to talk about the conceptual difference between text mode and binary mode, and discuss some caveats of using them.

### Example

We have a signed int -10000, an unsigned short 100, and a C string WE. Their binary sequence representations on a 64-bit computer, written with the most significant byte first, are as follows.

signed int -10000:

11111111 11111111 11011000 11110000


std::string -10000:

00101101 00110001 00110000 00110000 00110000 00110000


unsigned short 100:

00000000 01100100


std::string 100:

00110001 00110000 00110000


C string WE:

01010111 01000101 00000000


Note that a C string always ends with a terminating 0 byte.

When we save the three values to a file, the binary sequence of the file is simply the concatenation of the binary sequences of all the values.
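The bit sequences listed above can be reproduced with `std::bitset`. The following is a minimal sketch, not part of the original example programs; the helper names `bitsOfInt32`, `bitsOfUInt16`, and `bitsOfText` are hypothetical, and the fixed-width types `std::int32_t` and `std::uint16_t` stand in for a 4-byte `int` and a 2-byte `unsigned short`.

```cpp
#include <bitset>
#include <cstdint>
#include <string>

// Bits of a 32-bit signed integer, most significant bit first.
std::string bitsOfInt32(std::int32_t value)
{
    return std::bitset<32>(static_cast<std::uint32_t>(value)).to_string();
}

// Bits of a 16-bit unsigned integer, most significant bit first.
std::string bitsOfUInt16(std::uint16_t value)
{
    return std::bitset<16>(value).to_string();
}

// Bits of every character of a string, one byte per character.
// Set includeTerminator to true to append the trailing 0 of a C string.
std::string bitsOfText(const std::string& text, bool includeTerminator)
{
    std::string bits;
    for (unsigned char ch : text)
    {
        bits += std::bitset<8>(ch).to_string();
    }
    if (includeTerminator)
    {
        bits += "00000000";
    }
    return bits;
}
```

For instance, `bitsOfText("WE", true)` yields the 24 bits shown for the C string WE, without the spaces between bytes.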

### Binary Mode

Binary mode is very easy to understand. Every piece of data on the computer is represented as a binary sequence in memory or on the hard drive.

#### Writing File

To save the data in binary mode, we simply take the exact binary sequence representing the data, and save it to the file. Nothing fancy.

Because the saved file carries no information about the structure of its content, users have to implement the decoding method themselves in order to read data saved in binary mode.

#### Expected Output of the Example Using Binary Mode

11111111 11111111 11011000 11110000 00000000 01100100 01010111 01000101 00000000


When the computer sees the binary sequence from the binary file, it has no clue how to decode it back to the original values. It is the user’s responsibility to tell the computer that the first 4 bytes represent a signed int, the next 2 bytes represent an unsigned short, and the last 3 bytes represent a C string, so that the computer knows how to decode them.
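To make the idea of "telling the computer the layout" concrete, here is a small in-memory sketch that encodes the three values into a byte buffer and decodes them again by walking the buffer with the same field sizes. The `Record` struct and the functions `encodeRecord` and `decodeRecord` are hypothetical names introduced only for illustration, and the decoder assumes the buffer is well formed (it contains all fields and the terminating 0).

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical record mirroring the example: an int, an unsigned short,
// and a string that is stored as a 0-terminated C string.
struct Record
{
    int number;
    unsigned short count;
    std::string text;
};

// Concatenate the raw bytes of each field, as in the binary-mode example.
std::vector<char> encodeRecord(const Record& record)
{
    std::vector<char> bytes(sizeof(record.number) + sizeof(record.count) +
                            record.text.size() + 1);
    char* cursor = bytes.data();
    std::memcpy(cursor, &record.number, sizeof(record.number));
    cursor += sizeof(record.number);
    std::memcpy(cursor, &record.count, sizeof(record.count));
    cursor += sizeof(record.count);
    // c_str() is 0-terminated, so copying size() + 1 bytes keeps the 0.
    std::memcpy(cursor, record.text.c_str(), record.text.size() + 1);
    return bytes;
}

// Decode by reading the fields in the same order and with the same sizes
// that were used for encoding. The bytes carry no type information.
Record decodeRecord(const std::vector<char>& bytes)
{
    Record record{};
    const char* cursor = bytes.data();
    std::memcpy(&record.number, cursor, sizeof(record.number));
    cursor += sizeof(record.number);
    std::memcpy(&record.count, cursor, sizeof(record.count));
    cursor += sizeof(record.count);
    // The remaining bytes are a C string; std::string stops at the 0.
    record.text = std::string(cursor);
    return record;
}
```

If the decoder read the fields in a different order or with different sizes, it would still produce values, just not the ones that were saved. That is why the layout has to be agreed on outside the file.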

#### Code Example

We implemented a code example binaryIO.cpp for saving data in binary mode.

    #include <bitset>
    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <string>

    // Print every byte of the file as a sequence of 8 bits.
    void printBitSequenceFromFile(const std::string& filename)
    {
        std::fstream fhand(filename, std::ios::binary | std::ios::in);
        // One char is exactly one byte
        char c;
        // read() fails once the end of the file is reached, which ends the loop
        while (fhand.read(&c, sizeof(c)))
        {
            std::cout << std::bitset<8>(static_cast<unsigned char>(c)) << " ";
        }
        std::cout << std::endl;
        fhand.close();
    }

    int main()
    {
        std::string filename = "data.bin";
        signed int a = -10000;
        unsigned short b = 100;
        const char c[] = "WE";
        // sizeof includes the terminating 0 of the C string
        const std::size_t str_size = sizeof(c);

        std::cout << "Encoding values:" << std::endl;
        std::cout << a << std::endl;
        std::cout << b << std::endl;
        std::cout << c << std::endl;

        std::fstream fhand;
        // trunc will clear the file
        fhand.open(filename, fhand.binary | fhand.trunc | fhand.out);

        // Write the in-memory bytes of each value as they are
        fhand.write(reinterpret_cast<char*>(&a), sizeof(a));
        fhand.write(reinterpret_cast<char*>(&b), sizeof(b));
        fhand.write(c, str_size);

        fhand.close();

        std::cout << "Bit sequence in the file: " << std::endl;
        printBitSequenceFromFile(filename);

        signed int d;
        unsigned short e;
        char f[str_size];

        fhand.open(filename, fhand.binary | fhand.in);

        // Read the values back in the same order and with the same sizes
        fhand.read(reinterpret_cast<char*>(&d), sizeof(d));
        fhand.read(reinterpret_cast<char*>(&e), sizeof(e));
        fhand.read(f, str_size);

        std::cout << "Decoded values:" << std::endl;
        std::cout << d << std::endl;
        std::cout << e << std::endl;
        std::cout << f << std::endl;

        fhand.close();
    }


We compiled the above code using the following command.

    $ g++ binaryIO.cpp -o binaryIO

We ran the executable and got the following outputs.

    $ ./binaryIO
    Encoding values:
    -10000
    100
    WE
    Bit sequence in the file:
    11110000 11011000 11111111 11111111 01100100 00000000 01010111 01000101 00000000
    Decoded values:
    -10000
    100
    WE


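Be aware that the bytes of each numeric value are saved in reverse order. This is not an artifact of C/C++ itself but of the byte order of the machine: the program wrote the in-memory representation as is, and the machine that produced this output is little-endian, so the least significant byte comes first. Apart from this, everything saved to the file matches our expectation.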
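The byte order seen in the file is the byte order the machine uses in memory, with the least significant byte first on a little-endian machine. If the file should have a fixed byte order regardless of the machine, one common approach is to build the bytes with shifts instead of writing the in-memory representation directly. The following is a minimal sketch for a 32-bit value; the function names `toBigEndianBytes` and `fromBigEndianBytes` are hypothetical and not part of binaryIO.cpp.

```cpp
#include <array>
#include <cstdint>

// Bytes of a 32-bit value in big-endian order (most significant byte first),
// independent of the byte order of the machine running the code.
std::array<unsigned char, 4> toBigEndianBytes(std::uint32_t value)
{
    std::array<unsigned char, 4> bytes = {{
        static_cast<unsigned char>((value >> 24) & 0xFF),
        static_cast<unsigned char>((value >> 16) & 0xFF),
        static_cast<unsigned char>((value >> 8) & 0xFF),
        static_cast<unsigned char>(value & 0xFF)
    }};
    return bytes;
}

// Rebuild the value from bytes stored in big-endian order.
std::uint32_t fromBigEndianBytes(const std::array<unsigned char, 4>& bytes)
{
    return (static_cast<std::uint32_t>(bytes[0]) << 24) |
           (static_cast<std::uint32_t>(bytes[1]) << 16) |
           (static_cast<std::uint32_t>(bytes[2]) << 8) |
           static_cast<std::uint32_t>(bytes[3]);
}
```

Writing the four bytes returned by `toBigEndianBytes(static_cast<std::uint32_t>(-10000))` would produce the sequence 11111111 11111111 11011000 11110000 in the file on any machine.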

The size of the saved file is exactly 9 bytes.

    $ ls -lh data.bin
    -rw-r--r-- 1 leimao leimao 9 Dec 22 15:15 data.bin

### Text Mode

Text mode is nothing special: it converts the data to strings and uses the binary representation of those strings to represent the data.

#### Writing File

Because the encoding and decoding methods for characters, such as ASCII and UTF-8, have already been implemented, the user does not have to implement them and only needs to let the program know which one to use.

Data that can be converted to strings, such as numbers, can also be saved using text mode. However, data that cannot be converted to strings cannot be saved using text mode.

When more than one value is saved into the file, it is the user’s responsibility to make the text parsable. Usually the user would use a special delimiter such as \n to separate different values.

#### Reading File

Similarly, because the base unit of text is the character, reading the file in text mode only requires the program to read the file byte by byte and decode the bytes to characters using the decoding method the user specified. With ASCII, each byte is exactly one character.

#### Expected Output of the Example Using Text Mode

00101101 00110001 00110000 00110000 00110000 00110000 00110001 00110000 00110000 01010111 01000101 00000000

When the computer sees the binary sequence from the text file, since each byte is a character, it simply decodes the bytes to characters one by one.

#### Code Example

We implemented a code example textIO.cpp for saving data in text mode.
    #include <bitset>
    #include <cstddef>
    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <string>

    // Print every byte of the file as a sequence of 8 bits.
    void printBitSequenceFromFile(const std::string& filename)
    {
        // Open in binary mode so that the bytes are read back untouched
        std::fstream fhand(filename, std::ios::binary | std::ios::in);
        // One char is exactly one byte
        char c;
        // read() fails once the end of the file is reached, which ends the loop
        while (fhand.read(&c, sizeof(c)))
        {
            std::cout << std::bitset<8>(static_cast<unsigned char>(c)) << " ";
        }
        std::cout << std::endl;
        fhand.close();
    }

    int main()
    {
        std::string filename = "data.txt";
        char delimiter = '\n';
        signed int a = -10000;
        unsigned short b = 100;
        const char c[] = "WE";
        // sizeof includes the terminating 0 of the C string
        const std::size_t str_size = sizeof(c);

        std::cout << "Encoding values:" << std::endl;
        std::cout << a << std::endl;
        std::cout << b << std::endl;
        std::cout << c << std::endl;

        std::fstream fhand;
        // trunc will clear the file
        fhand.open(filename, fhand.trunc | fhand.out);

        // Implicitly write character by character
        fhand << a << delimiter;
        fhand << b << delimiter;
        // Write the C string including its terminating 0
        fhand.write(c, str_size);
        fhand << delimiter;

        fhand.close();

        std::cout << "Bit sequence in the file: " << std::endl;
        printBitSequenceFromFile(filename);

        const std::size_t buffer_size = 255;
        char d_str[buffer_size];
        signed int d;
        char e_str[buffer_size];
        unsigned short e;
        // One extra byte so that getline can take in the 0 stored in the
        // file and still append its own terminator
        char f[str_size + 1];

        fhand.open(filename, fhand.in);

        fhand.getline(d_str, buffer_size, delimiter);
        fhand.getline(e_str, buffer_size, delimiter);
        fhand.getline(f, sizeof(f), delimiter);

        // Convert C string back to the original type
        d = static_cast<signed int>(atoi(d_str));
        e = static_cast<unsigned short>(atoi(e_str));

        std::cout << "Decoded values:" << std::endl;
        std::cout << d << std::endl;
        std::cout << e << std::endl;
        std::cout << f << std::endl;

        fhand.close();
    }

We compiled the above code using the following command.

    $ g++ textIO.cpp -o textIO


We ran the executable and got the following outputs.

    $ ./textIO
    Encoding values:
    -10000
    100
    WE
    Bit sequence in the file:
    00101101 00110001 00110000 00110000 00110000 00110000 00001010 00110001 00110000 00110000 00001010 01010111 01000101 00000000 00001010
    Decoded values:
    -10000
    100
    WE

Note that 00001010 is the delimiter \n. Except for the three delimiters we inserted, everything else matches our expectation.

The size of the saved file is exactly 15 bytes. Even if we do not count the three delimiters inserted, the size would be 12 bytes.

    $ ls -lh data.txt
    -rw-r--r-- 1 leimao leimao 15 Dec 22 15:34 data.txt
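The parsing step can also be written with `std::string` and `std::getline`, which avoids fixed-size character buffers. The sketch below parses delimiter-separated text from memory rather than from a file. The names `ParsedValues` and `parseValues` are hypothetical, and unlike textIO.cpp it assumes the string field was written without the terminating 0.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical holder for the three values of the example.
struct ParsedValues
{
    int number;
    unsigned short count;
    std::string text;
};

// Split the content at the delimiter and convert each field back to its type.
ParsedValues parseValues(const std::string& content, char delimiter)
{
    std::istringstream input(content);
    std::string numberField;
    std::string countField;
    std::string textField;
    if (!std::getline(input, numberField, delimiter) ||
        !std::getline(input, countField, delimiter) ||
        !std::getline(input, textField, delimiter))
    {
        throw std::runtime_error("expected three delimited fields");
    }
    ParsedValues values;
    // std::stoi and std::stoul throw if a field is not a valid number
    values.number = std::stoi(numberField);
    values.count = static_cast<unsigned short>(std::stoul(countField));
    values.text = textField;
    return values;
}
```

Calling `parseValues("-10000\n100\nWE\n", '\n')` recovers the three values. The same function body would work on a `std::ifstream` opened in text mode, since both are input streams.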


### Conclusions

Writing data using binary mode usually takes less disk space or memory than writing the same data using text mode. That’s why large data storage and low latency file transmission often use binary formats.
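The size difference for a single integer can be checked directly. This is a minimal sketch with hypothetical helper names `textSize` and `binarySize`; it counts only the bytes of the value itself and ignores any delimiter.

```cpp
#include <cstddef>
#include <string>

// Number of bytes needed to store the value as decimal text (no delimiter).
std::size_t textSize(int value)
{
    return std::to_string(value).size();
}

// Number of bytes needed to store the raw in-memory representation.
std::size_t binarySize(int value)
{
    return sizeof(value);
}
```

For -10000, `textSize` gives 6 bytes while `binarySize` gives `sizeof(int)`, which is 4 bytes on the machine used in this post. Note that the binary size is fixed, whereas the text size grows with the number of digits, so a very small value such as 7 takes only 1 byte as text.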

The shortcoming of binary mode is that you have to know the data structure and the exact method for decoding the data. Implementing a decoding method for each specific data structure would be time consuming. However, with the rise of libraries that handle binary encoding and decoding for different data structures, such as Google’s Protocol Buffers, reading and writing binary files has become much easier for most common data structures.