Lei Mao

Introduction

When we read or write files in a program, there are usually two modes to choose from: text mode, which is usually the default, and binary mode. In text mode, the program writes data to the file as text characters; in binary mode, it writes the data as raw 0/1 bits. Distinguishing the two modes sounds trivial, yet people sometimes get confused. Since the computer only reads and writes binary data, where does text mode come from?


In this blog post, I am going to talk about the conceptual difference between text mode and binary mode, and discuss some caveats of using them.

Example

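We have a signed int -10000, an unsigned short 100, and a C string WE. Their binary representations on a 64-bit computer, where an int takes 4 bytes and a short takes 2 bytes, are as follows. For comparison, the two numbers are also shown after conversion to strings.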


signed int -10000:

11111111 11111111 11011000 11110000

std::string -10000:

00101101 00110001 00110000 00110000 00110000 00110000

unsigned short 100:

00000000 01100100

std::string 100:

00110001 00110000 00110000

C string WE:

01010111 01000101 00000000

Note that a C string always ends with a null byte (0).


When we save the three values to file, the binary sequence of the file is simply a concatenation of all the values.
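The representations above can be reproduced with a few lines of C++. The following is a minimal sketch, not part of the original example programs; the helper names valueBits and textBits are hypothetical, and the fixed-width types std::int32_t and std::uint16_t are assumed to stand in for the 4-byte int and the 2-byte unsigned short.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: bits of an integer value, most significant byte first.
template <typename T>
std::string valueBits(T value)
{
    // Widen first so the shifts below are well defined for every integer type.
    const auto wide = static_cast<unsigned long long>(value);
    std::string out;
    for (std::size_t i = sizeof(T); i-- > 0;)
    {
        if (!out.empty())
        {
            out += ' ';
        }
        out += std::bitset<8>((wide >> (8 * i)) & 0xFF).to_string();
    }
    return out;
}

// Hypothetical helper: bits of each character of a string, in order.
std::string textBits(const std::string& text)
{
    std::string out;
    for (const char ch : text)
    {
        if (!out.empty())
        {
            out += ' ';
        }
        out += std::bitset<8>(static_cast<unsigned char>(ch)).to_string();
    }
    return out;
}
```

Calling valueBits<std::int32_t>(-10000) and textBits(std::to_string(-10000)) gives the first two sequences listed above. The gap between the two is the whole difference between saving a number as binary data and saving it as text.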

Binary Mode

Binary mode is easy to understand. Every piece of data on the computer is represented as a binary sequence, whether in memory or on the hard drive.

Writing File

To save the data in binary mode, we simply take the exact binary sequence representing the data, and save it to the file. Nothing fancy.

Reading File

Because the saved file has no knowledge about the data structure of its content, to read the data saved in binary mode, the users would need to implement the decoding method themselves.

Expected Output of the Example Using Binary Mode

11111111 11111111 11011000 11110000 00000000 01100100 01010111 01000101 00000000

When the computer sees the binary sequence from the binary file, it has no clue how to decode it back to the original values. It is the user’s responsibility to tell the computer that the first 4 bytes represent a signed int, the next 2 bytes represent an unsigned short, and the next 3 bytes represent a C string, so that the computer knows how to decode them.
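To make the "user supplies the layout" idea concrete, here is a small in-memory sketch that does not touch the file system. The Record struct and the functions encodeRecord and decodeRecord are hypothetical names introduced for illustration. The sketch assumes the buffer is well formed (long enough and null-terminated) and that it is decoded on a machine with the same byte order as the one that encoded it.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical record mirroring the example: int, unsigned short, C string.
struct Record
{
    std::int32_t a;
    std::uint16_t b;
    std::string c;
};

// Append the raw bytes of a trivially copyable value to the buffer.
template <typename T>
void appendBytes(std::vector<char>& buf, const T& value)
{
    const char* p = reinterpret_cast<const char*>(&value);
    buf.insert(buf.end(), p, p + sizeof(T));
}

// Encoding is just a concatenation of the raw bytes of each field.
std::vector<char> encodeRecord(const Record& r)
{
    std::vector<char> buf;
    appendBytes(buf, r.a);
    appendBytes(buf, r.b);
    buf.insert(buf.end(), r.c.begin(), r.c.end());
    buf.push_back('\0'); // C string terminator
    return buf;
}

// Decoding only works because we know the layout: 4 bytes, 2 bytes, C string.
// No bounds checking is done; the buffer is assumed to be well formed.
Record decodeRecord(const std::vector<char>& buf)
{
    Record r{};
    std::size_t offset = 0;
    std::memcpy(&r.a, buf.data() + offset, sizeof(r.a));
    offset += sizeof(r.a);
    std::memcpy(&r.b, buf.data() + offset, sizeof(r.b));
    offset += sizeof(r.b);
    r.c = std::string(buf.data() + offset); // reads up to the null byte
    return r;
}
```

If decodeRecord read the fields in a different order or with different sizes, it would still return values, just not the ones that were saved. Nothing in the bytes themselves prevents that.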

Code Example

We implemented a code example binaryIO.cpp for saving data in binary mode.

#include <iostream>
#include <fstream>
#include <bitset>
#include <string>

void printBitSequenceFromFile(std::string filename)
{
    std::fstream fhand(filename, fhand.binary | fhand.in);
    // One char is exactly one byte
    char c;
    while (fhand.read(reinterpret_cast<char*>(&c), sizeof(c)))
    {
        if (!fhand.eof())
        {
            std::cout << std::bitset<8>(c) << " ";
        }
    }
    std::cout << std::endl;
    fhand.close();
}

int main()
{
    std::string filename = "data.bin";
    signed int a = -10000;
    unsigned short b = 100;
    const char c[] = "WE";
    const size_t str_size = sizeof(c);

    std::cout << "Encoding values:" << std::endl;
    std::cout << a << std::endl;
    std::cout << b << std::endl;
    std::cout << c << std::endl;

    std::fstream fhand;
    // trunc will clear the file
    fhand.open(filename, fhand.binary | fhand.trunc | fhand.out);

    fhand.write(reinterpret_cast<char*>(&a), sizeof(a));
    fhand.write(reinterpret_cast<char*>(&b), sizeof(b));
    fhand.write(c, str_size);

    fhand.close();

    std::cout << "Bit sequence in the file: " << std::endl;
    printBitSequenceFromFile(filename);

    signed int d;
    unsigned short e;
    char f[str_size];

    fhand.open(filename, fhand.binary | fhand.in);

    fhand.read(reinterpret_cast<char*>(&d), sizeof(d));
    fhand.read(reinterpret_cast<char*>(&e), sizeof(e));
    fhand.read(f, str_size);
    std::cout << "Decoded values:" << std::endl;
    std::cout << d << std::endl;
    std::cout << e << std::endl;
    std::cout << f << std::endl;

    fhand.close();
}

We compiled the above code using the following command.

$ g++ binaryIO.cpp -o binaryIO

We ran the executable and got the following outputs.

$ ./binaryIO 
Encoding values:
-10000
100
WE
Bit sequence in the file: 
11110000 11011000 11111111 11111111 01100100 00000000 01010111 01000101 00000000 
Decoded values:
-10000
100
WE

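Be aware that each numeric value is saved with its bytes in reversed order, least significant byte first. This is not an artifact of C/C++. It is the little-endian byte order of the machine we ran the program on (x86-64 processors, for example, are little-endian), and write simply copies the bytes as they are laid out in memory. Except for this, everything saved to the file matches our expectation.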
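The byte order seen in the file reflects how the machine lays out multi-byte integers in memory. The following is a minimal sketch, with the hypothetical helper names isLittleEndian and toBigEndianBytes, of how a program could check the host byte order and produce a layout that does not depend on it.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Returns true if the least significant byte is stored first in memory.
bool isLittleEndian()
{
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Splits a 32-bit value into bytes, most significant byte first,
// regardless of the byte order of the host machine.
std::array<unsigned char, 4> toBigEndianBytes(std::uint32_t v)
{
    return {{static_cast<unsigned char>((v >> 24) & 0xFF),
             static_cast<unsigned char>((v >> 16) & 0xFF),
             static_cast<unsigned char>((v >> 8) & 0xFF),
             static_cast<unsigned char>(v & 0xFF)}};
}
```

Writing the array returned by toBigEndianBytes, instead of the raw memory of the integer, is one way to make a binary file readable on machines with a different byte order. The reader must then reassemble the value with the matching shifts.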


The size of the saved file is exactly 9 bytes.

$ ls -lh data.bin 
-rw-r--r-- 1 leimao leimao 9 Dec 22 15:15 data.bin

Text Mode

Text mode is nothing special: it converts the data to a string and uses the binary representation of that string to represent the data.

Writing File

Because encoding and decoding methods for string characters, such as ASCII and UTF-8, have already been implemented, the user does not have to implement them and only needs to tell the program which one to use. Data that can be converted to strings, such as numbers, can also be saved in text mode. However, data that cannot be converted to strings cannot be saved in text mode.


When more than one value is saved to the file, it is the user’s responsibility to make the text parsable and to parse it when reading. Usually, the user inserts a special delimiter such as \n between the values.
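Parsing delimiter-separated text can be sketched as follows. The function name splitFields is hypothetical, and the sketch assumes that the delimiter never appears inside a value.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits text into fields at each occurrence of the delimiter.
// Assumes the delimiter does not occur inside a field.
std::vector<std::string> splitFields(const std::string& text, char delimiter)
{
    std::vector<std::string> fields;
    std::istringstream stream(text);
    std::string field;
    while (std::getline(stream, field, delimiter))
    {
        fields.push_back(field);
    }
    return fields;
}
```

Each field is still a string at this point. Turning "-10000" back into an integer takes a separate conversion step, for example std::stoi, and the user has to know which fields are numbers.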

Reading File

Similarly, because the base unit of text is the character, reading a file in text mode only requires the program to read the file byte by byte and decode each byte to a character (for a single-byte encoding such as ASCII) using the decoding method the user specified.

Expected Output of the Example Using Text Mode

00101101 00110001 00110000 00110000 00110000 00110000 00110001 00110000 00110000 01010111 01000101 00000000

When the computer sees the binary sequence from the text file, since each byte is a character, it simply decodes the bytes to characters one by one.

Code Example

We implemented a code example textIO.cpp for saving data in text mode.

#include <iostream>
#include <fstream>
#include <bitset>
#include <string>
#include <cstdlib> 

void printBitSequenceFromFile(std::string filename)
{
    std::fstream fhand(filename, fhand.binary | fhand.in);
    // One char is exactly one byte
    char c;
    while (fhand.read(reinterpret_cast<char*>(&c), sizeof(c)))
    {
        if (!fhand.eof())
        {
            std::cout << std::bitset<8>(c) << " ";
        }
    }
    std::cout << std::endl;
    fhand.close();
}

int main()
{
    std::string filename = "data.txt";
    char delimiter = '\n';
    signed int a = -10000;
    unsigned short b = 100;
    const char c[] = "WE";
    const size_t str_size = sizeof(c);

    std::cout << "Encoding values:" << std::endl;
    std::cout << a << std::endl;
    std::cout << b << std::endl;
    std::cout << c << std::endl;

    std::fstream fhand;
    // trunc will clear the file
    fhand.open(filename, fhand.trunc | fhand.out);

    // Implicitly write character by character
    fhand << a << delimiter;
    fhand << b << delimiter;
    fhand.write(c, str_size);
    fhand << delimiter;

    fhand.close();

    std::cout << "Bit sequence in the file: " << std::endl;
    printBitSequenceFromFile(filename);

    char d_str[255];
    signed int d;
    char e_str[255];
    unsigned short e;
    char f[str_size];

    fhand.open(filename, fhand.in);

    fhand.getline(d_str, 255, delimiter);
    fhand.getline(e_str, 255, delimiter);
    fhand.getline(f, str_size, delimiter);

    // Convert C string back to the original type
    d = static_cast<signed int>(atoi(d_str));
    e = static_cast<unsigned short>(atoi(e_str));

    std::cout << "Decoded values:" << std::endl;
    std::cout << d << std::endl;
    std::cout << e << std::endl;
    std::cout << f << std::endl;

    fhand.close();
}

We compiled the above code using the following command.

$ g++ textIO.cpp -o textIO

We ran the executable and got the following outputs.

$ ./textIO
Encoding values:
-10000
100
WE
Bit sequence in the file: 
00101101 00110001 00110000 00110000 00110000 00110000 00001010 00110001 00110000 00110000 00001010 01010111 01000101 00000000 00001010 
Decoded values:
-10000
100
WE

Note that 00001010 is the delimiter \n. Except for the three delimiters we inserted, everything else matches our expectations.


The size of the saved file is exactly 15 bytes. Even if we do not count the three delimiters inserted, the size would be 12 bytes.

$ ls -lh data.txt 
-rw-r--r-- 1 leimao leimao 15 Dec 22 15:34 data.txt

Conclusions

Writing data in binary mode usually takes less disk space or memory than writing the same data in text mode. That’s why large-scale data storage and low-latency file transmission often use binary formats.
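How much space is saved depends on the values being stored. As a rough sketch, with the hypothetical helper names textSize and binarySize and assuming a 4-byte integer, the two sizes can be compared like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bytes needed to store the value as decimal text, without any delimiter.
std::size_t textSize(std::int32_t value)
{
    return std::to_string(value).size();
}

// Bytes needed to store the value in its raw binary form.
constexpr std::size_t binarySize()
{
    return sizeof(std::int32_t);
}
```

For -10000, the text form takes 6 bytes against 4 bytes in binary. A small value such as 7 takes a single byte as text, so the binary form wins mainly for large values and for large volumes of numeric data.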


The shortcoming of binary mode is that you must know the data structure and the exact method for decoding the data. Implementing a decoding method for each specific data structure is time-consuming. However, with the rise of libraries that handle binary encoding and decoding for different data structures, such as Google’s Protocol Buffers, writing and reading binary files has become much easier for most common data structures.
