272
PART 4
Struggling with Big Data
To understand how rethinking encoding can help in compression, start with the
first reason. Scientists working on the Genome Project around 2008 (
https://
www.genome.gov/10001772/all-about-the--human-genome-project-hgp/
)
managed to drastically reduce the size of their data using a simple encoding trick.
Using this trick made the task of mapping the entire human DNA simpler, helping
the scientists understand more about the life, disease, and death scripted into our
body cells.
Scientists describe DNA using sequences of the letters A, C, T, and G (representing
the four nucleotides present in all living beings). The human genome contains six
billion nucleotides (you find them associated in couples, called bases) that add up
to more than 50GB using ASCII encoding. In fact, you can represent A, C, T, and G
in ASCII encoding as follows:
print (' '.join(['{0:08b}'.format(ord(l))
for l in "ACTG"]))
01000001 01000011 01010100 01000111
The sum of the preceding line is 32 bits, but because DNA maps just four charac-
ters, you can use 2 bits each, saving 75 percent of the previously used bits:
00 01 10 11
Such a gain demonstrates the reason to choose the right encoding. The encoding
works fine in this case because the DNA alphabet is made of four letters, and using
a full ASCII table based on 8 bits is overkill. If a problem requires that you use the
complete ASCII alphabet, you can’t compress the data by redefining the encoding
used. Instead, you have to approach the problem using Huffman compression.
If you can’t shrink the character encoding (or you have already done it), you can
still shrink long sequences, reducing them to a simpler encoding. Observe how
binary data can repeat long sequences of ones and zeros:
00000000 00000000 01111111 11111111 10000011 11111111
In this case, the sequence starts from zero. You can therefore count the number of
zeros, and then count the number of ones that follow, and then repeat with the
next count of zeros, and so on. Because the sequence has only zeros and ones, you
can count them and obtain sequence of counts to compress the data. In this case,
the data compresses into values of 17 15 5 10. Translating these counts into bytes
shortens the initial data in an easily reversible way:
00010001 00001111 00000101 00001010