Algorithms For Dummies

Download 7,18 Mb.

Pdf ko'rish

bet	457/651
Sana	15.07.2021
Hajmi	7,18 Mb.
	#120357

1 ... 453 454 455 456 457 458 459 460 ... 651

Bog'liq
Algorithms

»

Encoding frequent long sequences of characters efficiently: This is similar

to shrinking long sequences of identical bits, but it works with character

sequences rather than single characters. This is the strategy used by LZW,

which learns data patterns on the fly and creates a short encoding for long

sequences of characters.

272

PART 4

Struggling with Big Data

To understand how rethinking encoding can help in compression, start with the

first reason. Scientists working on the Genome Project around 2008 (

https://

www.genome.gov/10001772/all-about-the--human-genome-project-hgp/

)

managed to drastically reduce the size of their data using a simple encoding trick.

Using this trick made the task of mapping the entire human DNA simpler, helping

the scientists understand more about the life, disease, and death scripted into our

body cells.

Scientists describe DNA using sequences of the letters A, C, T, and G (representing

the four nucleotides present in all living beings). The human genome contains six

billion nucleotides (you find them associated in couples, called bases) that add up

to more than 50GB using ASCII encoding. In fact, you can represent A, C, T, and G

in ASCII encoding as follows:

print (' '.join(['{0:08b}'.format(ord(l))

for l in "ACTG"]))

01000001 01000011 01010100 01000111

The sum of the preceding line is 32 bits, but because DNA maps just four charac-

ters, you can use 2 bits each, saving 75 percent of the previously used bits:

00 01 10 11

Such a gain demonstrates the reason to choose the right encoding. The encoding

works fine in this case because the DNA alphabet is made of four letters, and using

a full ASCII table based on 8 bits is overkill. If a problem requires that you use the

complete ASCII alphabet, you can’t compress the data by redefining the encoding

used. Instead, you have to approach the problem using Huffman compression.

If you can’t shrink the character encoding (or you have already done it), you can

still shrink long sequences, reducing them to a simpler encoding. Observe how

binary data can repeat long sequences of ones and zeros:

00000000 00000000 01111111 11111111 10000011 11111111

In this case, the sequence starts from zero. You can therefore count the number of

zeros, and then count the number of ones that follow, and then repeat with the

next count of zeros, and so on. Because the sequence has only zeros and ones, you

can count them and obtain sequence of counts to compress the data. In this case,

the data compresses into values of 17 15 5 10. Translating these counts into bytes

shortens the initial data in an easily reversible way:

00010001 00001111 00000101 00001010

CHAPTER 14

Download 7,18 Mb.

Do'stlaringiz bilan baham:

1 ... 453 454 455 456 457 458 459 460 ... 651