shorter strings. The first example of such an algorithm is known as the Lempel–Ziv (LZ)
algorithm. The LZ algorithm searches a text file for strings and creates a dictionary of all
strings found. Two common forms of this algorithm are known as LZ77 and LZ78, based
on papers published by the two authors (Lempel and Ziv) in 1977 and 1978, respectively.
The two algorithms work very similarly by replacing each string in the given text file with
the location in the dictionary of the string’s entry. As not all strings are replaced, our final
compressed file will consist of literal characters and replacement locations.
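To make the dictionary-building idea concrete, below is a minimal sketch in the spirit of LZ78 (a simplification, not the exact published algorithm). The function name and the sample string are only illustrative; each output pair is the dictionary index of the longest previously seen prefix followed by the next literal character.

def lz78_compress(text):
    # Dictionary maps strings already seen to their index; index 0 is the empty string.
    dictionary = {"": 0}
    w = ""                 # the prefix we are currently extending
    output = []
    for c in text:
        if w + c in dictionary:
            w += c         # keep extending the longest known match
        else:
            output.append((dictionary[w], c))    # emit (prefix index, literal character)
            dictionary[w + c] = len(dictionary)  # remember the new string
            w = ""
    if w:                  # flush any prefix left over at the end of the text
        output.append((dictionary[w], ""))
    return output

print(lz78_compress("abababababa"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (5, '')]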
Let us assume for an example that we are going to only search for words in a text file.
All blank spaces and punctuation marks will remain in the file. Starting from the begin-
ning of the file, we take each new word found and add it to a growing dictionary. Once we
have compiled our dictionary, we start again at the beginning of the file and this time we
replace each word with its location in the dictionary. We will assume that the locations will
be integer numbers.
Imagine that we have a file of 2000 words in which only 250 are unique. Our dictionary then will consist of 250 entries. We replace each of the 2000 words in the file with a number between 1 and 250 (or 0–249 if you prefer). As each integer in this range can be stored in 1 byte, we are replacing longer words with single bytes. Our new text file might look something like the following. Remember that the dictionary stores the words in alphabetical order, so the sequence of numbers in our example looks random.
205 51 67 147 98 104 16! 205 51 88 140 221. 205 14 171 34 51 67 12 199.
Notice the replaced version of the text still retains spaces and punctuation marks. How
much of a savings does our compressed version provide?
Let us assume that the 250 words average 5 characters apiece. Stored in ASCII, the 2000 words then make up 2000 × 5 = 10,000 bytes. We add to this the spaces and punctuation marks, also stored in ASCII. With 2000 words, there will be approximately 2000 spaces and punctuation marks; let us assume in fact 2250 total spaces and punctuation marks, requiring a further 2250 bytes of storage. So our total file is some 12,250 bytes in length. The compressed version will replace each of the 2000 words with a single-byte integer. Thus, the 10,000 bytes is reduced to 2000 bytes. Adding the 2250 bytes for spaces and punctuation marks gives 4250 bytes. We are not quite done though. We also have to store the dictionary itself, which consists of 250 words with an average of 5 bytes per word, or 1250 bytes. Our compressed file is therefore 5500 bytes long, which is 5500/12,250 = 44.9% of the original, or less than half its size.
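The following sketch mirrors the toy word-dictionary scheme just described (it is not the real LZ algorithm, and it prints indices as decimal digits rather than storing them as single bytes); the arithmetic at the end simply reproduces the numbers assumed in the example above.

import re

def compress_words(text):
    # Build a dictionary of the unique words, kept in alphabetical order as in the example.
    words = re.findall(r"[A-Za-z]+", text)
    dictionary = sorted(set(words))
    index = {w: i for i, w in enumerate(dictionary)}
    # Replace every word with its dictionary position; spaces and punctuation stay literal.
    encoded = re.sub(r"[A-Za-z]+", lambda m: str(index[m.group(0)]), text)
    return encoded, dictionary

print(compress_words("the cat saw the dog."))   # ('3 0 2 3 1.', ['cat', 'dog', 'saw', 'the'])

# Reproducing the savings calculation for the hypothetical 2000-word file:
n_words, unique, avg_len, other = 2000, 250, 5, 2250
original   = n_words * avg_len + other                # 12,250 bytes
compressed = n_words * 1 + other + unique * avg_len   # 2000 + 2250 + 1250 = 5500 bytes
print(compressed / original)                          # about 0.449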
Could we do better? Yes, there are many additional tricks we can play to help reduce
the size. For instance, we did not include any blank spaces and yet nearly every word will
have a blank or punctuation mark afterward. If we were to add these to our words so that,
for instance, we are storing “The ” (with its trailing blank) rather than “The”, we would reduce the compressed file’s size even more, although we would be increasing both the number of entries in the dictionary and the length of each string in the dictionary. We could also search for strings irrespective of word boundaries. For instance, we might find the string “the”, which is found in many words like “the,” “they,” “then,” “them,” “anthem,” “lathe,” and “blithe.” More
complex algorithms will look for the most common substrings and use those rather than
full words. Punctuation marks and digits can also be incorporated into these strings.
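A quick way to see why ignoring word boundaries helps is to count how often each short substring occurs; the sketch below (using a made-up sentence) does exactly that.

from collections import Counter

def common_substrings(text, length=3, top=5):
    # Slide a window of the given length over the text, counting every substring it covers.
    counts = Counter(text[i:i + length] for i in range(len(text) - length + 1))
    return counts.most_common(top)

print(common_substrings("the theme of the anthem is the same"))
# 'the' is counted inside "theme" and "anthem" as well as in the word "the" itself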
Lossless compression algorithms trade off search time for compressed file size. The more time the algorithm can take to compress a file, the smaller the resulting file is likely to be. A user compressing or uncompressing a file, however, may not want to wait a lengthy period of time, and therefore the algorithm will have to search fewer combinations, resulting in a less compressed file.
3.8.3 Other Lossless Compression Algorithms
Aside from the LZ algorithm, another popular algorithm is Burrows–Wheeler. This algorithm first sorts rotations of the text to reorder its characters so that repeated characters tend to end up next to one another. This allows those repetitions to be encoded into shorter strings. To obtain the greatest amount of compression, the algorithm looks for the longest repeated substrings within a group of characters (one word, multiple words, a phrase, a sentence). Once rearranged, compression can take place. Compression uses a substitution code such as Huffman coding.
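As a rough illustration, here is a naive sketch of the Burrows–Wheeler transform itself (real implementations sort far more efficiently and pair the transform with a coder such as Huffman): every rotation of the text is sorted and the last column is kept, which tends to place repeated characters side by side.

def bwt(text, eof="$"):
    # Append a unique end-of-text marker so the transform can later be reversed.
    s = text + eof
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)   # last column of the sorted rotations

print(bwt("banana"))   # 'annb$aa' -- the three a's end up adjacent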
In Huffman coding, strings are replaced such that the more commonly occurring
strings have smaller codes. For instance, if the string “the” appears more times than any-
thing else, it might be given the code 110 while less commonly occurring strings might have
four, five, six, seven, eight, or even nine bits like 1000, 11100, 101011, and so forth.
In Huffman coding, we have to make sure that codes can be uniquely identified when
scanned from left to right. For instance, we could not use 00, 01, 10, 11, 000, 001, 011, 101
in our code because the sequence 000101101 would not have a unique interpretation. We
might think that 000101101 is 00 followed by 01 followed by 011 followed by 01, or it could be
interpreted as 000 followed by 101 followed by 101. A code must be interpretable without
such a conflict. We might therefore pick a better code using sequences like the following:
• Most common string: 000
• Second most common string: 001
• Third most common string: 010
• Fourth most common string: 011
• All other strings start with a 1 bit and are at least 4 bits long
In this way, 00000101100101010100001010101011101010101 could clearly be identified as
starting with 000 001 011 001 010 10100001010101011101010101. As we have not enumer-
ated the rest of the code, after the first five encoded strings, it is unclear what we have, but
the first five are clearly identifiable.
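The sketch below decodes a bit stream with a small prefix-free code shaped like the one above. Only the four 3-bit codes come from the text; the particular words and the 4-bit codes are made up so the example is self-contained.

code = {
    "000": "the",    # most common string
    "001": "and",    # second most common (the word choices are illustrative)
    "010": "of",     # third most common
    "011": "to",     # fourth most common
    "1000": "in",    # all other codes begin with 1 and are at least 4 bits long
    "1001": "is",
}

def decode(bits, table):
    decoded, current = [], ""
    for b in bits:
        current += b
        if current in table:     # in a prefix-free code the first match is the only match
            decoded.append(table[current])
            current = ""
    return decoded

print(decode("000001011", code))   # ['the', 'and', 'to']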
In Linux, there are several programs available for performing compression and decom-
pression on text files. The two most popular are gzip and bzip2. The gzip program is
based on the LZ77 and Huffman codes. The use of LZ77 creates a dictionary and the use of
Huffman codes permits the most commonly occurring strings to be encoded in as few bits
as possible. The operation is
gz