shorter strings. The first example of such an algorithm is known as the Lempel–Ziv (LZ)
algorithm. The LZ algorithm searches a text file for strings and creates a dictionary of all
strings found. Two common forms of this algorithm are known as LZ77 and LZ78, based
on papers published by the two authors (Lempel and Ziv) in 1977 and 1978, respectively.
The two algorithms work very similarly by replacing each string in the given text file with
the location in the dictionary of the string’s entry. As not all strings are replaced, our final
compressed file will consist of literal characters and replacement locations.
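To make the dictionary-building idea concrete, below is a minimal sketch in the spirit of LZ78 (a simplification, not the exact published algorithm). The function name and the sample string are only illustrative; each output pair is the dictionary index of the longest previously seen prefix followed by the next literal character.

def lz78_compress(text):
    # Dictionary maps strings already seen to their index; index 0 is the empty string.
    dictionary = {"": 0}
    w = ""                 # the prefix we are currently extending
    output = []
    for c in text:
        if w + c in dictionary:
            w += c         # keep extending the longest known match
        else:
            output.append((dictionary[w], c))    # emit (prefix index, literal character)
            dictionary[w + c] = len(dictionary)  # remember the new string
            w = ""
    if w:                  # flush any prefix left over at the end of the text
        output.append((dictionary[w], ""))
    return output

print(lz78_compress("abababababa"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (5, '')]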
Let us assume for an example that we are going to only search for words in a text file.
All blank spaces and punctuation marks will remain in the file. Starting from the begin-
ning of the file, we take each new word found and add it to a growing dictionary. Once we
have compiled our dictionary, we start again at the beginning of the file and this time we
replace each word with its location in the dictionary. We will assume that the locations will
be integer numbers.
Imagine that we have a file of 2000 words in which only 250 are unique. Our dictionary then will consist of 250 entries. We replace each of the 2000 words in the file with a number between 1 and 250 (or 0–249 if you prefer). As each integer in this range can be stored in 1 byte, we are replacing longer words with single bytes. Our new text file might look something like the following. Remember that the dictionary stores the words in alphabetical order, so the sequence of numbers in our example looks random.
205 51 67 147 98 104 16! 205 51 88 140 221. 205 14 171 34 51 67 12 199.
Notice the replaced version of the text still retains spaces and punctuation marks. How
much of a savings does our compressed version provide?
Let us assume that the 250 words average 5 characters apiece. Stored in ASCII, the 2000 words then make up 2000 × 5 = 10,000 bytes. We add to this the spaces and punctuation marks, also stored in ASCII. With 2000 words, there will be approximately 2000 spaces and punctuation marks; let us assume in fact 2250 total spaces and punctuation marks, requiring a further 2250 bytes of storage. So our total file is some 12,250 bytes in length. The compressed version will replace each of the 2000 words with a single-byte integer. Thus, the 10,000 bytes is reduced to 2000 bytes. Adding the 2250 bytes for spaces and punctuation marks gives 4250 bytes. We are not quite done though. We also have to store the dictionary itself, which consists of 250 words with an average of 5 bytes per word, or 1250 bytes. Our compressed file is therefore 5500 bytes long, which is 5500/12,250 = 44.9% of the original, or less than half its size.
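The following sketch mirrors the toy word-dictionary scheme just described (it is not the real LZ algorithm, and it prints indices as decimal digits rather than storing them as single bytes); the arithmetic at the end simply reproduces the numbers assumed in the example above.

import re

def compress_words(text):
    # Build a dictionary of the unique words, kept in alphabetical order as in the example.
    words = re.findall(r"[A-Za-z]+", text)
    dictionary = sorted(set(words))
    index = {w: i for i, w in enumerate(dictionary)}
    # Replace every word with its dictionary position; spaces and punctuation stay literal.
    encoded = re.sub(r"[A-Za-z]+", lambda m: str(index[m.group(0)]), text)
    return encoded, dictionary

print(compress_words("the cat saw the dog."))   # ('3 0 2 3 1.', ['cat', 'dog', 'saw', 'the'])

# Reproducing the savings calculation for the hypothetical 2000-word file:
n_words, unique, avg_len, other = 2000, 250, 5, 2250
original   = n_words * avg_len + other                # 12,250 bytes
compressed = n_words * 1 + other + unique * avg_len   # 2000 + 2250 + 1250 = 5500 bytes
print(compressed / original)                          # about 0.449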
Could we do better? Yes, there are many additional tricks we can play to help reduce
the size. For instance, we did not include any blank spaces and yet nearly every word will
have a blank or punctuation mark afterward. If we were to add these to our words so that,
for instance, we are storing “The ” (with its trailing blank) rather than “The”, we would reduce the compressed file’s size even more, although we would be increasing both the number of entries in the dictionary and the length of each string in the dictionary. We could also search for strings irrespective of word boundaries. For instance, we might find the string “the”, which is found in many words like “the,” “they,” “then,” “them,” “anthem,” “lathe,” and “blithe.” More
complex algorithms will look for the most common substrings and use those rather than
full words. Punctuation marks and digits can also be incorporated into these strings.
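A quick way to see why ignoring word boundaries helps is to count how often each short substring occurs; the sketch below (using a made-up sentence) does exactly that.

from collections import Counter

def common_substrings(text, length=3, top=5):
    # Slide a window of the given length over the text, counting every substring it covers.
    counts = Counter(text[i:i + length] for i in range(len(text) - length + 1))
    return counts.most_common(top)

print(common_substrings("the theme of the anthem is the same"))
# 'the' is counted inside "theme" and "anthem" as well as in the word "the" itself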
Lossless compression algorithms trade off search time for compressed file size. The more time the algorithm can take to compress a file, the smaller the resulting file is likely to be. A user compressing or uncompressing a file, however, may not want to wait a lengthy period of time, and therefore the algorithm will have to search fewer combinations, resulting in a less compressed file.
3.8.3 Other Lossless Compression Algorithms
Aside from the LZ algorithm, another popular algorithm is Burrows–Wheeler. This algorithm first sorts rotations of the text to reorder its characters so that repeated characters tend to end up next to one another. This allows those repetitions to be encoded into shorter strings. To obtain the greatest amount of compression, the algorithm looks for the longest repeated substrings within a group of characters (one word, multiple words, a phrase, a sentence). Once rearranged, compression can take place. Compression uses a substitution code such as Huffman coding.
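As a rough illustration, here is a naive sketch of the Burrows–Wheeler transform itself (real implementations sort far more efficiently and pair the transform with a coder such as Huffman): every rotation of the text is sorted and the last column is kept, which tends to place repeated characters side by side.

def bwt(text, eof="$"):
    # Append a unique end-of-text marker so the transform can later be reversed.
    s = text + eof
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)   # last column of the sorted rotations

print(bwt("banana"))   # 'annb$aa' -- the three a's end up adjacent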
In Huffman coding, strings are replaced such that the more commonly occurring
strings have smaller codes. For instance, if the string “the” appears more times than any-
thing else, it might be given the code 110 while less commonly occurring strings might have
four, five, six, seven, eight, or even nine bits like 1000, 11100, 101011, and so forth.
In Huffman coding, we have to make sure that codes can be uniquely identified when
scanned from left to right. For instance, we could not use 00, 01, 10, 11, 000, 001, 011, 101
in our code because the sequence 000101101 would not have a unique interpretation. We
might think that 000101101 is 00 followed by 01 followed by 011 followed by 01, or it could be
interpreted as 000 followed by 101 followed by 101. A code must be interpretable without
such a conflict. We might therefore pick a better code using sequences like the following:
• Most common string: 000
• Second most common string: 001
• Third most common string: 010
• Fourth most common string: 011
• All other strings start with a 1 bit and are at least 4 bits long
In this way, 00000101100101010100001010101011101010101 could clearly be identified as
starting with 000 001 011 001 010 10100001010101011101010101. As we have not enumer-
ated the rest of the code, after the first five encoded strings, it is unclear what we have, but
the first five are clearly identifiable.
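The sketch below decodes a bit stream with a small prefix-free code shaped like the one above. Only the four 3-bit codes come from the text; the particular words and the 4-bit codes are made up so the example is self-contained.

code = {
    "000": "the",    # most common string
    "001": "and",    # second most common (the word choices are illustrative)
    "010": "of",     # third most common
    "011": "to",     # fourth most common
    "1000": "in",    # all other codes begin with 1 and are at least 4 bits long
    "1001": "is",
}

def decode(bits, table):
    decoded, current = [], ""
    for b in bits:
        current += b
        if current in table:     # in a prefix-free code the first match is the only match
            decoded.append(table[current])
            current = ""
    return decoded

print(decode("000001011", code))   # ['the', 'and', 'to']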
In Linux, there are several programs available for performing compression and decom-
pression on text files. The two most popular are gzip and bzip2. The gzip program is
based on the LZ77 and Huffman codes. The use of LZ77 creates a dictionary and the use of
Huffman codes permits the most commonly occurring strings to be encoded in as few bits
as possible. The operation is
gz