Linux with Operating System Concepts




version/kernel/drivers, where version is the Linux version installed.
3.8 FILE COMPRESSION
We wrap up this chapter with a brief examination of file compression and the tools available in Linux. We use file compression to reduce the impact of a file's size, both on available disk storage and on the time needed to transfer the file over the Internet.
3.8.1 Types of File Compression
There are two forms of compression that we use to reduce file size: lossy and lossless compression. In lossy compression, information is actually discarded from the file to reduce its size. This loss cannot be regained, so if the discarded information is important, you have degraded the file's content. We typically use lossy compression for multimedia files, where the loss is either not noticeable or is acceptable. Lossy audio compression, for instance, will often eliminate sounds that are outside or at the very edge of the range of human hearing, as such audio frequencies are not usually missed. JPEG image compression may result in a slight blurring of the image as pixels are grouped together. The JPEG format can reduce a file's size to as little as one-fifth of the original at the expense of having an image that does not match the original's quality.
There are specialized formats for video, streaming video, and human speech recordings. For video, one idea is to encode the starting frame's image and then describe, frame by frame, the changes that are taking place. The idea is that we can describe the change between frame i and frame i + 1 much more succinctly than we can describe frame i + 1 in its entirety, or even in a compressed form. Certain frames are stored as true images because, starting with the given frame, a new image/view/perspective is being presented in the video. Lossy compression of video, image, and audio files can reduce file sizes as much as 10-fold for audio and images and 100-fold for video!
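The frame-delta idea can be sketched in a few lines of Python. This is a toy model rather than any real video codec: frames are flat lists of pixel values, and a full "keyframe" plus per-frame lists of changed pixels stand in for the far more elaborate real formats.

```python
def encode(frames):
    """Store the first frame whole (a keyframe); for each later frame,
    store only the (index, value) pairs of pixels that changed."""
    encoded = [("key", list(frames[0]))]
    for prev, cur in zip(frames, frames[1:]):
        diff = [(i, p) for i, (q, p) in enumerate(zip(prev, cur)) if q != p]
        encoded.append(("delta", diff))
    return encoded

def decode(encoded):
    """Rebuild every frame by applying each delta to the previous frame."""
    frames = [list(encoded[0][1])]
    for kind, diff in encoded[1:]:
        frame = list(frames[-1])
        for i, p in diff:
            frame[i] = p
        frames.append(frame)
    return frames

# Three tiny 4-pixel "frames": only one pixel changes per frame,
# so each delta is a single pair instead of a full frame.
frames = [[0, 0, 0, 0], [0, 9, 0, 0], [0, 9, 9, 0]]
assert decode(encode(frames)) == frames
```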
Lossless compression is a file compression technique whereby the contents of the file are changed but the original file can be restored at a later time. Thus, the compressed file does not result in a loss of data. Most lossless compression algorithms revolve around the idea of searching for and exploiting redundant or repetitive data found in the original file. While this works well for text files, it can also be applied to binary files, although in many cases with less success. As many of our Linux data files are text files, we might look to compress these files to save disk space. We will also bundle together files using an archive program like tar. As we will see in Chapter 13, we will almost always compress and bundle source code to make the transfer of the numerous files quicker and easier. We look at file bundling and the tar program in Chapters 10 and 13.
3.8.2 The LZ Algorithm for Lossless Compression
There are numerous lossless compression algorithms available. Most are based on the idea 
of searching for repeated strings in the original file and then replacing those strings with 


shorter strings. The first example of such an algorithm is known as the Lempel–Ziv (LZ) algorithm, which searches a text file for strings and creates a dictionary of all strings found. Two common forms of this algorithm are known as LZ77 and LZ78, after papers published by the two authors (Lempel and Ziv) in 1977 and 1978, respectively. The two algorithms work very similarly: each string in the given text file is replaced with the location of that string's entry in the dictionary. As not all strings are replaced, the final compressed file will consist of both literal characters and replacement locations.
Let us assume for an example that we are going to search only for words in a text file. All blank spaces and punctuation marks will remain in the file. Starting from the beginning of the file, we take each new word found and add it to a growing dictionary. Once we have compiled our dictionary, we start again at the beginning of the file and this time we replace each word with its location in the dictionary. We will assume that the locations will be integer numbers.
Imagine that we have a file of 2000 words in which only 250 are unique. Our dictionary then will consist of 250 entries. We replace each of the 2000 words in the file with a number between 1 and 250 (or 0–249 if you prefer). As each integer number in this range can be stored in 1 byte, we are replacing longer words with single bytes. Our new text file might look something like the following. Remember that words are stored in the dictionary in alphabetic order, so the sequence of numbers in our example looks random.
205 51 67 147 98 104 16! 205 51 88 140 221. 205 14 171 34 51 67 12 199.
Notice the replaced version of the text still retains spaces and punctuation marks. How 
much savings does our compressed version provide for us?
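A toy version of this word-dictionary scheme is easy to write in Python. The word-splitting rule and 1-based numbering here are our own simplifications for illustration, not part of any real LZ implementation, and the sketch assumes the original text contains no digits of its own.

```python
import re

def compress(text):
    """Replace each word with its (1-based) index into an alphabetically
    sorted dictionary of the unique words; spaces and punctuation stay literal."""
    words = sorted(set(re.findall(r"[A-Za-z]+", text)))
    index = {w: i + 1 for i, w in enumerate(words)}
    encoded = re.sub(r"[A-Za-z]+", lambda m: str(index[m.group()]), text)
    return encoded, words

def decompress(encoded, words):
    """Swap every index back for its dictionary entry."""
    return re.sub(r"\d+", lambda m: words[int(m.group()) - 1], encoded)

text = "the cat saw the dog, and the dog saw the cat."
encoded, words = compress(text)
# The dictionary travels with the compressed file, so nothing is lost.
assert decompress(encoded, words) == text
```

As in the chapter's example, the encoded text keeps its spaces and punctuation, and repeated words like "the" collapse to a single recurring number.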
Let us assume that the 250 words average 5 characters apiece. Stored in ASCII, the 2000 words then take up 2000 × 5 = 10,000 bytes. We add to this the spaces and punctuation marks, also stored in ASCII. With 2000 words, there will be approximately 2000 spaces and punctuation marks. Let us assume in fact 2250 total spaces and punctuation marks, requiring a further 2250 bytes of storage. So our total file is some 12,250 bytes in length. The compressed version replaces each of the 2000 words with a single-byte integer, so the 10,000 bytes are reduced to 2000 bytes. Adding the 2250 bytes for spaces and punctuation marks gives 4250 bytes. We are not quite done though: we also have to store the dictionary itself, which consists of 250 words with an average of 5 bytes per word, or 1250 bytes. Our compressed file is therefore 5500 bytes long, which is 5500/12,250 = 44.9% of the original, or less than half its size.
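The arithmetic can be checked directly:

```python
# Reproducing the size estimate for the word-dictionary scheme.
words, unique, avg_len = 2000, 250, 5
separators = 2250  # spaces and punctuation marks, 1 byte each in ASCII

original = words * avg_len + separators                   # 12,250 bytes
compressed = words * 1 + separators + unique * avg_len    # 5,500 bytes

ratio = compressed / original
print(original, compressed, f"{ratio:.1%}")  # 12250 5500 44.9%
```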
Could we do better? Yes, there are many additional tricks we can play to help reduce the size. For instance, we did not include any blank spaces, and yet nearly every word will have a blank or punctuation mark after it. If we were to add these to our words so that, for instance, we store "The " (with its trailing space) rather than "The", we would reduce the compressed file's size even more, although we would be increasing both the number of entries in the dictionary and the length of each string in the dictionary. We could also search for strings irrespective of word boundaries. For instance, we might find the string "the", which occurs in many words such as "the," "they," "then," "them," "anthem," "lathe," and "blithe." More


complex algorithms will look for the most common substrings and use those rather than 
full words. Punctuation marks and digits can also be incorporated into these strings.
The lossless compression algorithms trade off search time against compressed file size. The more time the algorithm can take to compress a file, the smaller the resulting file is likely to be. A user compressing or uncompressing a file, however, may not want to wait a lengthy period of time, and therefore the algorithm will have to search fewer combinations, resulting in a less compressed file.
3.8.3 Other Lossless Compression Algorithms
Aside from the LZ algorithms, another popular algorithm is Burrows–Wheeler. This algorithm first sorts the strings in the text to reorder the characters so that repeated characters appear multiple times in a row. This allows those repetitions to be encoded into shorter strings. To obtain the greatest amount of compression, the algorithm looks for the longest repeated substrings within a group of characters (one word, multiple words, a phrase, a sentence). Once the text is rearranged, compression can take place using a substitution code such as Huffman coding. In Huffman coding, strings are replaced such that the more commonly occurring strings have shorter codes. For instance, if the string "the" appears more often than anything else, it might be given the code 110, while less commonly occurring strings might have four, five, six, seven, eight, or even nine bits, like 1000, 11100, 101011, and so forth.
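Huffman's algorithm itself is simple to sketch: repeatedly merge the two least frequent entries, prepending a 0 bit to the codes on one side of the merge and a 1 bit to the other, so that rarer strings end up deeper in the tree with longer codes. The word frequencies below are made up purely for illustration.

```python
import heapq

def huffman_codes(freqs):
    """Build a prefix code in which more frequent symbols get shorter codes."""
    # Heap entries: (total count, tie-breaker, {symbol: code-so-far}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)   # two least frequent groups
        n2, _, c2 = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in c1.items()}
        merged.update({sym: "1" + code for sym, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical frequencies: "the" dominates, so it gets the shortest code.
freqs = {"the": 40, "they": 10, "then": 8, "them": 5}
codes = huffman_codes(freqs)
assert min(codes, key=lambda s: len(codes[s])) == "the"
```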
In Huffman coding, we have to make sure that codes can be uniquely identified when scanned from left to right. For instance, we could not use 00, 01, 10, 11, 000, 001, 011, 101 in our code because a sequence such as 000101101 would not have a unique interpretation: it might be 00 followed by 01 followed by 011 followed by 01, or it could be interpreted as 000 followed by 101 followed by 101. A code must be interpretable without such a conflict. We might therefore pick a better code using sequences like the following:
• Most common string: 000
• Second most common string: 001
• Third most common string: 010
• Fourth most common string: 011
• All other strings start with a 1 bit and are at least 4 bits long
In this way, the sequence 00000101100101010100001010101011101010101 could clearly be identified as starting with 000 001 011 001 010, leaving 10100001010101011101010101. As we have not enumerated the rest of the code, it is unclear what follows the first five encoded strings, but those five are clearly identifiable.
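Decoding such a prefix code is a simple left-to-right scan: because no code is a prefix of another, the first match is always the right one. The table below covers only the four enumerated codes (the symbol names w1–w4 are hypothetical stand-ins for the four most common strings), so the sketch stops once it reaches bits it cannot decode.

```python
def decode_prefix(bits, code_to_symbol):
    """Scan bits left to right, emitting a symbol whenever the
    accumulated bits match a code in the table."""
    symbols, buf = [], ""
    for b in bits:
        buf += b
        if buf in code_to_symbol:
            symbols.append(code_to_symbol[buf])
            buf = ""
    return symbols, buf  # buf holds any trailing, not-yet-decodable bits

# The four most common strings get the codes 000, 001, 010, 011.
table = {"000": "w1", "001": "w2", "010": "w3", "011": "w4"}
symbols, leftover = decode_prefix("000001011001010", table)
assert symbols == ["w1", "w2", "w4", "w2", "w3"]
```

Run on the full 41-bit sequence from the text, the same scan recovers the first five strings and then stalls, since the codes beginning with a 1 bit were never enumerated.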
In Linux, there are several programs available for performing compression and decompression on text files. The two most popular are gzip and bzip2. The gzip program is based on LZ77 and Huffman codes: LZ77 creates the dictionary, and Huffman coding permits the most commonly occurring strings to be encoded in as few bits as possible. The operation is gz
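A quick way to experiment with these two compressors is through Python's standard gzip and bz2 modules, which wrap the same algorithms the command-line tools use. This sketch simply round-trips some repetitive data through both.

```python
import gzip
import bz2

# Highly repetitive data, so both lossless compressors do well on it.
data = b"the quick brown fox jumps over the lazy dog. " * 500

gz = gzip.compress(data)  # DEFLATE: LZ77 dictionary plus Huffman coding
bz = bz2.compress(data)   # Burrows-Wheeler transform plus Huffman coding

# Lossless: decompressing restores the original bytes exactly.
assert gzip.decompress(gz) == data
assert bz2.decompress(bz) == data
print(len(data), len(gz), len(bz))  # both far smaller than the original
```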
