The use of fixed-sized character sequences leaves a lot of room for improvement.
You may not use all the letters in an alphabet, or you use some letters more
than others. This is where compression comes into play. By using variable-length
character sequences, you can greatly reduce the size of a file. However, the file
also requires additional processing to turn it back into an uncompressed format
that applications understand. Compression removes space in an organized and
methodical manner; decompression adds the space back into the character strings.
When it’s possible to compress and decompress data in a manner that doesn’t
result in any data loss, you’re using lossless compression.
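The variable-length idea can be sketched with Huffman coding, one well-known lossless technique. The text above doesn't name a specific algorithm, so this choice is an assumption made for illustration, and the function names (build_code, encode, decode) and the sample string are hypothetical. Frequent letters receive short bit sequences, rare letters receive longer ones, and decoding restores the text exactly.

```python
import heapq
from collections import Counter


def build_code(text):
    """Return a {symbol: bit string} table built with Huffman's algorithm."""
    freq = Counter(text)
    # Heap entries are (weight, tiebreaker, partial code table); the unique
    # integer tiebreaker ensures the dictionaries are never compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {s: "0" for s in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def encode(text, table):
    """Compress: replace each character with its variable-length code."""
    return "".join(table[ch] for ch in text)


def decode(bits, table):
    """Decompress: no code is a prefix of another, so greedy matching works."""
    reverse = {code: sym for sym, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return "".join(out)


message = "abracadabra"
table = build_code(message)
bits = encode(message, table)
print(len(bits), "bits instead of", 8 * len(message))
print(decode(bits, table) == message)  # the round trip loses nothing
```

The code table has to travel with the compressed bits, which is part of the additional processing that decompression requires.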
The same idea behind compression applies to images and sounds, which frame
sequences of bits of a certain size in order to represent visual details or to
reproduce a second of sound using the computer’s speakers. Images and videos are
simply sequences of bits in which each bit sequence represents a pixel, one of
the small points that constitute an image. Likewise, audio is composed of
sequences of bits, each of which represents an individual sample. Audio files
store a certain number of samples per second to recreate a sound. The discussion
at http://kias.dyndns.org/comath/44.html provides more information about both
video and audio storage.
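To get a feel for how much space these fixed-size samples and pixels take before any compression, the arithmetic can be sketched as follows. The default values (44,100 samples per second, 16 bits per sample, two channels, 24 bits per pixel) are common CD-audio and true-color figures assumed for illustration; they don't come from the text, and the function names are hypothetical.

```python
def raw_audio_bytes(seconds, sample_rate=44_100, bits_per_sample=16, channels=2):
    """Size of uncompressed audio: one fixed-size sample per channel per tick."""
    return seconds * sample_rate * (bits_per_sample // 8) * channels


def raw_image_bytes(width, height, bits_per_pixel=24):
    """Size of an uncompressed image: one fixed-size bit sequence per pixel."""
    return width * height * bits_per_pixel // 8


print(raw_audio_bytes(60))          # one minute of stereo audio: 10584000 bytes
print(raw_image_bytes(1920, 1080))  # one 1920 x 1080 frame: 6220800 bytes
```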
Computers store data in many predefined formats of long sequences of bits
(commonly called bit streams). Compression algorithms can exploit the way each
format works to obtain the same result using a shorter, custom format.
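As one hedged illustration of exploiting a format's regularities, consider a row of pixels in which long runs share the same value. Run-length encoding, a simple technique chosen here as an assumption rather than taken from the text, stores each run as a (value, count) pair. The sample row and the function names are hypothetical.

```python
from itertools import groupby


def rle_encode(data):
    """Collapse each run of identical byte values into a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(data)]


def rle_decode(pairs):
    """Expand the (value, count) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for value, count in pairs)


# A hypothetical 24-pixel grayscale row: white, a short black gap, white again.
row = bytes([255] * 12 + [0] * 4 + [255] * 8)
pairs = rle_encode(row)
print(pairs)                     # [(255, 12), (0, 4), (255, 8)]
print(rle_decode(pairs) == row)  # True: the shorter form is still lossless
```

The approach pays off only when the data really contains long runs; on data without them, the pairs can take more room than the original.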
You can compress data that represents images and sounds further by eliminating
details that you can’t perceive. Humans have both visual and aural limits, so they
aren’t likely to notice the loss of detail imposed by compressing the data in
specific ways. You may have heard of MP3 compression, which allows you to store
entire collections of CDs on your computer or on a portable player. The MP3 file
format simplifies the original, cumbersome WAV format used by computers. WAV
files contain all the sound waves received by the computer, but MP3 saves space
by removing and compacting waves that you can’t hear. (For more about MP3, see
the article at
http://arstechnica.com/features/2007/10/the-audiofile-understanding-mp3-compression/.)
Removing details from data creates lossy compression. JPEG, DjVu, MPEG, MP3,
and WMA are all lossy compression formats, each specialized in a particular kind
of media data (images, video, sound), and there are many others. Lossy
compression is fine for data meant for human consumption; however, after removing
the details, you can’t recover the original data. Thus, you can compress a
digital photo heavily and still represent it in a useful way on a computer’s
screen. Yet when you print the compressed photo on paper, you may notice that the
quality, though acceptable, is not as good as the original picture. A display
typically provides output at 96 dots per inch (dpi), but a printer typically
provides output at 300 to 1200 dpi (or higher). The effects of lossy compression
become obvious because a printer is able to render them in a manner that humans
can see.
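Why the original can't be recovered after lossy compression can be sketched with a toy example. The code below drops the low-order bits of each 8-bit sample, a simple uniform quantization. This is an assumption made for illustration and is not how JPEG or MP3 actually work; the function names and sample values are hypothetical.

```python
def quantize(samples, keep_bits=4, total_bits=8):
    """Lossy step: keep only the most significant bits of each sample."""
    shift = total_bits - keep_bits
    return [s >> shift for s in samples]


def restore(codes, keep_bits=4, total_bits=8):
    """Best-effort reconstruction: the discarded low bits come back as zeros."""
    shift = total_bits - keep_bits
    return [c << shift for c in codes]


samples = [0, 17, 34, 130, 131, 255]
codes = quantize(samples)   # each value now needs 4 bits instead of 8
restored = restore(codes)
print(restored)             # [0, 16, 32, 128, 128, 240]
print(restored == samples)  # False: 130 and 131 both became 128
```

Because different inputs collapse to the same stored value, no decompression step can tell them apart afterward. The result is close enough for a viewer or listener, but the fine detail is gone for good.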