Compressing Data
knows data only as bits because it has only circuitry to store bits. However, from
a higher point of view, computer software can interpret bits as letters, ideograms,
pictures, films, and sounds, which is where encoding comes into play.
Encoding uses a sequence of bits to represent something other than the number
expressed by the sequence itself. For instance, you can represent a letter using a
particular sequence of bits. Computer software commonly represents the letter A
using the number 65, or binary 01000001, when working with the American
Standard Code for Information Interchange (ASCII) encoding standard. You can see
the sequences used by the ASCII system at http://www.asciitable.com/. ASCII uses
just 7 bits for its encoding (8 bits, or a byte, in the extended version), which means
that you can represent 128 different characters (the extended version has
256 characters). Python can display the string “Hello World” as a sequence of
bits, using one 8-bit byte per character:
# Format each character's numeric code as an 8-bit binary string and join them
print(''.join('{0:08b}'.format(ord(ch))
              for ch in "Hello World"))
0100100001100101011011000110110001101111001000000101011101
101111011100100110110001100100
When using extended ASCII, a computer knows that a sequence of exactly 8 bits
represents a character. It can separate the bit sequence into 8-bit bytes and, using a
conversion table called a symbolic table, turn these bytes into characters.
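The reverse step can be sketched in Python as well. This is a minimal illustration, assuming every character occupies exactly 8 bits; the helper name bits_to_text is hypothetical, and Python's built-in chr() stands in for the conversion table:

```python
def bits_to_text(bits: str) -> str:
    """Split a bit string into 8-bit groups and map each group to a character.

    Hypothetical helper for illustration; assumes exactly 8 bits per character.
    """
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    # Cut the sequence into consecutive 8-bit bytes
    chunks = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    # int(chunk, 2) reads the byte as a number; chr() looks up its character
    return ''.join(chr(int(chunk, 2)) for chunk in chunks)

# Encode a string to bits, then decode it again
bits = ''.join('{0:08b}'.format(ord(ch)) for ch in "Hello World")
print(bits_to_text(bits))
```

Because each character has the same width, the decoder needs no separators: it only has to count off 8 bits at a time.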
ASCII encoding can represent the standard Western alphabet, but it doesn’t
support the variety of accented European characters or the richness of non-European
writing systems, such as the ideograms used by the Chinese and Japanese languages.
Chances are that you’re using a robust encoding system such as UTF-8 or another
form of Unicode encoding (see http://unicode.org/ for more information).
In Python 3, strings are Unicode by default, and UTF-8 is the default encoding.
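You can observe this limitation directly in Python 3. The following is a small sketch, assuming the sample word "café" (chosen here only for illustration); it tries to encode an accented character as ASCII and then as UTF-8:

```python
# Sample text with one accented character (\u00e9 is the letter e with an acute accent)
text = "caf\u00e9"

# ASCII has no code for the accented letter, so encoding fails
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent it, and decoding restores the original string
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(ascii_ok)
print(utf8_bytes)
print(round_trip == text)
```

The accented letter takes two bytes in UTF-8, whereas the three unaccented letters take one byte each.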
Using a more complex encoding system requires longer sequences than
those required by ASCII. Depending on the encoding you choose, defining a
character may require up to 4 bytes (32 bits). When representing textual
information, a computer creates long bit sequences. It decodes each letter easily
when the encoding uses fixed-length sequences throughout a file. Other encoding
strategies, such as Unicode Transformation Format 8 (UTF-8), use a variable
number of bytes per character (1 to 4 in this case). You can read more about how
UTF-8 works at http://www.fileformat.info/info/unicode/utf8.htm.
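To see the variable width in practice, you can encode a few characters and count the resulting bytes. This is a short sketch; the four sample characters are arbitrary choices made for illustration, one for each possible UTF-8 length:

```python
# One sample character per UTF-8 width (arbitrary picks for illustration):
# Latin letter A, e with acute accent, euro sign, and an emoji
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

widths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    widths.append(len(encoded))
    # Show the code point, the byte count, and the bytes in hexadecimal
    hex_bytes = ' '.join('{0:02x}'.format(b) for b in encoded)
    print('U+{0:04X}: {1} byte(s): {2}'.format(ord(ch), len(encoded), hex_bytes))
```

Plain ASCII characters still occupy a single byte in UTF-8, so text that uses only the basic Western alphabet takes no more space than it would in ASCII.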