part. Such numbers have a fixed number of digits to the right of the point.
For example, the decimal number 12.345 has three digits in its fraction part. Its value will not
change if the format is changed and the fraction part increases. If the format is changed to
XXXX.YYYY (where X represents the integer digits and Y represents the fraction digits), the written
form of the number will change, but its value remains the same: 12.345 = 0012.3450.
These rules are relevant to any numbering system and not only to decimal numbers. The decimal
number 12.25, when converted to a binary number, is $1100.01_2$. In this case, changing the fixed-point
format will have no effect on the value, assuming of course that the format provides the necessary
space for all digits.
As such, the equation below is correct:
$$1100.01_2 = 00001100.0100_2$$
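The padding rule can be checked numerically. The following is a minimal sketch, not part of the original text: the helper name `fixed_point_value` is an assumption introduced for illustration, and exact fractions are used so that no rounding interferes with the comparison.

```python
from fractions import Fraction


def fixed_point_value(digits: str, base: int = 2) -> Fraction:
    """Return the exact value of a fixed-point string such as '1100.01'.

    Hypothetical helper: digits left of the point are weighted by
    non-negative powers of the base, digits right of it by negative powers.
    """
    whole, _, frac = digits.partition(".")
    value = Fraction(int(whole, base)) if whole else Fraction(0)
    for position, digit in enumerate(frac, start=1):
        value += Fraction(int(digit, base), base ** position)
    return value


# Padding with zeroes on either side changes the written form, not the value.
print(fixed_point_value("1100.01"))        # 49/4, which is 12.25
print(fixed_point_value("00001100.0100"))  # 49/4
print(fixed_point_value("12.345", base=10) == fixed_point_value("0012.3450", base=10))  # True
```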
However, using fixed-point numbers has some inherent limitations, especially considering the
limited nature of the numbers in computers due to the fixed number of bits. An explicit limitation
becomes evident when there is a need to express very large or very small (close to zero) values. For
that reason, there is another format for representing numbers that is based on the concept of
floating point. Most computers use floating point as the dominant mechanism for representing real
numbers, while integers are represented using the standard binary numbering system. To
understand the concept of floating point and how it is used in computers, we will start with a brief
discussion of scientific notation.
Scientific Notation
When we need to write very large numbers (or, alternatively, very small numbers), there are some
techniques for eliminating or reducing human error. These errors arise mainly when the
numbers include many digits, especially repeating digits. One useful technique is to
use thousands separators, so the human eye can easily keep its place even if there are many identical
digits. These separators do not change the value and are intended only for clarity. Another example,
which has been used already in this book, is to add spaces between groups of binary digits. It
should be noted that with very large binary numbers, readability is a major problem due to the
length of the numbers and the fact that there are only two digits that repeat.
For example, the decimal number 498000000000, which includes many similar digits that repeat,
becomes more readable and clear if it is written as 498,000,000,000. However, it is even more readable
if it is written as $4.98 \times 10^{11}$.
This type of writing is referred to as scientific notation and is relevant to very small numbers as
well, such as $1.2 \times 10^{-27}$ or $3.45 \times 10^{-39}$.
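Both readability techniques are available in most programming languages. As a small illustrative sketch (the helper names `with_separators` and `to_scientific` are assumptions, not taken from the text), Python's format specifications can produce both forms:

```python
def with_separators(value: int) -> str:
    """Insert thousands separators; the value itself is unchanged."""
    return f"{value:,}"


def to_scientific(value: float, fraction_digits: int = 2) -> str:
    """Format a value in scientific notation (Python writes 10^11 as e+11)."""
    return f"{value:.{fraction_digits}e}"


print(with_separators(498000000000))  # 498,000,000,000
print(to_scientific(498000000000))    # 4.98e+11
print(to_scientific(1.2e-27, 1))      # 1.2e-27
```

The `e+11` suffix is simply the programming-language spelling of "times ten to the power of 11".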
The scientific notation format consists of five components:
1. The number’s sign
2. The number magnitude (sometimes referred to as significand or mantissa), which represents
the significant digits of the number. Usually, the significand has to be normalized, which means
that there is one and only one significant digit to the left of the fraction point.
3. The base of the number
4. The exponent’s sign
5. The value (magnitude) of the exponent
Figure 2.3 is a visual representation of the scientific notation, including its components. The
number used in this figure is $-9.056 \times 10^{-23}$.
FIGURE 2.3
Scientific notation’s components.
When the value of the mantissa consists of one and only one digit to the left of the fraction point,
the number is considered to be normalized. Every number has only one normalized format but may
have an infinite number of nonnormalized versions.
For example, if we want to represent the fraction 1/1,000,000,000, then the normalized form will be
$1.0 \times 10^{-9}$. Nonnormalized forms are $0.1 \times 10^{-8}$, $10.0 \times 10^{-10}$,
$100.0 \times 10^{-11}$, and many others.
It is easy to change nonnormalized numbers into normalized ones; this is done by moving
the fraction point and then changing the exponent accordingly. Because
the point’s location is not fixed, this format is called floating point. It should be noted, however, that
the value of the number does not change when the point moves:
$$1.0 \times 10^{-9} = 0.1 \times 10^{-8} = 10.0 \times 10^{-10} = 100.0 \times 10^{-11}$$
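The normalization procedure can be sketched in a few lines. This is an illustrative sketch only: the function name `normalize` is an assumption, and Python's `decimal` module is used so that the decimal digits are handled exactly.

```python
from decimal import Decimal


def normalize(mantissa: str, exponent: int) -> tuple:
    """Move the point until exactly one nonzero digit is left of it.

    Each shift of the point by one place is compensated by adjusting
    the exponent by one, so the represented value never changes.
    """
    m = Decimal(mantissa)
    if m == 0:
        return m, 0  # zero has no normalized form; returned as is
    while abs(m) >= 10:   # point too far right: shift it left
        m /= 10
        exponent += 1
    while abs(m) < 1:     # point too far left: shift it right
        m *= 10
        exponent -= 1
    return m, exponent


# All three nonnormalized forms collapse to the same normalized pair.
for mantissa, exponent in [("0.1", -8), ("10.0", -10), ("100.0", -11)]:
    print(normalize(mantissa, exponent))
```

Each call yields a mantissa equal to 1 with the exponent −9, matching the single normalized form of the number.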
Scientific notation does not have to be decimal, and it can be implemented using other bases as
well. Specifically, it can be used for binary numbers, as is demonstrated in Figure 2.4.
FIGURE 2.4
Binary scientific notation.
Binary scientific notation must, of course, obey the binary rules, and the digits represented in the
mantissa have to be binary digits (zero and one).
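To see binary scientific notation in practice, the sketch below uses Python's standard `math.frexp`, which splits a value into a fraction and a power of two. The wrapper name `binary_scientific` is an assumption; it only rescales the result so that one significant digit sits to the left of the point.

```python
import math


def binary_scientific(value: float) -> tuple:
    """Return (significand, exponent) with 1 <= |significand| < 2."""
    if value == 0:
        return 0.0, 0
    m, e = math.frexp(value)  # value == m * 2**e, with 0.5 <= |m| < 1
    return m * 2, e - 1       # shift the point one place to the right


print(binary_scientific(12.25))  # (1.53125, 3)
print(binary_scientific(-0.75))  # (-1.5, -1)
```

The first result corresponds to 12.25 being 1.10001 in binary multiplied by two to the power of 3, since binary 1.10001 equals 1.53125.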
During the first decades of computer history, several formats for binary floating point were
developed. Different formats and implementations provided a higher degree of accuracy (if the
mantissa had more bits) or support for a larger range of numbers (if the exponent
had more bits). However, the variety of floating-point formats, which may have provided some marketing
advantages to a specific vendor, became a major deficiency. Since the days of standalone computers
are gone and in the last three decades most computers have become part of a network, connectivity
has become important. Computers have to be able to integrate easily with other systems, and
different floating-point formats hamper this connected environment. As a result,
it became clear that a common standard for floating point was needed. After several years
of discussions, the ANSI/IEEE Standard 754 was established in 1985, and since then it has been
adopted by most hardware manufacturers. The standard defines the format for 32 bits and 64 bits
and also covers additional formats for different numbers of bits (16, 128).
THE 754 STANDARD
The ANSI/IEEE standard 754 defines a common format for representing floating-point numbers
across many hardware platforms. The standard defines the number of bits allocated for each one of
the scientific notation components as well as the definition of these components. For example, for a
32-bit word, the standard defines it thus:
1. The number’s sign is the leftmost bit and, when it is on, it means that this is a negative number.
2. The number (value or mantissa) is expressed using the rightmost 23 bits of the binary number.
These bits represent the fraction of the normalized binary number. Since a normalized number
contains only one significant digit to the left of the fraction point, then when dealing with binary
numbers, that digit is always one. For that reason, the format used by the 754 standard
does not store this leading digit as part of the number. One may
say that the standard’s designers did not want to waste a bit on information that is obvious. For
that reason, the 23 bits represent only the fraction (mantissa) and not the whole number. The
processor that receives the floating-point number “knows” the standard and automatically
adds the missing significant digit before the required calculations. After the calculations are
over and the result is normalized, the processor strips the significant digit so the number
follows the standard.
3. Base is another scientific notation component that is not represented in the standard. As with
the significant digit that is omitted, since all numbers are binary numbers, the standard’s
designers did not want to waste bits on an obvious piece of information. Although base is one
of the scientific notation components, it does not exist in the 754 format.
4. The exponent’s sign is another component that exists in scientific notation; however, the
standard did not assign bits for it, and it is part of the exponent value. As with previous cases,
the standard tries to be as efficient as possible and not to waste bits on known or duplicate
information. The idea is to use these bits for increasing the accuracy of the numbers represented
by allocating more bits for the mantissa.
5. The exponent is defined by 8 bits, which are located between the sign bit (on the left) and
the 23 mantissa bits (on the right). This 8-bit number is an unsigned binary value, but the real
exponent is biased by 127; that is, the value 127 is added to the real exponent, and
the sum is what is stored in the format.
For example, if the real exponent is 3, then the number that will be used as part of the 754
standard will be $127 + 3 = 130_{10} = 82_{16} = 1000\,0010_2$.
Similarly, if the exponent is −4, then the 754 exponent will be $0111\,1011_2$; this was obtained
by $-4 + 127 = 123_{10} = \mathrm{7B}_{16} = 0111\,1011_2$.
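The biasing arithmetic can be reproduced with a short sketch. The names `BIAS_32` and `stored_exponent` are assumptions made for this illustration; the bias of 127 and the 8-bit field width come from the text.

```python
BIAS_32 = 127  # bias of the 32-bit format


def stored_exponent(real_exponent: int, bias: int = BIAS_32) -> str:
    """Return the 8-bit exponent field for a given real exponent."""
    stored = real_exponent + bias
    if not 0 <= stored <= 255:
        raise ValueError("exponent does not fit in 8 bits")
    return f"{stored:08b}"


print(stored_exponent(3))   # 10000010  (130 decimal, 82 hexadecimal)
print(stored_exponent(-4))  # 01111011  (123 decimal, 7B hexadecimal)
```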
This means that although scientific notation is based on five components (see “Scientific Notation”
above), the implementation of the 754 standard is based on only three of these components, as is
shown by Figure 2.5.
Figure 2.6 provides a further explanation of the standard and how the various fields are
implemented.
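The three stored fields can be inspected directly. The sketch below is an illustration under stated assumptions: the function names `fields32` and `rebuild32` are hypothetical, Python's standard `struct` module is used to obtain the 32-bit pattern, and the rebuilding formula applies only to normalized numbers (stored exponents 1 to 254).

```python
import struct


def fields32(value: float) -> tuple:
    """Split a value, stored as a 32-bit float, into (sign, exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # leftmost bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, still biased by 127
    fraction = bits & 0x7FFFFF       # rightmost 23 bits, leading one omitted
    return sign, exponent, fraction


def rebuild32(sign: int, exponent: int, fraction: int) -> float:
    """Restore the value: re-add the hidden leading one and remove the bias."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)


print(fields32(12.25))              # (0, 130, 4456448)
print(rebuild32(*fields32(12.25)))  # 12.25
```

The fraction 4456448 is 100 0100 0000 0000 0000 0000 in binary, that is, the digits to the right of the point in the normalized binary form 1.10001, padded with zeroes.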
The standard was not defined for 32 bits only, and it can be applied to other word sizes as well.
For example, in the 64-bits-per-word format, the mantissa uses 52 bits, and 11 bits are allocated
for the exponent (see Figure 2.7). This means that the 64-bit numbers are more accurate, since there
are many more bits to represent the number; in addition, increasing the size of the exponent
provides an increased range of numbers.
FIGURE 2.5
The 754 standard.
FIGURE 2.6
The 754 standard formula.
FIGURE 2.7
The 754 standard for 64-bit words.
Due to the larger number of bits allocated for the exponent, the bias was also changed.
In the 64-bits-per-word standard, the bias is 1023 (instead of the 127 that is used with 32-bit words).
To make it applicable to other word sizes, this bias was defined in a general form as well:
$$\text{bias} = 2^{k-1} - 1$$
where:
$k$ is the number of bits allocated for the exponent
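The two bias values quoted for the 32-bit and 64-bit formats (127 and 1023) both equal two raised to the exponent width minus one, less one. A minimal sketch of that general rule, with the function name `exponent_bias` being an assumption for illustration:

```python
def exponent_bias(exponent_bits: int) -> int:
    """Bias for an exponent field of the given width: 2**(k - 1) - 1."""
    return 2 ** (exponent_bits - 1) - 1


print(exponent_bias(8))   # 127, the 32-bit format
print(exponent_bias(11))  # 1023, the 64-bit format
```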
Range of Floating-Point Numbers
Unlike the circular nature of the integers (or natural numbers) represented in a computer (see the
section “Range of Numbers” in this chapter), the range of the floating-point numbers is significantly
different. Of the seven possible segments of numbers available in the range, floating-point numbers
can access only three. For a better explanation of the ranges and the segments, we will use a virtual
number system and a virtual floating-point format.
Assuming we have a six-digit decimal number, the applicable format will be as follows:
1. One digit represents the number’s sign (0 means positive and 1 means negative, while all other
values are invalid).
2. One digit is the significant part of the normalized number.
3. Two digits are allocated for the mantissa.
4. Two digits are allocated for the exponent with a bias of 49.
This means that the format of this system is
$$(-1)^{S} \times D.MM \times 10^{e-49}$$
where:
S is the sign
D.MM is the normalized number (one significant digit and two digits for the mantissa)
e is the exponent
Using this imaginary system, a number like 125 will be written as 012551. The steps that are
required to build this format are
1. Normalize the original number: $125 = 1.25 \times 10^{2}$
2. Build the format’s exponent by adding the bias (49) to the real exponent: 2 + 49 = 51
3. Set the sign to positive (S = 0)
The number −0.000571 will be written as 157145.
1. Normalize the original number: $0.000571 = 5.71 \times 10^{-4}$
2. Build the format’s exponent by adding the bias (49) to the real exponent: −4 + 49 = 45
3. Set the sign to negative (S = 1)
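The three steps can be collected into a small encoder for this imaginary six-digit format. This is a sketch under stated assumptions: the name `encode_toy` is hypothetical, extra digits are truncated rather than rounded, and zero is rejected because the text does not define how it is stored.

```python
from decimal import Decimal, ROUND_DOWN

BIAS = 49  # bias of the imaginary decimal format


def encode_toy(value: str) -> str:
    """Encode a decimal string as S D MM ee (sign, digit, mantissa, exponent)."""
    number = Decimal(value)
    if number == 0:
        raise ValueError("zero encoding is not defined in this sketch")
    sign = "1" if number < 0 else "0"
    magnitude = abs(number)
    real_exponent = magnitude.adjusted()        # power of ten of the first digit
    normalized = magnitude.scaleb(-real_exponent)  # one digit left of the point
    digits = str(normalized.quantize(Decimal("0.01"), rounding=ROUND_DOWN))
    stored = real_exponent + BIAS
    if not 0 <= stored <= 99:
        raise ValueError("exponent does not fit in two digits")
    return sign + digits.replace(".", "") + f"{stored:02d}"


print(encode_toy("125"))        # 012551
print(encode_toy("-0.000571"))  # 157145
```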
Due to the inherent limitations of computer systems and the fact that words have a finite
number of bits, the floating-point representation is limited and cannot represent all of the
infinitely many numbers in a specific range.
For a better explanation, we will use an imaginary system that uses decimal numbers. The
exponent is based on two decimal digits, and the normalized significant part consists of three digits.
Figure 2.8 provides a visual explanation of the floating-point representation’s limitations.
From the seven segments shown on the number line, only three are accessible by the floating-point
representation. The two overflow segments represent numbers that are beyond the possible
representations. For example, since the maximum number that can be represented is
$9.99 \times 10^{49}$, $10^{52}$ is bigger than the maximum possible representation and thus cannot
be expressed by this imaginary system. Using $10^{52}$ will cause a positive overflow. Similarly, since
the most negative number is $-9.99 \times 10^{49}$, $-10^{52}$ cannot be represented either, and it
will cause a negative overflow. There are two additional ranges (or segments) that are inaccessible:
the underflow segments, which contain numbers so close to zero that they are too small to be
represented, for example, nonzero numbers with an absolute value smaller than $10^{-52}$.
FIGURE 2.8
Definition range example.
The only three accessible segments shown in the figure are
• Positive numbers in the range $1.00 \times 10^{-50} \le X \le 9.99 \times 10^{49}$
• Negative numbers in the range $-9.99 \times 10^{49} \le X \le -1.00 \times 10^{-50}$
• Zero
It should be noted that when the exponent has only two digits, the total number of exponent values
that can be represented is 100 ($10^2$). One of these values is used for an exponent of zero, which
implies that the two sides, negative and positive, have different sizes. Usually the standard will
define how the exponent is calculated, but in this specific example, the assumption was that the
negative side has one additional value. This is the reason it gets to $10^{-50}$ while the positive
side gets only to $10^{49}$.
The 754 standard, like the imaginary model described previously, provides access to only three of
the segments of the number line. The limits of these segments are different from the limits of the
model that was previously described and are based on the specific word size. The 32-bit standard
defines 8 bits for the exponent. This means that the exponent field can hold 256 ($2^8$) different
values; however, only 254 of these values are used to represent floating-point numbers, while two
values (0 and 255) are reserved for special cases.
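The inaccessible segments can be observed with real 32-bit numbers. The sketch below is an illustration, not part of the original text: the helper names are assumptions, and it relies on the behavior of Python's standard `struct` module, which raises `OverflowError` when a value is too large for the 32-bit format. The largest 32-bit value is roughly 3.4 × 10^38, a figure not stated in the text.

```python
import struct


def fits_in_float32(value: float) -> bool:
    """Return False when the value lies in an overflow segment."""
    try:
        struct.pack(">f", value)
        return True
    except OverflowError:
        return False


def round_trip32(value: float) -> float:
    """Store a value as a 32-bit float and read it back."""
    return struct.unpack(">f", struct.pack(">f", value))[0]


print(fits_in_float32(3.0e38))   # True: inside the accessible segment
print(fits_in_float32(1.0e39))   # False: positive overflow
print(fits_in_float32(-1.0e39))  # False: negative overflow
print(round_trip32(1.0e-50))     # 0.0: too close to zero to be represented
```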
Special Numbers
Due to the limitations inherent in the floating-point mechanism, the exponent values used in
a 32-bit representation are 1–254. Due to the bias used in the formula, the possible real exponents
are between −126 and 127 (−126 ≤ e ≤ 127). When the exponent field is all zeroes or all ones, this
denotes a special number:
1. A number with a zero exponent and a zero mantissa represents zero. This is a special number
that does not follow the normalized 754 format, since the implied leading one would otherwise make
an exact zero impossible to represent.
2. A number with a zero exponent and a nonzero mantissa is a denormal (a number that is not a
normal one). This type is used for representing numbers that are very close to zero and would
otherwise fall in the underflow segment. The sign bit (the leftmost bit) defines it as positive or
negative.
3. A number with an exponent where all bits are set (exponent value of 255) and a mantissa that is
all zeroes is used to represent infinity. The sign is used to determine if it is −∞ or +∞.
4. A number that contains an exponent with all bits set and a mantissa that is not zero is used to
define a NaN (not a number). Trying to calculate the square root of a negative number, for
example, will produce a NaN. This is a special value used by the computer hardware in order to
save time on unnecessary checks. Since almost any arithmetic operation on a NaN produces a NaN, it
is faster to check if one operand is a NaN, in which case there is no need to carry out the
calculation since the result will be a NaN as well.
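The four special cases depend only on the exponent and mantissa fields, so they can be told apart with two comparisons each. A minimal sketch, assuming a 32-bit pattern supplied as an integer (the function name `classify32` is hypothetical):

```python
def classify32(bits: int) -> str:
    """Classify a 32-bit pattern by its exponent and mantissa fields."""
    exponent = (bits >> 23) & 0xFF  # 8-bit exponent field
    fraction = bits & 0x7FFFFF      # 23-bit mantissa field
    if exponent == 0:
        return "zero" if fraction == 0 else "denormal"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normal"


print(classify32(0x00000000))  # zero
print(classify32(0x00000001))  # denormal
print(classify32(0xFF800000))  # infinity (sign bit set, so negative infinity)
print(classify32(0x7FC00000))  # NaN
print(classify32(0x3F800000))  # normal
```

The sign bit is deliberately ignored here, since it only selects between the positive and negative variant of each case.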
Converting 754 Numbers
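Converting decimal numbers to 754 floating-point numbers is done using the formula
$$N = (-1)^{S} \times 1.M \times 2^{E-127}$$
where S is the sign bit, M is the 23-bit fraction, and E is the stored (biased) exponent.
The conversion is a straightforward technical process that includes several structured steps: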
1. Converting the decimal number to a binary fixed-point number.
2. Normalizing the binary number.
3. Calculating the floating-point exponent by adding the exponent of the normalized number to
the bias. Assuming this is a 32-bit number, the bias is 127.
4. Converting the calculated exponent to a binary number.
5. Defining the sign of the floating-point number—this is the sign of the original decimal number.
6. Constructing the floating-point number by integrating all three fields. This involves starting
with the sign (the leftmost bit; step 5 above), then concatenating the 8 bits that represent
the exponent (calculated in step 4 above), and then concatenating the 23 bits that represent only
the fraction of the normalized number (step 2 above). If the fraction contains fewer than 23 bits,
zeroes will be appended on the right side.
7. For clarity, the last step is to convert the binary number into a hexadecimal number. This
reduces the number of digits from 32 bits to 8 hexadecimal digits. The conversion from binary
to hexadecimal is done by replacing each group of 4 bits with the corresponding hexadecimal digit
(see Table 2.3 for assistance).
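The steps above can be sketched as a short routine. This is an illustration under stated assumptions rather than a complete implementation: the name `to_ieee754_hex` is hypothetical, only normalized numbers are handled (zero, infinities, NaN, and denormals are rejected), and fraction bits beyond the 23rd are truncated, whereas real hardware rounds.

```python
import math


def to_ieee754_hex(value: float) -> str:
    """Convert a value to its 32-bit floating-point pattern, as 8 hex digits."""
    if value == 0 or not math.isfinite(value):
        raise ValueError("only nonzero finite values are handled in this sketch")
    sign = 1 if value < 0 else 0               # step 5: sign of the original
    m, e = math.frexp(abs(value))              # abs(value) == m * 2**e
    significand, exponent = m * 2, e - 1       # step 2: normalize to 1.f
    stored = exponent + 127                    # step 3: add the bias
    if not 1 <= stored <= 254:
        raise ValueError("exponent outside the normalized range")
    fraction = int((significand - 1) * 2**23)  # drop the leading one, keep 23 bits
    bits = (sign << 31) | (stored << 23) | fraction  # step 6: concatenate fields
    return f"{bits:08X}"                       # step 7: 8 hexadecimal digits


print(to_ieee754_hex(-0.75))  # BF400000
print(to_ieee754_hex(12.25))  # 41440000
```

For values that are exactly representable, the result can be cross-checked against the standard library with `struct.pack(">f", value).hex()`.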
For example, let us assume that we have to convert the decimal number −0.75 into a floating-point
number.
1. First we will convert the decimal number to a binary number:
$$0.75_{10} = 0.11_2$$
2. Then we have to normalize the binary number:
$$0.11_2 = 1.1_2 \times 2^{-1}$$
3. From the normalized exponent, we will calculate the exponent of the 754 floating-point number
by adding 127:
$$-1 + 127 = 126$$
4. The calculated exponent is a decimal number, so it has to be converted to a binary number:
$$126_{10} = 0111\,1110_2$$
5. The sign is then defined according to the sign of the original decimal number. In this case, the
number is negative, so
$$S = 1$$
6. With all parts defined, it is time to construct the floating-point number by integrating all the