Notes from “What Every Computer Scientist Should Know About Floating Point”
In floating point, a number is represented as ±d.ddd...d × β^e, where d.ddd...d is called the significand (or mantissa, an antiquated term), β is the base, and e is the exponent. The number of digits in the significand, p, is the precision.
Not all real numbers are exactly representable by a floating point number. For example, with β=2, there is no exact representation for 0.1.
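As a quick check of this claim, the sketch below inspects what is actually stored for the literal 0.1. It assumes Python floats are IEEE 754 double precision (β=2), which holds on common platforms; the variable names are only illustrative.

```python
from decimal import Decimal
from fractions import Fraction

# Exact value of the binary floating point number nearest to 0.1
stored_as_decimal = Decimal(0.1)
stored_as_fraction = Fraction(0.1)

# The stored value is a ratio with a power-of-two denominator, so it cannot equal 1/10
denominator = stored_as_fraction.denominator
is_power_of_two = denominator & (denominator - 1) == 0

print(stored_as_decimal)                      # a long decimal expansion close to, but not equal to, 0.1
print(stored_as_fraction == Fraction(1, 10))  # False
print(is_power_of_two)                        # True
```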
Normalization and the Hidden Bit
A number may have multiple floating point representations. For example, with p = 2, the value β^-1 (0.1 when β = 10) can be written either as 0.1 × β^0 or as 1.0 × β^-1. A floating point representation is said to be normalized if its leading digit d0 is nonzero; for β = 2 this means d0 = 1. If we work only with normalized binary representations, we can therefore elide the explicit d0 and use p-1 stored digits to represent a p-precision number. The implicit d0 in this case is called the implicit bit or hidden bit. The first use of this idea is attributed to Konrad Zuse.
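One way to see the hidden bit at work is to check how many significant bits survive a round trip through single precision, which stores only 23 fraction bits. This is a minimal sketch using the standard `struct` module; `to_float32` is a helper name made up for the illustration.

```python
import struct

def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

STORED_FRACTION_BITS = 23  # bits physically stored for the significand

# 2**24 - 1 needs 24 significant bits, yet it survives the round trip:
# 23 stored bits plus the hidden leading 1 give p = 24.
fits = to_float32(float(2**24 - 1)) == float(2**24 - 1)

# 2**24 + 1 needs 25 significant bits, so it cannot be stored exactly.
loses_a_bit = to_float32(float(2**24 + 1)) != float(2**24 + 1)

print(fits, loses_a_bit)
```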
As concrete examples, here are the formats defined by the IEEE 754 floating point standard:
[Single Precision Extended]
[Double Precision Extended]
The particular ordering of the fields in the floating point format words stems from a desire by the IEEE 754 committee to enable fast operations, such as searching and sorting, on floating point numbers using integer arithmetic. Thus, when the bit patterns are interpreted as integer values, larger (non-negative) floating point numbers should also compare as larger integers.
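The ordering property can be checked directly by reinterpreting single precision bit patterns as unsigned integers. The sketch below is only an illustration with an arbitrary list of test values; `bits32` is a hypothetical helper. Note that it holds for non-negative values only, because the format uses a sign bit rather than two's complement.

```python
import struct

def bits32(x):
    """Return the IEEE 754 single precision bit pattern of x as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

values = [2.0, 0.0, 1e10, 0.5, 1.5, 1e-3, 1.0]

# Sorting by integer bit pattern gives the same order as sorting by value
same_order = sorted(values, key=bits32) == sorted(values)

# With the sign bit set, the unsigned pattern of -1.0 is larger than that of 1.0
negative_breaks_it = bits32(-1.0) > bits32(1.0)

print(same_order, negative_breaks_it)
```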
For the same reason, the exponent is stored in biased notation, with a bias of 127 for single precision and a bias of 1023 for double precision. This way negative exponents have a smaller value, when interpreted as integers, than positive exponents. An exponent of -1 is stored as -1 + 127 = 126 = 01111110, whereas an exponent of 1 is stored as 1 + 127 = 128 = 10000000.
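The two bit patterns above can be reproduced with a few lines of arithmetic and compared against what a real single precision word contains. This is a sketch; `biased_exponent_bits` and `stored_exponent_field` are names invented for the example.

```python
import struct

BIAS_SINGLE = 127

def biased_exponent_bits(exponent, bias=BIAS_SINGLE, width=8):
    """Encode a true exponent as a biased, fixed-width binary string."""
    return format(exponent + bias, f"0{width}b")

def stored_exponent_field(x):
    """Extract the 8-bit exponent field from x stored in single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return format((bits >> 23) & 0xFF, "08b")

print(biased_exponent_bits(-1))    # 01111110
print(biased_exponent_bits(1))     # 10000000
print(stored_exponent_field(0.5))  # 0.5 = 1.0 x 2^-1, so the field matches the first line
print(stored_exponent_field(2.0))  # 2.0 = 1.0 x 2^1, so the field matches the second line
```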
The value of a particular floating point representation is therefore (-1)^s × (1 + significand) × 2^(Exponent - Bias), where the 1 added to the significand accounts for the hidden bit.
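To make the formula concrete, the sketch below splits a single precision word into its three fields and rebuilds the value from them. It assumes a normalized number (exponent field neither all zeros nor all ones); the function names are illustrative, not part of any library.

```python
import struct

def decode_float32(x):
    """Split x (stored in single precision) into sign, exponent field, and fraction."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = (bits & 0x7FFFFF) / 2**23  # the 23 stored bits read as a binary fraction
    return sign, exponent, fraction

def value_from_fields(sign, exponent, fraction, bias=127):
    """Apply (-1)^s x (1 + fraction) x 2^(exponent - bias); normalized numbers only."""
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - bias)

for x in (1.0, -0.15625, 6.5):
    fields = decode_float32(x)
    print(x, fields, value_from_fields(*fields))
```

The three test values are exactly representable in single precision, so the rebuilt value equals the input.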
The result of a floating point computation (for a representation with a given β, p, and emax) may differ from the real-valued result. For example, with β = 10 and p = 3, if the floating point representation of the result is 3.12 × 10^-2 and the real-valued result should be 3.14159 × 10^-2, then the error is |3.12 - 3.14159| × 10^(p-1) = 2.159, or roughly 2 units in the last place (2 ulps).
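The error in ulps can be computed from the definition in Goldberg's paper: if d.dd...d × β^e approximates z, the error is |d.dd...d - z/β^e| × β^(p-1) ulps. The sketch below applies it to a computed result of 3.12 × 10^-2 against an exact value of 0.0314159, using exact rational arithmetic so that the check itself introduces no rounding. The function name and argument order are choices made for this example.

```python
from fractions import Fraction

def error_in_ulps(significand, exponent, exact, beta=10, p=3):
    """Error, in units in the last place, of significand x beta^exponent approximating exact."""
    scaled_exact = exact / Fraction(beta) ** exponent
    return abs(significand - scaled_exact) * Fraction(beta) ** (p - 1)

exact = Fraction("0.0314159")

two_ish = error_in_ulps(Fraction("3.12"), -2, exact)   # the example in the text
rounded = error_in_ulps(Fraction("3.14"), -2, exact)   # the correctly rounded result

print(float(two_ish))  # 2.159
print(float(rounded))  # 0.159
```

A correctly rounded result is always within 0.5 ulps, as the second call illustrates.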
Another way of stating the error is the relative error: the magnitude of the error divided by the magnitude of the correct real-valued result. In the above example, the relative error is 0.02159/3.14159 ≈ 0.00687232.
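The same figure falls out of a direct computation. This is a small sketch with exact rationals; it takes the approximation to be 3.12 × 10^-2 and the exact value to be 3.14159 × 10^-2, and since the common scale factor cancels, the ratio is the same as 0.02159/3.14159.

```python
from fractions import Fraction

def relative_error(approx, exact):
    """Magnitude of the error divided by the magnitude of the exact value."""
    return abs(approx - exact) / abs(exact)

rel = relative_error(Fraction("0.0312"), Fraction("0.0314159"))

print(rel)         # the exact ratio, reduced to lowest terms
print(float(rel))  # approximately 0.00687232
```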
Based on these definitions of relative error and error in ulps, you can show that an error of 0.5 ulps can correspond to relative errors that differ by as much as a factor of β, depending on where the value falls between consecutive powers of β. This is termed wobble. In other words, ulps are wobbly relative to relative error.
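Wobble is easy to see numerically: hold the error fixed at 0.5 ulps and compare the relative error at the two ends of one exponent range. The sketch assumes β = 10 and p = 3 and, for simplicity, measures the relative error against the significand itself; the helper name is invented for the illustration.

```python
from fractions import Fraction

def relative_error_of_half_ulp(significand, beta=10, p=3):
    """Relative error when a value with the given significand is off by 0.5 ulps."""
    half_ulp = Fraction(1, 2) * Fraction(beta) ** (1 - p)
    return half_ulp / significand

worst = relative_error_of_half_ulp(Fraction("1.00"))  # smallest significand
best = relative_error_of_half_ulp(Fraction("9.99"))   # largest significand

print(float(worst))         # 0.005
print(float(best))          # about 0.0005
print(float(worst / best))  # 9.99, just under beta = 10
```

The same 0.5 ulps is therefore almost ten times more significant, in relative terms, just above a power of β than just below the next one.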