Notes from “What Every Computer Scientist Should Know About Floating Point” February 22, 2006

Floating point is one way of representing real numbers in a fixed number of bits. Other approaches are fixed point, floating slash and signed logarithm representations.

In floating point, a number is represented as `±d.dd...d × β^e`. The `p` digits `d.dd...d` are called the significand (or mantissa, which is antiquated). `β` is the base, `e` is the exponent, and `p` is the precision.

Not all real numbers are exactly representable by a floating point number. For example, with β=2, there is no exact representation for 0.1.

Normalization and the Hidden Bit
A number may have multiple floating point representations. For example, if `β=10` and `p=2`, we can represent `0.1` as `0.1 × β^0` or as `1.0 × β^-1`. A floating point representation is said to be normalized if `d0` is nonzero; in base 2 this forces `d0` to be `1`. If we work only with normalized representations, we can then elide the explicit statement of `d0`, and use `p-1` stored digits to represent a `p`-precision number. The implicit `d0` in this case is called the implicit bit or hidden bit. The first use of this idea is attributed to Konrad Zuse.

As concrete examples, here are the formats defined by the IEEE 754 floating point standard:

[Single Precision]
`s | e0..e7 | f0..f22` — 1 sign bit, 8 exponent bits, 23 fraction bits (32 bits total)

[Double Precision]
`s | e0..e10 | f0..f51` — 1 sign bit, 11 exponent bits, 52 fraction bits (64 bits total)

[Single Precision Extended] and [Double Precision Extended]
The standard specifies only minimum field widths for the extended formats. The most common double extended implementation is the 80-bit x87 format: 1 sign bit, 15 exponent bits, and 64 significand bits with an explicit (not hidden) leading bit.

The motivation for the particular ordering of the fields in the floating point format words stems from a desire by the IEEE 754 committee to enable fast operations, such as search and sort, on floating point numbers using integer arithmetic. Thus, when two floating point numbers of the same sign are interpreted as integer values, the larger floating point number compares as the larger integer.

For the same reasons, the exponent is stored in biased notation, with a bias of 127 for single precision and a bias of 1023 for double precision. This is so that negative exponents have a smaller value, when interpreted as integers, than positive exponents. An exponent of -1 is stored as (-1 + 127) = 126 = 01111110, whereas 1 is stored as (1 + 127) = 128 = 10000000.

The value of a particular (normalized) floating point representation is therefore `(-1)^s × (1 + significand) × 2^(exponent − bias)`, where the 1 added to the significand is to account for the hidden bit.

Measuring Error
The result of a floating point computation (for a representation with parameters `p`, `β`, `emin` and `emax`) may differ from the true real number value. For example, with `β=10` and `p=3`, if the floating point result is `3.12 × 10^-2` and the true value is `3.14159 × 10^-2`, then the error is `0.02 × 10^-2`, or `2` units in the last place: 2 ulps.

Another way of stating the error, the relative error, is the magnitude of the error divided by the magnitude of the true value. So in the above example, the relative error is `0.02159/3.14159 ≈ 0.00687`.

Based on these definitions of relative error and error in ulps, you can show that a fixed error of 0.5 ulps corresponds to relative errors that vary by as much as a factor of `β`. This is termed wobble. So ulps are wobbly relative to relative error, huh?


One Response to “Notes from “What Every Computer Scientist Should Know About Floating Point””

1. mansour, March 27th, 2006 at 11:59 am: sometimes I wonder if our language isn’t a floating point representation of our thoughts… ulps becomes a measure of the depth of thinking (number of steps one thinks before acting), and relative error how much they were wrong when compared with the real outcome. Hey, this starts to sound like a wobbly learning system.