## Notes from “What Every Computer Scientist Should Know About Floating Point”

February 22, 2006

Floating point is one way of representing *real numbers* in a fixed number of bits. Other approaches are fixed point, floating slash and signed logarithm representations.

In floating point, a number is represented as `d.ddd...d × β^e`. The `p` digits `d.ddd...d` are called the *significand* (or *mantissa*, which is antiquated), `β` is the *base*, and `e` is the *exponent*. The number `p` is the *precision*.

Not all real numbers are exactly representable by a floating point number. For example, with β=2, there is no exact representation for *0.1*.
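As a quick check of the `0.1` claim, here is a minimal Python sketch. It assumes the interpreter's `float` is an IEEE 754 double (true on mainstream platforms, but an assumption nonetheless). It converts the stored value back to an exact fraction and shows that it is not `1/10`.

```python
from decimal import Decimal
from fractions import Fraction

x = 0.1                      # nearest binary double to one tenth
exact = Fraction(x)          # the exact rational value actually stored

print(Decimal(x))            # full decimal expansion of the stored value
print(exact == Fraction(1, 10))  # False: 1/10 has no finite base-2 expansion

# The stored value is an integer divided by a power of two.
denominator = exact.denominator
print(denominator & (denominator - 1) == 0)  # True: denominator is a power of two
```

Because the denominator of any binary floating point value is a power of two, and `1/10` needs a factor of 5 in the denominator, no choice of `p` makes `0.1` exact with `β=2`.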

**Normalization and the Hidden Bit**

A number may have multiple floating point representations. For example, if `β=10` and `p=2`, we can represent `0.1` as `0.1 × β^0` or as `1.0 × β^-1`. A floating point representation is said to be *normalized* if its leading digit `d_0` is nonzero; with `β=2`, that means `d_0` is `1`. If we work only with normalized binary representations, we can elide the explicit statement of `d_0` and use `p-1` stored digits to represent a `p`-precision number. The implicit `d_0` in this case is called the *implicit bit* or *hidden bit*. The first use of this idea is attributed to Konrad Zuse.
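Normalization can be sketched as a small routine that shifts the significand until its leading digit is nonzero, adjusting the exponent to compensate. The function name `normalize` and the use of exact `Fraction` values are illustrative choices, not something from the paper.

```python
from fractions import Fraction

def normalize(significand: Fraction, exponent: int, base: int = 10):
    """Return an equivalent (significand, exponent) with 1 <= |significand| < base."""
    if significand == 0:
        raise ValueError("zero has no normalized representation")
    while abs(significand) >= base:   # too large: shift right
        significand /= base
        exponent += 1
    while abs(significand) < 1:       # leading digit is zero: shift left
        significand *= base
        exponent -= 1
    return significand, exponent

# 0.1 x 10^0 and 1.0 x 10^-1 are the same number; only the second is normalized.
print(normalize(Fraction(1, 10), 0))  # (Fraction(1, 1), -1)
```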

As concrete examples, here are the formats defined by the IEEE 754 floating point standard:

**[Single Precision]**

`s` | `e_0 e_1 … e_7` | `f_0 f_1 … f_22` (1 sign bit, 8 exponent bits, 23 fraction bits)

**[Single Precision Extended]**

**[Double Precision]**

**[Double Precision Extended]**

The ordering of the fields in the floating point format words stems from the IEEE 754 committee's desire to enable fast operations, such as search, on floating point numbers using integer arithmetic. Thus, when the bit patterns are interpreted as integer values, larger floating point numbers should compare as larger.
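The ordering property can be illustrated with a short Python sketch that reinterprets single precision bit patterns as unsigned integers. The helper `bits32` is a hypothetical name, and the example deliberately sticks to non-negative finite values: because the format is sign-magnitude, negative numbers compare in reverse order under this raw view, and NaNs need separate handling.

```python
import struct

def bits32(x: float) -> int:
    """Bit pattern of x rounded to IEEE 754 single precision, as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

values = [2.0, 0.0, 1e10, 0.5, 1e-3, 1.5, 1.0]

# Sorting by integer bit pattern gives the same order as sorting by value
# (for non-negative, finite inputs).
by_bits = sorted(values, key=bits32)
by_value = sorted(values)
print(by_bits == by_value)  # True
```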

For the same reasons, the exponent is in *biased notation*, with a bias of *127* for single precision and a bias of *1023* for double precision. This is so that negative exponents have a smaller value, when interpreted as unsigned integers, than positive exponents. An exponent of *-1* is stored as *-1 + 127 = 126 = 01111110*, whereas an exponent of *1* is stored as *1 + 127 = 128 = 10000000*.
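The two bit patterns above can be reproduced, and cross-checked against real single precision values, with a few lines of Python. The function names are illustrative; `0.5` and `2.0` are chosen because they equal `1.0 × 2^-1` and `1.0 × 2^1`.

```python
import struct

BIAS = 127  # single precision bias

def biased_exponent(e: int) -> str:
    """8-bit pattern stored for a true exponent e."""
    return format(e + BIAS, "08b")

def stored_exponent_field(x: float) -> str:
    """Exponent field actually stored when x is packed as single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return format((bits >> 23) & 0xFF, "08b")

print(biased_exponent(-1))         # 01111110
print(biased_exponent(1))          # 10000000
print(stored_exponent_field(0.5))  # 01111110, since 0.5 = 1.0 x 2^-1
print(stored_exponent_field(2.0))  # 10000000, since 2.0 = 1.0 x 2^1
```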

The value of a particular (normalized) floating point representation is therefore `(-1)^s × (1 + fraction) × 2^(exponent − bias)`, where *fraction* is the stored significand field read as a binary fraction `0.f_0 f_1 f_2 …`, and the *1* added to it accounts for the hidden bit.
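To make the value formula concrete, here is a hedged sketch that splits a single precision number into its three fields and rebuilds the value from them. The names `decode_single` and `value_of` are made up for this example, and it only covers normalized numbers: zeros, subnormals, infinities, and NaNs use reserved exponent patterns and are not handled.

```python
import struct

BIAS = 127
FRACTION_BITS = 23

def decode_single(x: float):
    """Split x (rounded to single precision) into (sign, exponent, fraction) fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> FRACTION_BITS) & 0xFF
    fraction = bits & ((1 << FRACTION_BITS) - 1)
    return sign, exponent, fraction

def value_of(sign: int, exponent: int, fraction: int) -> float:
    """(-1)^s x (1 + fraction) x 2^(exponent - bias), for normalized numbers only."""
    return (-1) ** sign * (1 + fraction / 2 ** FRACTION_BITS) * 2.0 ** (exponent - BIAS)

fields = decode_single(-6.25)   # -6.25 = -1.5625 x 2^2
print(fields)                   # (1, 129, 4718592)
print(value_of(*fields))        # -6.25
```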

**Measuring Error**

The result of a floating point computation (for a representation with given `p`, `β`, `e_min` and `e_max`) may differ from the real-valued result. For example, with `β=10` and `p=3`, if the floating point result is `3.12 × 10^-2` and the real-valued result should be `0.0314159`, then the error is `0.0002159`. One unit in the last place of `3.12 × 10^-2` is `0.01 × 10^-2 = 0.0001`, so the error is about `2` **u**nits in the **l**ast **p**lace, or *2 ulps* (2.159 ulps, to be precise).

Another way of stating the error is the *relative error*: the magnitude of the error divided by the magnitude of the correct real-valued result. In the above example, the relative error is `0.0002159/0.0314159 ≈ 0.00687232`.
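Both error measures can be computed exactly with Python's `decimal` module. The sketch below assumes a decimal format with `β=10` and `p=3`, a computed result of `3.12 × 10^-2`, and a true value of `0.0314159`; the helper name `ulp` is illustrative.

```python
from decimal import Decimal

def ulp(exponent: int, p: int, base: int = 10) -> Decimal:
    """Size of one unit in the last place for a p-digit number with this exponent."""
    return Decimal(base) ** (exponent - (p - 1))

computed = Decimal("3.12e-2")   # floating point result, p = 3, exponent = -2
exact = Decimal("0.0314159")    # real-valued result

abs_error = abs(computed - exact)
error_in_ulps = abs_error / ulp(-2, 3)
relative_error = abs_error / abs(exact)

print(abs_error)        # 0.0002159
print(error_in_ulps)    # 2.159
print(relative_error)   # roughly 0.00687
```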

Based on these definitions of relative error and error in ulps, you can show that a fixed error of 0.5 ulp can correspond to relative errors that differ by as much as a factor of β, depending on the value of the significand. This is termed *wobble*. So *ulps* are *wobbly* relative to *relative error*, huh?
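The wobble can be seen numerically by holding the error at 0.5 ulp and comparing the smallest and largest significands that share an exponent. This sketch assumes `β=10` and `p=3`, so the significand ranges from `1.00` to `9.99`; the function name is illustrative.

```python
from fractions import Fraction

def relative_error_of_half_ulp(x: Fraction, exponent: int, p: int, base: int = 10) -> Fraction:
    """Relative error when x (with the given exponent) is off by exactly 0.5 ulp."""
    ulp = Fraction(base) ** (exponent - (p - 1))
    return (ulp / 2) / abs(x)

# Same exponent (0), same error in ulps, different significands.
largest = relative_error_of_half_ulp(Fraction(1), 0, 3)          # x = 1.00
smallest = relative_error_of_half_ulp(Fraction(999, 100), 0, 3)  # x = 9.99

ratio = largest / smallest
print(float(largest))   # 0.005
print(float(ratio))     # 9.99, just under the base of 10
```

The ratio approaches β as `p` grows, which is the factor-of-β wobble described above.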

## One Response to “Notes from “What Every Computer Scientist Should Know About Floating Point””

mansour, March 27th, 2006 at 11:59 am:

sometimes I wonder if our language isn’t a floating point representation of our thoughts… ulps becomes a measure of the depth of thinking (number of steps one thinks before acting), and relative error how much they were wrong when compared with the real outcome. Hey, this starts to sound like a wobbly learning system.