Notes from “What Every Computer Scientist Should Know About Floating Point” February 22, 2006

Floating point is one way of representing real numbers in a fixed number of bits. Other approaches are fixed point, floating slash and signed logarithm representations.

In floating point, a number is represented as d.ddd...d × β^e. The p digits d.ddd...d are called the significand (or mantissa, an antiquated term). β is the base and e is the exponent. The number p is the precision.

Not all real numbers are exactly representable by a floating point number. For example, with β=2, there is no exact representation for 0.1.
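A quick way to see this is to ask Python what value it actually stores for the literal 0.1. This is only a sketch, and it assumes the interpreter's float is an IEEE 754 double (β=2, p=53), which is the usual case but is not stated in these notes.

```python
from decimal import Decimal
from fractions import Fraction

# Decimal(float) converts the stored binary value exactly, digit for digit.
stored = Decimal(0.1)
print(stored)  # a long decimal expansion that is close to, but not equal to, 0.1

# The stored value is an exact ratio whose denominator is a power of two,
# so it cannot be equal to 1/10.
as_fraction = Fraction(0.1)
print(as_fraction)
print(as_fraction == Fraction(1, 10))  # False
```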

Normalization and the Hidden Bit
A number may have multiple floating point representations. For example, if β=10 and p=2, we can represent 0.1 as 0.1 × β^0 or as 1.0 × β^-1. A floating point representation is said to be normalized if its leading digit d0 is nonzero; with β=2, this means d0 is always 1. If we work only with normalized binary representations, we can therefore omit d0 from the stored word and use p-1 stored digits to represent a p-digit significand. The implicit d0 in this case is called the implicit bit or hidden bit. The first use of this idea is attributed to Konrad Zuse.

As concrete examples, here are the formats defined by the IEEE 754 floating point standard:
[Single Precision]
s | e0 e1 e2 e3 e4 e5 e6 e7 | f0 f1 f2 ... f22
(1 sign bit, 8 exponent bits, 23 fraction bits)

[Single Precision Extended]
[Double Precision]
[Double Precision Extended]
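For the single precision layout, the three fields can be pulled apart with shifts and masks. The sketch below is illustrative only; the helper name float32_fields is made up for this example, and it relies on the standard struct module to round a Python float to single precision.

```python
import struct

def float32_fields(x):
    """Return (sign, exponent, fraction) bit fields of x rounded to single precision."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, still biased
    fraction = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, fraction

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-0.5))   # (1, 126, 0)
```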

The motivation for the particular ordering of the fields in the floating point format words stems from a desire by the IEEE 754 committee to enable fast operations, such as search, on floating point numbers using integer arithmetic. Thus, when the bit patterns are interpreted as integer values, a larger floating point number should also be the larger integer (this holds directly for non-negative values).
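The ordering property can be checked on a handful of values. This sketch only covers non-negative numbers, since the sign bit makes negative values order the other way round as unsigned integers; the helper name float32_bits is an assumption of the example.

```python
import struct

def float32_bits(x):
    """Bit pattern of x (rounded to single precision) as an unsigned 32-bit integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

values = [2.0, 0.5, 1e10, 0.0, 1.5, 1e-3, 1.0]

by_float = sorted(values)                    # ordinary floating point comparison
by_bits = sorted(values, key=float32_bits)   # integer comparison on the bit patterns

print(by_float == by_bits)  # True
```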

For the same reason, the exponent is stored in biased notation, with a bias of 127 for single precision and a bias of 1023 for double precision. This is so that negative exponents have a smaller value than positive exponents when interpreted as integers. An exponent of -1 is stored as -1 + 127 = 126 = 01111110, whereas an exponent of 1 is stored as 1 + 127 = 128 = 10000000. In plain two's complement, by contrast, -1 would be 11111111, which looks larger than 1 when read as an unsigned integer.
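The two bit patterns above are easy to reproduce. A small sketch, with BIAS and the helper names chosen just for this example:

```python
BIAS = 127  # single precision bias

def biased_exponent_bits(e):
    """8-bit pattern stored for a true exponent e under biased notation."""
    return format(e + BIAS, "08b")

def twos_complement_bits(e):
    """8-bit two's complement pattern for e, shown for comparison only."""
    return format(e & 0xFF, "08b")

print(biased_exponent_bits(-1))  # 01111110
print(biased_exponent_bits(1))   # 10000000

# With the bias, the unsigned order of the patterns matches the order of the exponents.
print(biased_exponent_bits(-1) < biased_exponent_bits(1))   # True
# With two's complement it does not: -1 is 11111111, 1 is 00000001.
print(twos_complement_bits(-1) < twos_complement_bits(1))   # False
```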

The value of a normalized floating point representation is therefore (-1)^s × (1 + fraction) × 2^(exponent − bias), where s is the sign bit, fraction is the value of the stored bits f0f1...f22 read as a binary fraction, and the 1 added to the fraction accounts for the hidden bit.
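As a sanity check, the formula can be applied to a raw single precision bit pattern and compared with what the hardware decodes. This is a sketch for normalized numbers only; it deliberately ignores zeros, denormals, infinities and NaNs, which these notes do not cover, and the function name is hypothetical.

```python
import struct

BIAS = 127

def float32_value(bits):
    """Apply (-1)^s * (1 + fraction) * 2^(exponent - bias) to a 32-bit pattern."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = (bits & 0x7FFFFF) / 2 ** 23   # stored bits read as a binary fraction
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - BIAS)

pattern = 0x3E200000  # sign 0, exponent 124, fraction 0.25
print(float32_value(pattern))  # 0.15625

# Cross-check against the decoding done by struct.
(decoded,) = struct.unpack(">f", struct.pack(">I", pattern))
print(decoded == float32_value(pattern))  # True
```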

Measuring Error
The result of a floating point computation (for a representation with parameters p, β, emin and emax) may differ from the true real-valued result. For example, with β=10 and p=3, if the floating point result is 3.12 × 10^-2 and the real-valued result should be 0.0314159, then the error is 0.0002159, or roughly 2 units in the last place (2 ulps), since one unit in the last place of 3.12 × 10^-2 is 0.0001.
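The error in ulps can be computed mechanically. The sketch below uses Python's decimal module so that the base 10 arithmetic is exact; the function name and the choice of p=3 are assumptions made for this example.

```python
from decimal import Decimal

def error_in_ulps(computed, exact, p):
    """Error of `computed` against `exact`, in units of the last of p decimal digits."""
    # adjusted() gives the exponent e when computed is written as d.dd... x 10^e.
    e = computed.adjusted()
    # One unit in the last place is 10^(e - (p - 1)).
    ulp = Decimal(1).scaleb(e - (p - 1))
    return abs(computed - exact) / ulp

print(error_in_ulps(Decimal("3.12e-2"), Decimal("0.0314159"), 3))  # 2.159
```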

Another way of stating the error is the relative error: the magnitude of the error divided by the magnitude of the correct real-valued result. For a computed result of 3.12 × 10^-2 against a correct value of 0.0314159, the relative error is 0.0002159/0.0314159 ≈ 0.00687.
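The same figure falls out of a one-line function. A minimal sketch, using ordinary Python floats, so the last few printed digits are themselves subject to rounding:

```python
def relative_error(computed, exact):
    """Magnitude of the error divided by the magnitude of the exact value."""
    return abs(computed - exact) / abs(exact)

rel = relative_error(3.12e-2, 0.0314159)
print(round(rel, 5))  # 0.00687
```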

Based on these definitions of relative error and error in ulps, you can show that a fixed error of 0.5 ulp can correspond to relative errors that differ by as much as a factor of β, depending on the value being approximated. This factor is termed the wobble. So ulps are wobbly relative to relative error, huh?
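A small numeric illustration of the wobble, assuming β=10 and p=3 (values chosen for this example, not taken from the notes). Half an ulp is the same absolute amount for every number with exponent 0, but the relative error it represents is largest at 1.00 and smallest at 9.99.

```python
from fractions import Fraction

BETA, P = 10, 3

# For exponent 0, one ulp is BETA^-(P-1) = 0.01, so half an ulp is 0.005.
half_ulp = Fraction(1, 2) * Fraction(1, BETA ** (P - 1))

smallest = Fraction(100, 100)  # 1.00, smallest significand
largest = Fraction(999, 100)   # 9.99, largest significand

rel_at_smallest = half_ulp / smallest
rel_at_largest = half_ulp / largest

print(float(rel_at_smallest))                   # 0.005
print(float(rel_at_largest))                    # about 0.0005
print(float(rel_at_smallest / rel_at_largest))  # 9.99, just under BETA
```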

    One Response to “Notes from “What Every Computer Scientist Should Know About Floating Point””

  1. mansour March 27th, 2006 at 11:59 am | Permalink

    sometimes I wonder if our language isn’t a floating point representation of our thoughts… ulps becomes a measure of the depth of thinking (number of steps one thinks before acting), and relative error how much they were wrong when compared with the real outcome. Hey, this starts to sound like a wobbly learning system.

This entry was posted on Wednesday, February 22nd, 2006 at 9:43 am.