# Taming Floating Point Error

May 5, 2020

If you’ve been a software engineer for long enough, it is very likely that you’ve seen this example of floating point perfidy:

```
>>> 3.0 / 10
0.3
>>> 0.1 * 3
0.30000000000000004
```

We understand that this is due to the fact that floating point numbers, stored with only 64 bits of precision, cannot represent the entire real number line. Moreover, when we perform operations with these floating point numbers, the errors inherent in their representation can accumulate and multiply. The moral of the story is, never use a floating point number to represent money.

At least, that is the moral for financial applications. At Square, we
used a `long amount_cents`

, and we got along with our lives.
However, for most applications that have a good reason to use floating
point, this can’t be the end of the story. If floating point were the
unpredictable, unreliable thing that I once believed it to be, we
wouldn’t be able to numerically solve differential equations, or linear
systems, or land on the moon. Rather, there is a science of floating
point error, forming part of a science of numerical errors in general,
which seeks to tame and understand what happens to errors as they flow
through our calculations. In general, numerical error is something that
can be rather precisely quantified, as we’ll see. Along the way we’ll
look at the Fundamental Axiom of Floating Point Arithmetic, which, at
the very least, sounds way cooler than “10 things every developer should
know about floating point numbers”.

## Machine Epsilon

Fundamentally, error in floating point is due to the problem of “roundoff”. Just as in base 10, we cannot represent the number \(1/3\) without rounding it off somewhere:

in base 2, we cannot represent many numbers without rounding. Of
course, some numbers we *can* represent exactly. The number 1,
for example. Or any integer in the range \((-2^{53}, 2^{53})\). Also
notably, many fractions can be exactly represented:

However, a number like \(1/10\), just like \(1/3\) in base 10, must
be truncated to fit in the 24 or 53 bits of the mantissa. When we
enter `0.1`

in a console or in source code, the value that is
*actually stored* is very slightly different than
`0.1`

. According to this excellent IEEE-754
Floating Point Converter, the floating point number that is actually
stored (for 64-bit floating point) is

So the initial input to our calculation was flawed! We weren’t
calculating `0.1 * 3`

, we were actually calculating

`0.1000000000000000055511151231257827021181583404541015625 * 3`

How much of an error is this? We can get an idea by counting the zeros in between the significant digits, \(0.10000055\). In this case, there are 16. So in simply entering a number which is not representable exactly in floating point, we have incurred a relative error of roughly \(10^{-16}\).

Indeed, in *all* cases we can expect to incur a relative error
of roughly \(10^{-16}\). This magnitude is called machine epsilon, often
written \(_{}\). It comes from the relative difference between two
successive floating point numbers. For every representable floating
point number \(x\), there is a *next* floating point number, and
it is approximately \(x + _{} x\). So for an arbitrary real number
\(x_0\), it falls between two floating point values \(x\) and \(x + x\)
(leaving off the subscript of \(\) for conciseness). When we represent
\(x_0\) in floating point, we will get one of these two values. Let’s
denote the floating point representation of \(x_0\) by \((x_0)\). The
absolute error incurred just by representing \(x_0\) in floating point
is

The relative error, then, i.e. the absolute error divided by the true value, is

Cool! So we’ve seen that the worst we can do, in relative terms, when
representing a floating point number, is approximately \(10^{-16}\).
This is, for almost all practical purposes, *very good*. Because
remember, we’re speaking of a relative error. That means we’re able to
represent even very small values, very accurately. Here’s the nearest
floating point number to \(10^{-20}\):

If you’re curious, I invite you to count the 9’s. There are 16 of them. Even when dealing with extremely small numbers, we maintain the same relative precision.

## The Fundamental Axiom of Floating Point Arithmetic

Now you might be thinking, wait! It’s all good and well that we can
get excellent relative accuracy when *representing* floating
point numbers, but what about when we go to *do* something with
them? Here we’ve got two floating point numbers, both of which are
inexact, and we’re about to multiply them! Who knows what might
happen?

This concern is well-founded, because the algorithms of floating
point arithmetic must be implemented with finite precision. If we are
asked to multiply two large numbers with pen and paper, the algorithm
that most of us will use is the one we learned in school, which involves
lots of addition, carrying, sub-multiplications, and so on. If all of
those intermediate steps are using some kind of binary representation,
then the intermediate products may be losing precision as we go along!
This seems like a recipe for disaster. Fortunately, there is a property
that we can require of a floating point implementation, one which is
satisfied by IEEE-754 and most other popular floating point standards,
that will save us from total anarchy. This is what Trefethen and Bau,
*Numerical Linear Algebra*, refer to as the **Fundamental
Axiom of Floating Point Arithmetic**:

All floating point arithmetic operations are exact up to a relative error of \(_{}\).

This means that for any two floating point numbers, say \(x\) and \(y\), any operation involving them will give a floating point result which is within a factor of \(1 + _{}\) of the true result. This is easiest to understand if we use a special notation to represent the floating point version of, say, \(+\). Let’s write \(\) to denote floating point addition, and \(+\) to denote exact addition. Then the Fundamental Axiom tells us,

Remember, \(x\) and \(y\) are exactly representable as floating point
numbers, but of course they are also, mathematically, just real numbers.
So \(x + y\) is a real number (point on the number line), which *may
not be exactly representable in floating point*. The Fundamental
Axiom is telling us that the floating point operation \(\) can do no
worse than simply trying to represent the number \(x + y\) as a floating
point value directly, assuming our computer had access to the
mathematical, infinite precision object \(x + y\).

Put another way, we incur no extra error by going through floating point arithmetic than we would by using a magical computer to do exact arithmetic on our floating point values and then casting back to floating point. In mathematical terms,

Recall that \((a)\) is the mathematical number we actually represent when we try to represent \(a\) in floating point. So the first fraction gives the relative error from performing \(\) on the floating point representations of \(a\) and \(b\), and the second fraction gives the relative error from performing the exact arithmetic operation, \(+\), on the floating point representations of \(a\) and \(b\), then casting the result to floating point.

## Error Analysis

The Fundamental Axiom of Floating Point Arithmetic allows us to analyze the numerical error that may be incurred by even complex arithmetic operations. To see an idea of how this works, let’s consider the problem of computing the length of a 2D vector, \([x, y]\). Mathematically, this has the form

There are several stages to the computation, and at each one we will incur a little bit of numerical error:

- Enter the numbers \(x\) and \(y\) into the computer, incurring roundoff error.
- Compute \(x^2\) and \(y^2\), using floating point multiplication.
- Compute \(x^2 + y^2\), using floating point addition.
- Compute \(\), using the floating point square root operation.

At each step, the result we get will be equal to the true result, times some error factor \((1 + )\), where each \(\) is very small, on the order of \(_\). Each operation may have a different error \(\), but we’ll use the same symbol for all of them. We don’t care so much about the exact value of \(\), only that it is very small.

We’re going to carry these \(\) through the computation to see how they affect the final result. To make that easier, we can use special rules of arithmetic to manipulate \(\):

- \(^2 = 0\). If \(^{-16}\), then \(^2 ^{-32}\), which is so small that we just decide to completely ignore it.
- \((1 + )^2 = 1 + 2+ ^2 = 1 + 2\).
- \( = 1 + - + = 1 + \). This comes from doing a Taylor expansion of \(\) around the point 1. Another way to look at it is that, when \(x\) is very close to 1, \( x\), and the slope of the graph of \(\) at \(x = 1\) is \(1/2\).

With these rules, we’re ready to do the error calculation. Red expressions are the new error factor for each calculation, which appear in blue in the next expression down:

Phew! After carrying our error factors through all steps of the
calculation, we find that the total numerical error is at most 3 times
the machine precision. Keep in mind, this is an upper bound. We can’t
really say anything about what the actual error will *be*, only
that it is not greater than \(3 _{}\).

## A Cautionary Tale

The error analysis we did above was pretty carefree with manipulating
the magical error value \(\). We didn’t have to be too careful, because
multiplication and addition *of positive values* (like \(x^2\)
and \(y^2\)) is safe, using those rules. However, when we’re considering
subtraction, we have to be much more careful. Suppose \(X\) and \(Y\)
are very large numbers, but that they are very close together: \(X - Y\)
is small. Let’s try to compute \(X - Y\) in floating point, and see
where we get. Now we might be tempted to write,

However, we can’t pull out a factor of \(1 + \) from the subtraction! We have to remember that we are only pretending that \(\) is the same across these operations. In fact, to be more precise, we should write

to distinguish between the error values we get at different steps. Now, consider what happens if \(_Y = 0\), but \(_X \). Well, then,

We supposed above that \(X - Y\) was small, but \(X\) was large. Now
we have shown that the error incurred by subtracting the floating point
representations of \(X\) and \(Y\) is of the order \(_ X\), which may be
*much* larger than \(_\)! Here’s a Python snippet to
illustrate:

```
>>> (1e15 + 0.1) - 1e15
0.125
```

We should have gotten `0.1`

, but ended up getting
`0.125`

, an error of 25%! This effect is known as
“catastrophic cancellation”, and numerical analysts are very careful to
design their algorithms so as to avoid it.

## Learning More

The Fundamental Axiom of Floating Point Arithmetic gives us a
starting point to analyze numerical methods, but it is by no means the
whole story. The very best numerical algorithms have a magical property
called “backwards stability”, which essentially ensures that they
*exactly* solve a problem which is *almost* the correct
problem–the problem they solve exactly is within machine precision of
the problem you meant to solve. These algorithms are far more accurate
than they have any right to be. Once you start to use the algorithms of
numerical linear algebra to solve differential equations that evolve in
time, there is a whole science of ensuring that the roundoff error
incurred at each time step does not blow up exponentially. With these
tools, rather than be intimidated by numerical error, we can anticipate
and control it. While seeing your calculation give almost exactly the
right answer is a good feeling, there is something even more satisfying
about computing the error you got, and verifying that it, too, is
exactly as large as expected.

Everything I know about this stuff, I learned from the excellent book
*Numerical
Linear Algebra*, by Lloyd Trefethen and David Bau, as well as
the inimitable Anne
Greenbaum of the University of Washington.