# Taming Floating Point Error

If you’ve been a software engineer for long enough, it is very likely that you’ve seen this example of floating point perfidy:

```
>>> 3.0 / 10
0.3
>>> 0.1 * 3
0.30000000000000004
```

We understand that this is due to the fact that floating point numbers, stored with only 64 bits of precision,
cannot represent the entire real number line. Moreover, when we perform operations with these floating point numbers,
the errors inherent in their representation can accumulate and *multiply*. The moral of the story is, never use a
floating point number to represent money.

At least, that is the moral for financial applications. At Square, we used a `long amount_cents`

, and we got along with
our lives. However, for most applications that have a good reason to use floating point, this can’t be the end of the
story. If floating point were the unpredictable, unreliable thing that I once believed it to be, we wouldn’t be able to
numerically solve differential equations, or linear systems, or land on the moon. Rather, there is a science of floating
point error, forming part of a science of numerical errors in general, which seeks to tame and understand what happens
to errors as they flow through our calculations. In general, numerical error is something that can be rather precisely
quantified, as we’ll see. Along the way we’ll look at the Fundamental Axiom of Floating Point Arithmetic, which, at the
very least, sounds way cooler than “10 things every developer should know about floating point numbers”.

## Machine Epsilon

Fundamentally, error in floating point is due to the problem of “roundoff”. Just as in base 10, we cannot represent the number \(1/3\) without rounding it off somewhere:

\begin{equation*} \frac{1}{3} = 0.33333333333333 \dots \approx 0.33333333, \end{equation*}

in base 2, we cannot represent many numbers without rounding. Of course, some numbers we *can* represent exactly. The
number 1, for example. Or any integer in the range \((-2^{53}, 2^{53})\). Also notably, many fractions can be exactly
represented:

\begin{align*}
\frac{1}{2} &= 0.1_2 \\\

\frac{3}{4} &= 0.11_2 \\\

\frac{17}{8} &= 10.001_2 \\\

&\vdots
\end{align*}

However, a number like \(1/10\), just like \(1/3\) in base 10, must be truncated to fit in the 24 or 53 bits of the
mantissa. When we enter `0.1`

in a console or in source code, the value that is *actually stored* is very slightly
different than `0.1`

. According to this excellent IEEE-754 Floating Point Converter, the floating point number that is
actually stored (for 64-bit floating point) is

\begin{align*}
&(0.0001100110011001100110011001100110011001100110011001101)_2 = \\\

&(0.1000000000000000055511151231257827021181583404541015625)_{10}
\end{align*}

So the initial input to our calculation was flawed! We weren’t calculating `0.1 * 3`

, we were actually calculating

```
0.1000000000000000055511151231257827021181583404541015625 * 3
```

How much of an error is this? We can get an idea by counting the zeros in between the significant digits, \(0.100\dots 00055\). In this case, there are 16. So in simply entering a number which is not representable exactly in floating point, we have incurred a relative error of roughly \(10^{-16}\).

Indeed, in *all* cases we can expect to incur a relative error of roughly \(10^{-16}\). This magnitude is called machine
epsilon, often written \(\epsilon_{\text{machine}}\). It comes from the relative difference between two successive
floating point numbers. Unlike the number line, for every representable floating point number \(x\), there is a *next*
floating point number, and it is approximately \(x + \epsilon_{\text{machine}} x\). So for an arbitrary real number \(x_0\),
it falls between two floating point values \(x\) and \(x + \epsilon x\) (leaving off the subscript for conciseness). When we
represent \(x_0\) in floating point, we will get one of these two values. Denote the floating point representation
of \(x_0\) by \(\text{fl}(x_0)\). The absolute error is

\begin{equation*} e_{\text{abs}} = |\text{fl}(x_0) - x_0| \leq \max(|x_0 - x|, |x_0 - (x + \epsilon x)|) \leq |\epsilon x|. \end{equation*}

The relative error, then, i.e. the absolute error divided by the true value, is

\begin{equation*} e_{\text{rel}} = \frac{e_{\text{abs}}}{x_0} \leq \frac{|\epsilon x|}{x_0} \approx \epsilon. \end{equation*}

Cool! So we’ve seen that the worst we can do, in relative terms, when representing a floating point number, is
approximately \(10^{-16}\). This is, for almost all practical purposes, *very good*. Because remember, we’re speaking of
a relative error. That means we’re able to represent even very small values, very accurately. Here’s the nearest floating point
number to \(10^{-20}\):

\begin{equation*} 0.000000000000000000009999999999999999451532\dots \end{equation*}

If you’re curious, I invite you to count the 9’s. There are 16 of them. Even when dealing with extremely small numbers, we maintain the same relative precision.

## The Fundamental Axiom of Floating Point Arithmetic

Now you might be thinking, wait! It’s all good and well that we can get excellent relative accuracy when *representing*
floating point numbers, but what about when we go to *do* something with them? Here we’ve got two floating point
numbers, both of which are inexact, and we’re about to multiply them! Who knows what might happen?

This concern is well-founded, because the algorithms of floating point arithmetic must be implemented with finite
precision. If we are asked to multiply two large numbers with pen and paper, the algorithm that most
of us will use is the one we learned in school, which involves lots of addition, carrying, sub-multiplications, and so
on. If all of those intermediate steps are using some kind of binary representation, then the intermediate products may
be losing precision as we go along! This seems like a recipe for disaster. Fortunately, there is a property that we can
require of a floating point implementation, one which is satisfied by IEEE-754 and most other popular floating point
standards, that will save us from total anarchy. This is what Trefethen and Bau, *Numerical Linear Algebra*, refer to
as the **Fundamental Axiom of Floating Point Arithmetic**:

All floating point arithmetic operations are exact up to a relative error of \(\epsilon_{\text{machine}}\).

This means that for any two floating point numbers, say \(x\) and \(y\), any operation involving them will give a floating point result which is within a factor of \(1 + \epsilon_{\text{machine}}\) of the true result. This is easiest to understand if we use a special notation to represent the floating point version of, say, \(+\). Let’s write \(\oplus\) to denote floating point addition, and \(+\) to denote exact addition. Then the Fundamental Axiom tells us,

\begin{equation*} \frac{|(x \bigoplus y) - (x + y)|}{x + y} \leq \epsilon_{\text{machine}}. \end{equation*}

Remember, \(x\) and \(y\) are exactly representable as floating point numbers, but of course they are also, mathematically,
just real numbers. So \(x + y\) is a real number (point on the number line), which *may not be exactly representable in
floating point*. The Fundamental Axiom is telling us that the floating point operation \(\oplus\) can do no worse than
simply trying to represent the number \(x + y\) as a floating point value directly, assuming our computer had access to the
mathematical, infinite precision object \(x + y\).

Put another way, we incur no extra error by going through floating point arithmetic, than we would by using a magical computer to do exact arithmetic on our floating point values, and then casting back to floating point. In mathematical terms,

\begin{equation*} \frac{\left|(\text{fl}(a) \oplus \text{fl}(b)) - (a + b)\right|}{a + b} \approx \frac{\left|\text{fl}(\text{fl}(a) + \text{fl}(b)) - (a + b)\right|}{a + b} \approx \epsilon_{\text{machine}}. \end{equation*}

Recall that \(\text{fl}(a)\) is the mathematical number we actually represent, when we try to represent \(a\) in floating point. So the first fraction gives the relative error from performing \(\oplus\) on the floating point representations of \(a\) and \(b\), and the second fraction gives the relative error from performing the exact arithmetic operation, \(+\), on the floating point representations of \(a\) and \(b\), then casting the result to floating point.

## Error Analysis

The Fundamental Axiom of Floating Point Arithmetic allows us to analyze the numerical error that may be incurred by even complex arithmetic operations. To see an idea of how this works, let’s consider the problem of computing the length of a 2D vector, \([x, y]\). Mathematically, this has the form

\begin{equation*} d = \sqrt{x^2 + y^2}. \end{equation*}

There are several stages to the computation, and at each one we will incur a little bit of numerical error:

- Enter the numbers \(x\) and \(y\) into the computer, incurring roundoff error.
- Compute \(x^2\) and \(y^2\), using floating point multiplication.
- Compute \(x^2 + y^2\), using floating point addition.
- Compute \(\sqrt{x^2 + y^2}\), using the floating point square root operation.

At each step, the result we get will be equal to the true result, times some error factor \((1 + \epsilon)\), where each \(\epsilon\) is very small, on the other of \(\epsilon_\text{machine}\). Each operation may have a different error \(\epsilon\), but we’ll use the same symbol for all of them. We don’t care so much about the exact value of \(\epsilon\), only that it is very small.

We’re going to carry these \(\epsilon\) through the computation to see how they affect the final result. To make that easier, we can use special rules of arithmetic to manipulate \(\epsilon\):

- \(\epsilon^2 = 0\). If \(\epsilon \approx 10^{-16}\), then \(\epsilon^2 \approx 10^{-32}\), which is so small that we just decide to completely ignore it.
- \((1 + \epsilon)^2 = 1 + 2\epsilon + \epsilon^2 = 1 + 2\epsilon\).
- \(\sqrt{1 + \epsilon} = 1 + \frac{\epsilon}{2} - \frac{\epsilon^2}{8} + \cdots = 1 + \frac{\epsilon}{2}\). This comes
from doing a Taylor expansion of
`sqrt`

around the point 1. Another way to look at it is that, when \(x\) is very close to 1, \(\sqrt{x} \approx x\), and the slope of the graph of \(\sqrt{x}\) at \(x = 1\) is \(1/2\).

With these rules, we’re ready to do the error calculation. Red expressions are the new error factor for each calculation, which appear in blue in the next expression down:

\begin{align*}
\text{fl}(x) &= {\color{red}{(1 + \epsilon)}} x \\\

\text{fl}(y) &= {\color{red}{(1 + \epsilon)}} y \\\

\\\

\text{fl}(x) \otimes \text{fl}(x) &= (1 + \epsilon) \left[ {\color{blue}{(1 + \epsilon)}}^2 x^2 \right] = (1 + \epsilon)^3 x^2 \\\

&= (1 + 3\epsilon + 3\epsilon^2 + \epsilon^3) x^2 \\\

&= {\color{red}{(1 + 3\epsilon)}} x^2 \\\

\text{fl}(y) \otimes \text{fl}(y) &= (1 + \epsilon) \left[ {\color{blue}{(1 + \epsilon)}}^2 y^2 \right] = (1 + \epsilon)^3 y^2 \\\

&= (1 + 3\epsilon + 3\epsilon^2 + \epsilon^3) y^2 \\\

&= {\color{red}{(1 + 3\epsilon)}} y^2 \\\

\\\

[\text{fl}(x) \otimes \text{fl}(x)] \oplus [\text{fl}(y) \otimes \text{fl}(y)] &= (1 + \epsilon) \left[ {\color{blue}{(1 + 3\epsilon)}} x^2 + {\color{blue}{(1 + 3\epsilon)}} y^2 \right] \\\

&= (1 + \epsilon + 3\epsilon + 3\epsilon^2) [ x^2 + y^2 ] \\\

&= {\color{red}{(1 + 4\epsilon)}}(x^2 + y^2) \\\

\\\

\sqrt[\text{fl}]{ \text{fl}(x) \otimes \text{fl}(x) \oplus \text{fl}(y) \otimes \text{fl}(y)} &= (1 + \epsilon) \sqrt{{\color{blue}{(1 + 4\epsilon)}}(x^2 + y^2)} \\\

&= (1 + \epsilon)\sqrt{(1 + 4\epsilon)} \sqrt{x^2 + y^2} \\\

&= (1 + \epsilon)(1 + 2\epsilon) \sqrt{x^2 + y^2} \\\

&= {\color{red}{(1 + 3\epsilon)}} \sqrt{x^2 + y^2}
\end{align*}

Phew! After carrying our error factors through all steps of the calculation, we find that the total numerical
error is at most 3 times the machine precision. Keep in mind, this is an upper bound. We can’t really say anything
about what the actual error will *be*, only that it is not greater than \(3 \epsilon_{\text{machine}}\).

## A Cautionary Tale

The error analysis we did above was pretty carefree with manipulating the magical error value \(\epsilon\). We didn’t have
to be too careful, because multiplication and addition *of positive values* (like \(x^2\) and \(y^2\)) is safe, using those rules. However, when
we’re considering subtraction, we have to be much more careful. Suppose \(X\) and \(Y\) are very large numbers, but that
they are very close together: \(X - Y\) is small. Let’s try to compute \(X - Y\) in floating point, and see where we get.
Now we might be tempted to write,

\begin{align*}
\text{fl}(X) \ominus \text{fl}(Y) &= (1 + \epsilon) [ (1 + \epsilon) (X) - (1 + \epsilon) (Y) ] \\\

&= (1 + \epsilon)^2 (X - Y) \\\

&= (1 + 2\epsilon) (X - Y).
\end{align*}

However, we can’t pull out a factor of \(1 + \epsilon\) from the subtraction! We have to remember that we are only pretending that \(\epsilon\) is the same across these operations. In fact, to be more precise, we should write

\begin{align*}
\text{fl}(X) \ominus \text{fl}(Y) &= (1 + \epsilon_\ominus) [ (1 + \epsilon_X) (X) - (1 + \epsilon_Y) (Y) ] \\\

\end{align*}

to distinguish between the error values we get at different steps. Now, consider what happens if \(\epsilon_Y = 0\), but \(\epsilon_X \neq 0\). Well, then,

\begin{align*} (1 + \epsilon_X) (X) - (1 + \epsilon_Y) (Y) = X + \epsilon_X X - Y = (X - Y) + \epsilon_X X. \end{align*}

We supposed above that \(X - Y\) was small, but \(X\) was large. Now we have shown that the error incurred by subtracting
the floating point representations of \(X\) and \(Y\) is of the order \(\epsilon_\text{machine} X\), which may be *much*
larger than \(\epsilon_\text{machine}\)! Here’s a Python snippet to illustrate:

```
>>> (1e15 + 0.1) - 1e15
0.125
```

We should have gotten `0.1`

, but ended up getting `0.125`

, an error of 25%!
This effect is known as “catastrophic cancellation”, and numerical analysts are
very careful to design their algorithms so as to avoid it.

## Learning More

The Fundamental Axiom of Floating Point Arithmetic gives us a starting point to analyze numerical methods, but it is by
no means the whole story. The very best numerical algorithms have a magical property called “backwards stability”, which
essentially ensures that they *exactly* solve a problem which is *almost* the correct problem–the problem they solve
exactly is within machine precision of the problem you meant to solve. These algorithms are far more accurate than they
have any right to be. Once you start to use the algorithms of numerical linear algebra to solve differential equations
that evolve in time, there is a whole science of ensuring that the roundoff error incurred at each time step does not
blow up exponentially. With these tools, rather than be intimidated by numerical error, we can anticipate and control
it. While seeing your calculation give almost exactly the right answer is a good feeling, there is something even more
satisfying about computing the error you got, and verifying that it, too, is exactly as large as expected.

Everything I know about this stuff, I learned from the excellent book *Numerical Linear Algebra*, by Lloyd Trefethen and
David Bau, as well as the inimitable Anne Greenbaum of the University of Washington.