Basic usage

So you have attempted to perform a simple floating point calculation in Rust, perhaps an old classic:


#![allow(unused)]
fn main() {
let sum = 0.1 + 0.2;

if sum == 0.3 {
    println!("Decimal math is working as expected!");
} else {
    println!("Something has gone horribly wrong!");
}
}

And it turns out that:

Something has gone horribly wrong!

What's going on? Let's take a closer look at the result of the sum:


#![allow(unused)]
fn main() {
println!("0.1 + 0.2 = {}", sum);
}
0.1 + 0.2 = 0.30000000000000004

That doesn't seem like the right answer at all! It's very close, but why is it off by a tiny amount?

Well, what's happened is that the f64 type being used to calculate sum has a binary numeric representation, and our inputs are specified as decimal numbers. Neither 0.1 nor 0.2 in decimal can be represented exactly in binary and so they have been rounded to the nearest values which can be. The addition is performed using binary arithmetic and finally is converted back into a decimal representation to be printed.

We can see how these values have been rounded by printing them with a high enough precision:


#![allow(unused)]
fn main() {
println!("0.1 -> {:.55}", 0.1);
println!("0.2 -> {:.55}", 0.2);
}
0.1 -> 0.1000000000000000055511151231257827021181583404541015625
0.2 -> 0.2000000000000000111022302462515654042363166809082031250

Close, so very close

Very few decimal numbers may be exactly represented in binary. Only powers of two:


#![allow(unused)]
fn main() {
println!("4.0   -> {:.55}", 4.0);
println!("2.0   -> {:.55}", 2.0);
println!("1.0   -> {:.55}", 1.0);
println!("0.5   -> {:.55}", 0.5);
println!("0.25  -> {:.55}", 0.25);
println!("0.125 -> {:.55}", 0.125);
}
4.0   -> 4.0000000000000000000000000000000000000000000000000000000
2.0   -> 2.0000000000000000000000000000000000000000000000000000000
1.0   -> 1.0000000000000000000000000000000000000000000000000000000
0.5   -> 0.5000000000000000000000000000000000000000000000000000000
0.25  -> 0.2500000000000000000000000000000000000000000000000000000
0.125 -> 0.1250000000000000000000000000000000000000000000000000000

And sums of powers of two:


#![allow(unused)]
fn main() {
println!("{:.55}", 0.5 + 0.25);
println!("{:.55}", 4.0 + 0.5 + 0.125);
}
0.7500000000000000000000000000000000000000000000000000000
4.6250000000000000000000000000000000000000000000000000000

Where:

  • 4.000 = 22
  • 2.000 = 21
  • 1.000 = 20
  • 0.500 = 2-1
  • 0.250 = 2-2
  • 0.125 = 2-3

Most decimal numbers are not sums of powers of two1, so every time we convert a decimal into floating point there is a high chance that it will be rounded. The difference between the input value and the converted f64 floating point value is known as the initial roundoff error.

Accumulating errors

Conversion from decimal to binary is not the only source of error in floating point arithmetic. If the only problem was the initial roundoff error, then we could reasonably expect the sum 0.1 + 0.2 to be computed exactly and be equal to the converted constant 0.3 when we test it with ==.

It is not:


#![allow(unused)]
fn main() {
let sum = 0.1 + 0.2;

println!("sum -> {:.55}", sum);
println!("0.3 -> {:.55}", 0.3);
}
sum -> 0.3000000000000000444089209850062616169452667236328125000
0.3 -> 0.2999999999999999888977697537484345957636833190917968750

Why aren't they the same? Well, sum has been rounded up to the nearest representable f64 number, and 0.3 has been rounded down.

Let's think a bit more about the original inputs. They have both been rounded up compared to their decimal values:

0.1 -> 0.1000000000000000055511151231257827021181583404541015625
0.2 -> 0.2000000000000000111022302462515654042363166809082031250

If we add these rounded values together by hand, we get a number that is just slightly above 0.3:

  0.1000000000000000055511151231257827021181583404541015625
+ 0.2000000000000000111022302462515654042363166809082031250
= 0.3000000000000000166533453693773481063544750213623046775

The results of operations on floating point values are rounded to the nearest representable value2. It just so happens that in this example, the sum of the two rounded values is high enough to be rounded up, whereas 0.3 is low enough to be rounded down. This is why checking they are equal with == returns false.

Nearly every operation on floating point numbers can result in further rounding, amplifying the effect of previous rounding. You may be able to mitigate this somewhat by reordering operations to reduce the magnitude, but it is impossible to avoid error entirely.

Distant relatives

Wait a minute though. If our two input two values are exactly representable as binary numbers then why was the result of adding them together rounded at all? Shouldn't sum be the same as the answer we calculated by hand? It is somewhat higher than we expected:

by hand: 0.3000000000000000166533453693773481063544750213623046775
by f64:  0.3000000000000000444089209850062616169452667236328125000

They are different because of the fundamental design of floating point numbers. The big advantage of floating point types, and the reason that we may be happy to deal with the small errors inherent in using them, is that they can represent a literally astronomically larger range of absolute values than integer types can in the same number of bits.

The trade off is that the granularity of floating point numbers changes with their magnitude. As floating point numbers get larger the absolute distance between adjacent values grows, and as they get smaller it shrinks, whereas the relative distance between values remains constant. This is in contrast to integer types where, regardless of magnitude, adjacent values are always the same absolute distance away from each another (one), but their relative distance changes.

Let's make this more concrete by comparing i32 and f32. Each can represent 4,294,967,296 different values using their 32 bits, but they make very different choices as to what those values are.

The positive range of i32 can represent:


#![allow(unused)]
fn main() {
0
1
...
2147483646
2147483647 // i32::MAX
}

The positive range of f32 can represent:


#![allow(unused)]
fn main() {
0.0
...
0.000000000000000000000000000000000000011754944 // f32::MIN_POSITIVE
0.000000000000000000000000000000000000011754945
...
340282330000000000000000000000000000000.0
340282350000000000000000000000000000000.0 // f32::MAX
∞ // f32::INFINITY                                        
}

As mentioned above, the absolute distance between every adjacent pair of i32 numbers within its range is always one, but their relative distance changes depending on their magnitude:

0 -> 1:
absolute: 1 - 0 = 1
relative: 1 / 0 = ∞

1 -> 2:
absolute: 2 - 1 = 1
relative: 2 / 1 = 2

...

2147483646 -> 2147483647:
absolute: 2147483647 - 2147483646 = 1
relative: 2147483647 / 2147483646 = 1.0000000004656612873077392578125

The f32 number line is a bit more complex. Zero and infinity are special values at the extremes and the tiny range of subnormal values just above zero act differently but the vast majority of f32 values are normal floating point numbers in the range from f32::MIN_POSITIVE to f32::MAX.

The absolute distance between each pair of adjacent normal f32 values varies depending on their size but their relative distance is always very close to 1.0000001. We can illustrate this with some help from the ieee754 crate:


#![allow(unused)]
fn main() {
use ieee754::Ieee754;

let a = f32::MIN_POSITIVE;
let b = a.next();
println!("{} ->\n{}:", a, b);
println!("absolute: {}", b - a);
println!("relative: {}\n", b / a);

let c = f32::MIN_POSITIVE.next();
let d = c.next();
println!("{} ->\n{}:", c, d);
println!("absolute: {}", d - c);
println!("relative: {}\n", d / c);

println!("...\n");

let e = f32::MAX.prev();
let f = f32::MAX;
println!("{} ->\n{}:", e, f);
println!("absolute: {}", f - e);
println!("relative: {}", f / e);
}
0.000000000000000000000000000000000000011754944 ->
0.000000000000000000000000000000000000011754945:
absolute: 0.000000000000000000000000000000000000000000001
relative: 1.0000001

0.000000000000000000000000000000000000011754945 ->
0.000000000000000000000000000000000000011754946:
absolute: 0.000000000000000000000000000000000000000000001
relative: 1.0000001

...

340282330000000000000000000000000000000 ->
340282350000000000000000000000000000000:
absolute: 20282410000000000000000000000000
relative: 1.0000001

Machine epsilon

The absolute distance between adjacent numbers for a floating point type is determined by multiples of that type's machine epsilon. This is the distance between adjacent values in the range 1.0 to 2.0. For f64 this is the constant f64::EPSILON. We can scale it for every other range of powers of two to determine the absolute distance between numbers in those ranges:

  • 0.25 to 0.5 contains 8,388,608 values, all a distance of 0.25 * f64::EPSILON apart.
  • 0.5 to 1.0 contains 8,388,608 values, all a distance of 0.5 * f64::EPSILON apart.
  • 1.0 to 2.0 contains 8,388,608 values, all a distance of f64::EPSILON apart.
  • 2.0 to 4.0 contains 8,388,608 values, all a distance of 2.0 * f64::EPSILON apart.
  • 4.0 to 8.0 contains 8,388,608 values, all a distance of 4.0 * f64::EPSILON apart.

This is why our addition was unexpectedly rounded up:

  • 0.1 is in the range 0.0625 to 0.125, which are all 0.0625 f64::EPSILON apart.
  • 0.2 is in the range 0.125 to 0.25, which are all 0.125 * f64::EPSILON apart.
  • 0.1 + 0.2 is in the range 0.25 to 0.5, which are all 0.25 * f64::EPSILON apart.

The input values 0.1 and 0.2 were both rounded up to the nearest values in their respective ranges. The result of adding them together is within a range with a lower granularity and was not exactly representable, so it needed to be rounded to the nearest value that is.

It gets much worse

But that's not the end of the story.

There are far more significant sources of error in most numerical calculations than the roundoff errors we have described so far. If the inputs are real world values, then there is almost certainly some measurement error in how they were collected. Algorithms like physics simulations that use discrete steps to approximate continuous functions introduce truncation error. Even the underyling mathematics may amplify existing error if it is numerically unstable, for example when dividing large numbers by much smaller ones or if values undergo catastrophic cancellation.

I'll say that again, because it's so important: in most real world programs, floating point roundoff error is so small as to be insignificant compared to other sources of error. Floating point numerics may be inappropriate for values that have strict absolute precision requirements, such as currency, but in general there is no need to shy away from using them because of their rounding behaviour.

Every numerical algorithm is unique and there is no one size fits all solution or set of defaults to account for the error inherent in them and hide it away. In fact, there is an entire field of active mathematical research concerned with computational error, Numerical Analysis. What we can do, however, is learn to reason about these sources of error and provide tools for taking it into account and communicating our thoughts to future readers and maintainers.

Close enough

Bearing that in mind, let's return to our original comparison. Now that we know why the exact == comparison failed, we can instead check that the difference between the expected and actual values lies within a margin of error, known as the tolerance. This is what the float_eq! macro is for.

Absolute tolerance comparison

The simplest algorithm to check if two floating point numbers are equal is an absolute tolerance comparison. This tests that the absolute difference between two values lies within a specified tolerance and is invoked with the syntax abs <= tol. We have calculated that in our very simple example the values may differ by at most 0.25 * f64::EPSILON:


#![allow(unused)]
fn main() {
use float_eq::float_eq;

let sum = 0.1 + 0.2;

if float_eq!(sum, 0.3, abs <= 0.25 * f64::EPSILON) {
    println!("Floating point math is working as expected!");
} else {
    println!("Something has gone horribly wrong!");
}
}
Floating point math is working as expected!

Relative tolerance comparison

Hurray! However, manually scaling the tolerance to the range of the operands is not very elegant. Fortunately, given that we know that we are comparing two normal numbers, we can use a relative tolerance comparison to scale the tolerance for us. The second operand is our expected value, so we may choose to use r2nd <= tol. With a relative tolerance comparison, the tolerance should be specified as if we were testing a value in the range 1.0 to 2.0, so f64::EPSILON indicates we are expecting our operands to be no more than one representable value apart from each other3:


#![allow(unused)]
fn main() {
use float_eq::float_eq;
let sum = 0.1 + 0.2;
if float_eq!(sum, 0.3, r2nd <= f64::EPSILON) {
    println!("Floating point math is working as expected!");
} else {
    println!("Something has gone horribly wrong!");
}
}
Floating point math is working as expected!

ULPs based comparison

If both numbers are normal and the same sign, which is often the case, we can use an ULPs comparison, another form of relative check. This uses a property of the underlying representation of floating point numbers which means that when we interpret their bits as unsigned integers, the adjacent floats are the adjacent integer values above and below. By using ulps <= tol to invoke one of these checks, the tolerance is the maximum number of representable values apart they may be regardless of their magnitude. In our example, we know they may be at most one representable f64 value apart:


#![allow(unused)]
fn main() {
use float_eq::float_eq;
let sum = 0.1 + 0.2;
if float_eq!(sum, 0.3, ulps <= 1) {
    println!("Floating point math is working as expected!");
} else {
    println!("Something has gone horribly wrong!");
}
}
Floating point math is working as expected!

Asserting ourselves

The float_eq library also includes assert_float_eq! to accompany the boolean float_eq! operator. To illustrate their use and show some more advanced comparison techniques, we will define a very simple numerical integrator. This function takes an initial value and step size, then advances the value by an arbitrary number of steps:


#![allow(unused)]
fn main() {
fn integrate(initial: f64, step: f64, count: usize) -> f64 {
    let mut sum = initial;
    for _ in 0..count {
        sum += step;
    }
    sum
}
}

Say we want to unit test a number of different sets of input to this algorithm, we might build ourselves a wrapper for it:


#![allow(unused)]
fn main() {
fn test_integrate(initial: f64, step: f64, count: usize, expected: f64) {
    let actual = integrate(initial, step, count);
    assert_float_eq!(actual, expected, r2nd <= f64::EPSILON);
}
}

Note that the syntax for assert, r2nd <= tol is the same as for the boolean form, and that you may use any of the same algorithms if you wish. Arbitrarily, we have begun with the same tolerance as for our first example comparison. This means that the equivalent of our first simple comparison will pass this test just fine:


#![allow(unused)]
fn main() {
test_integrate(0.2, 0.1, 1, 0.3);
}

Drifting away

Let's add some tests which use an increasing step size. Our first two are within the existing expected tolerance:


#![allow(unused)]
fn main() {
test_integrate(0.0, 0.1, 1, 0.1);
test_integrate(0.0, 0.1, 10, 1.0);
}

But when we look at 100 steps, we find that our test fails:


#![allow(unused)]
fn main() {
test_integrate(0.0, 0.1, 100, 10.0);
}
thread 'main' panicked at 'assertion failed: `float_eq!(left, right, r2nd <= t)`
        left: `9.99999999999998`,
       right: `10.0`,
    abs_diff: `0.000000000000019539925233402755`,
   ulps_diff: `Some(11)`,
    [r2nd] t: `0.000000000000002220446049250313`', src\main.rs:15:9

This assert form prints out the values of the operands, like the standard Rust asserts, but it also provides additional context information to help us make sense of why it failed. Here, t is the tolerance:

  • abs_diff is the absolute difference between the two values: (left - right).abs().
  • ulps_diff is the difference in ULPs between the two values, the count of representable values they are apart. It may be None if two numbers have different signs.
  • [r2nd] t is the tolerance as scaled to the range of right (the second operand).

We can see that our actual and expected values are eleven ULPs from one another. That means that our tolerance of f64::EPSILON, equivalent to one ULP, is inadequate. This is because we are performing count number of additions, and each one of those provides an answer accurate to within 0.5 ULPs, so they have accumulated more error than a single step or ten steps would. We might reason that our tolerance should therefore be some function of the number of steps, perhaps (count * 0.5) * f64::EPSILON:


#![allow(unused)]
fn main() {
fn test_integrate(initial: f64, step: f64, count: usize, expected: f64) {
    let actual = integrate(initial, step, count);
    assert_float_eq!(
        actual,
        expected,
        r2nd <= ((count as f64) * 0.5) * f64::EPSILON
    );
}
}

Absolute zero

Let's take a look at another failure case, when we count backwards to zero:


#![allow(unused)]
fn main() {
test_integrate(10.0, -0.1, 100, 0.0);
}
thread 'main' panicked at 'assertion failed: `float_eq!(left, right, r2nd <= t)`
        left: `0.000000000000018790524691780774`,
       right: `0.0`,
    abs_diff: `0.000000000000018790524691780774`,
   ulps_diff: `Some(4401468191289638912)`,
    [r2nd] t: `0.0`', src\main.rs:15:9

Wow! There are a couple of things to note about this failure. The first is that because we are comparing versus zero, scaling our tolerance doesn't work - zero times anything is zero, so the tolerance does not take into account the step count. Even more noticable though is just how far away our actual value is in terms of representable numbers.

There are two reasons why this has happened. The first is that, as mentioned above, zero is a special number and does not have the same properties as normal floating point numbers. The second is that our final subtraction leaves us with a result many orders of magnitude smaller than the previous sum total, resulting in a catastrophic cancellation.

The solution here is to use a more sophisticated calculation for our tolerance, one that takes into account the nature of the calculation itself. It needs to be an absolute test, since we may be comparing versus zero. It should also scale relative to the largest intermediate value in the calculation and take into account the potential rounding errors from our repeated addition:


#![allow(unused)]
fn main() {
fn test_integrate(initial: f64, step: f64, count: usize, expected: f64) {
    let actual = integrate(initial, step, count);
    let half_count = (count as f64) * 0.5;
    let tol = f64::EPSILON * half_count * f64::max(initial.abs(), actual.abs());
    assert_float_eq!(actual, expected, abs <= tol);
}
}

This further illustates that there is no one right or general way to express the tolerances of numeric algorithms. Every comparison will be based on the specific calculation being performed and frequently the particular inputs.

Custom messages

The assert macros may include a custom message in the same manner as the standard Rust asserts:


#![allow(unused)]
fn main() {
assert_float_eq!(
    actual,
    expected,
    r2nd <= f64::EPSILON,
    "\nWhere: initial: {}, step: {}, count: {}",
    initial,
    step,
    count
);
}
thread 'main' panicked at 'assertion failed: `float_eq!(left, right, r2nd <= t)`
        left: `9.99999999999998`,
       right: `10.0`,
    abs_diff: `0.000000000000019539925233402755`,
   ulps_diff: `Some(11)`,
    [r2nd] t: `0.000000000000002220446049250313`:
Where: initial: 0, step: 0.1, count: 100', src\main.rs:14:9

Staying afloat

Hopefully that's given you some flavour of the issues that crop up when implementing numerical methods and how float_eq may aid you when they do. At this point you may be interested in learning how to perform some specific tasks, reading some more general background explanations or browsing the API documentation.


1

They are in fact all exact sums of multiples of powers of ten.

2

By default. Other rounding modes such as always rounding up, down or toward zero are available.

3

In general, n * fXX::EPSILON as a relative tolerance means "at most n representable values apart", for example you might use a tolerance of 4.0 * f64::EPSILON.