Benchmarking the cost of primitive operations on the JVM

(when are floating point operations faster than integer ones?)

I was listening to the interview with Andrei Alexandrescu - of C++ and D-lang fame (cheers homie! :-)) on CppCast when I heard something interesting: float divisions are supposed to be faster than integer ones on modern CPUs! So I had to dig into that.

The setup

I whipped up some JMH benchmarks (the sources are on GitHub) and ran them on three machines:

An older Xeon E5-2665
An i7-4600U
An i7-4750HQ

(Other factors - like memory, CPU count, exact JVM version, etc - are not that relevant to this discussion since this is strictly a single-threaded ALU benchmark)

The answers

You can find the exact numbers in the GitHub repository and I also created a Google Docs spreadsheet to better visualize the results.

What is the order (from a speed point of view) of integer operations?
They seem to fall into two categories div/mod and the rest, with the former being twice as slow as the later.
This seems to hold regardless of data type (byte / short / long / etc) or CPU (though the later CPUs are ~30% faster)
Is computation on smaller integers (ie. byte / short / char) faster / slower / the same than on larger (int / long)?
In general in is the fastest, followed by long and the other types (byte / char / short)
For modulo (%) long is particularly slow
The same question as above for floating point numbers: are calculations on float faster / slower / the same as on double?
For addition / subtraction they are the same speed (note that I only tested numbers which are of similar magnitude - so perhaps the results are different if rescaling is involved)
For division float was 30% to 50% faster (depending on the CPU model) than double (!)
The order of operations (in decreasing order of ops/s) seems to be:
- addition / subtraction
- multiplication (about the same speed)
- division (about 30% to 50% slower)
- modulo (30% to 50% slower than division)
- multiplication / division on denormal values - this is at least 50x (yes, times not percent!) slower
Back to the original question: can floating point division be faster than the integer one?
Amazingly yes! The benchmark shows that float division is ~30% faster than integer division (which is approximately the same speed as double division and both are 2x faster than long division)
This does not hold for related operations like multiplication (where they are approximately the same speed) and modulo (where int more than twice as fast as the others)

The answers - part II

So this is it, right? Use float division? Lets apply this optimization to some real-world problems:

And the results are:

For vector multiplication we indeed see a 2x to 5x performance gain when using float instead of int
However for interpolation search we see the opposite: ints are ~30% faster than floats

Why might that be?

My working theory is that individual floating point divisions are (slightly) slower than integer divisions. However: there are probably more ALUs in the CPU which can do floating point operations and if the operations can be pipelined (ie. they can be executed in parallel / there is no data dependency between them) the overall throughput is better.

In other cases - like with interpolation search - where we need the result of the individual division before we can progress, pipelineing doesn’t help.

Some conclusions

Do your own benchmarks! Don’t trust a page on the internet! Things can change considerably with new CPU architectures / JVMs!

Consider investing in hardware. There can be considerable performance gain in running on the latest vs. 3-4 year old hardware and it’s a simple upgrade
Store your data in int / float if it fits. Anything smaller hurts your arithmetic performance (but can help considerably if you’re bottlenecked on memory bandwidth)
Use the correct kind of data (floating point for simulations and integers for everything else - including monetary values which should be stored as multiples of 1/1000th of a cent)
Don’t use denormalized floating point numbers for multiplication / division if you can help it
Of course there is a lot more to floating point algorithms like numerical stability so thread lightly when mucking around with such algorithms
Using shifts / masking where appropriate (for example multiplying / dividing by powers of two) is still a very worthwhile optimization (the JVM/JIT does this automatically when it can “see” the constant part - ie. there is no need to write v >> 2 instead of v / 4)

The setup

The answers

The answers - part II

Some conclusions

Related Posts: