Is hand-writing assembly still necessary these days?

Some time ago I came over the following article: Fast CRC32 in Assembly. It claimed that the assembly implementation was faster than the one implemented in C. Performance was always something I’m interested in, so I repeated and extended the experiment.

Here are the numbers I got. This is on a Core 2 Duo T5500 @ 1.66 Ghz processor. The numbers express Mbits/sec processed:

The assembly version from the blogpost (table taken from here): \~1700
Optimized C implementation (taken from the same source): \~1500. The compiler used was Microsoft Visual C++ Express 2010
Unoptimized C implementation (ie. Debug build): \~900
Java implementation using polynomials: \~100 (using JRE 1.6.0_23)
Java implementation using table: \~1900
Built-in Java implementation: \~1700
Javascript (for the fun of it) implementation (using the code from here with optimization – storing the table as numeric rather than string) on Firefox 4.0 Beta 10: \~80
Javascript on Chrome 10.0.648.18: \~40
(No IE9 test – they don’t offer it for Windows XP)

Final thoughts:

Hand coding assembly is not necessary in 99.999% (then again 80% of all statistics are made up :-p). Using better tools or better algorithms (see the “Java table based” vs. “Java polynomial”) can give just as good of performance improvement. Maintainability and portability (almost always) trump performance
Be pragmatic. Are you sure that your performance is CPU bound? If you are calculating a CRC32 of disk files, a gigabit per second is more than enough
Revisit your assumptions periodically (especially if you are dealing with legacy code). The performance characteristics of modern systems (CPUs) differ enormously from the old ones. I would wager that on an old CPU with little cache the polynomial version would have performed much better, but now that we have CPU caches measured in MB rather than KB the table one performs much better
Javascript engines are getting better and better.

Some other interesting remarks:

The source code can be found in my repo. Unfortunately I can’t include the C version since I managed to delete it by mistake :-(
The file used to benchmark the different implementations was a PDF copy of the Producing Open Source Software book
The HTML5 implementation is surprisingly inconsistent between Firefox and Chrome, so I needed to add the following line to keep them both happy: var blob = file.slice ? file.slice(start, len) : file;
The Javascript code doesn’t work unless it is loaded via the http(s) protocol. Loading it from a local file gives “Error no. 4”, so I used a small python webserver
Javascript timing has some issues, but my task took longer than 15ms, so I got stable measurements
The original post mentions a variation of the algorithm which can take 16 bits at one (rather than 8) which could result in a speed improvement (and maybe it can be extended to 32 bits)
Be aware of the “free” tools from Microsoft! This article would have been published sooner if it wasn’t for the fact MSVC++ 2010 Express require an online registration and when I had time I had no Internet access!
Update: If you want to run the experiment with GCC, you might find the following post useful: Intel syntax on GCC

Picture taken from the TheGiantVermin’s photostream with permission.

Related Posts: