Some time ago I came over the following article: Fast CRC32 in Assembly. It claimed that the assembly implementation was faster than the one implemented in C. Performance was always something I’m interested in, so I repeated and extended the experiment.
Here are the numbers I got. This is on a Core 2 Duo T5500 @ 1.66 Ghz processor. The numbers express Mbits/sec processed:
- The assembly version from the blogpost (table taken from here): \~1700
- Optimized C implementation (taken from the same source): \~1500. The compiler used was Microsoft Visual C++ Express 2010
- Unoptimized C implementation (ie. Debug build): \~900
- Java implementation using polynomials: \~100 (using JRE 1.6.0_23)
- Java implementation using table: \~1900
- Built-in Java implementation: \~1700
- (No IE9 test – they don’t offer it for Windows XP)
- Hand coding assembly is not necessary in 99.999% (then again 80% of all statistics are made up :-p). Using better tools or better algorithms (see the “Java table based” vs. “Java polynomial”) can give just as good of performance improvement. Maintainability and portability (almost always) trump performance
- Be pragmatic. Are you sure that your performance is CPU bound? If you are calculating a CRC32 of disk files, a gigabit per second is more than enough
- Revisit your assumptions periodically (especially if you are dealing with legacy code). The performance characteristics of modern systems (CPUs) differ enormously from the old ones. I would wager that on an old CPU with little cache the polynomial version would have performed much better, but now that we have CPU caches measured in MB rather than KB the table one performs much better
Some other interesting remarks:
- The source code can be found in my repo. Unfortunately I can’t include the C version since I managed to delete it by mistake :-(
- The file used to benchmark the different implementations was a PDF copy of the Producing Open Source Software book
- The HTML5 implementation is surprisingly inconsistent between
Firefox and Chrome, so I needed to add the following line to keep
them both happy:
var blob = file.slice ? file.slice(start, len) : file;
- The original post mentions a variation of the algorithm which can take 16 bits at one (rather than 8) which could result in a speed improvement (and maybe it can be extended to 32 bits)
- Be aware of the “free” tools from Microsoft! This article would have been published sooner if it wasn’t for the fact MSVC++ 2010 Express require an online registration and when I had time I had no Internet access!
- Update: If you want to run the experiment with GCC, you might find the following post useful: Intel syntax on GCC
Picture taken from the TheGiantVermin’s photostream with permission.