One Billion Row Challenge: Learned So Far
Last update: 2024-02-04, see below.
I’m participating in the One Billion Row Challenge by Gunnar Morling: parse one billion rows of CSV, in plain Java, and be fast at it. It is a friendly competition and learning experience.
I had three goals:
Be faster than the 'naive' reference solution.
Experiment a bit more with the new memory access APIs.
Experiment for the first time with the new SIMD APIs.
My basic idea was this (a rough code sketch follows the list):
Memory map the file as a MemorySegment.
Instead of looking at one character at a time, use SIMD instructions to scan a bunch of characters in bulk.
Avoid copying memory into the Java heap.
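Here is a minimal sketch of that setup. It is not my actual solution: the file name is a placeholder, and in JDK 21 the foreign memory API is a preview feature, so it needs --enable-preview.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapMeasurements {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("measurements.txt"); // placeholder file name
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ);
             Arena arena = Arena.ofConfined()) {
            // Map the whole file into off-heap memory: nothing is copied into the Java heap.
            MemorySegment data =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
            System.out.println("Mapped " + data.byteSize() + " bytes");
        }
    }
}
```

The mapped segment is then the only view of the data: parsing works directly against it, without intermediate byte[] copies.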
I mostly achieved that. Here are a few surprises along the way:
Bugs! Or try out Preview Features
The two APIs I used are in preview, and I did encounter two bugs with them in JDK 21.
I had JVM SIGSEGV crashes. It was quite clear to me that I had miscalculated some indexes. However, the JVM should raise an exception instead of crashing with a SIGSEGV =). The bug is a missing bounds check in VectorSpecies.fromMemorySegment when it is compiled by the JVM’s C2 compiler. See my reproduction case on the mailing list if you are curious.
More evil: The MemorySegment.mismatch method has a bug if you compare different regions of the same memory segment: it always returns -1 and doesn’t do the actual comparison.
See my reproduction case on the mailing list if you are curious.
Conclusion: Try out preview features =). Squashing silly bugs or API quirks before they become a final feature certainly helps the JDK overall.
Thinking in SIMD is a good exercise
Trying to use SIMD instructions is a good exercise for performance overall. My actual code isn’t especially fast, nor does it use SIMD well, but thinking in 'bulk' operations to solve a problem was fun.
As an industry, we should think way more in 'bulk' operations for all our APIs. I see so many performance issues because APIs are designed with a 'one thing at a time' focus instead of a 'do this in bulk' approach, which makes things orders of magnitude slower than they need to be.
Think in bulk operations, at every level! Computers & networks are good at that.
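For example, finding the next ';' delimiter can be done one vector of bytes at a time instead of one byte at a time. The following is only a sketch using the JDK 21 incubator Vector API (module jdk.incubator.vector), not my actual parsing code, and it leaves out the scalar handling of the last partial vector:

```java
import java.lang.foreign.MemorySegment;
import java.nio.ByteOrder;
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

class DelimiterScan {
    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Returns the offset of the next ';' at or after 'start', or -1 if none is found
    // in the range that can be scanned with full vectors.
    static long nextSemicolon(MemorySegment data, long start) {
        long limit = data.byteSize() - SPECIES.length();
        for (long offset = start; offset <= limit; offset += SPECIES.length()) {
            // Load a whole vector of bytes and compare all lanes against ';' at once.
            ByteVector chunk =
                    ByteVector.fromMemorySegment(SPECIES, data, offset, ByteOrder.LITTLE_ENDIAN);
            VectorMask<Byte> matches = chunk.eq((byte) ';');
            if (matches.anyTrue()) {
                return offset + matches.firstTrue();
            }
        }
        return -1; // the tail of the segment still needs a scalar fallback
    }
}
```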
MemorySegment.get vs .getAtIndex
Because I tried to do bulk operations, I read the text as 32-bit int chunks instead of individual 8-bit bytes in parts of my code. Then I had a puzzling bug where my 32-bit approach gave the "wrong" result.
That puzzled me until I re-read the docs of the .getAtIndex method I used: it treats the specified index as an 'int-array' index, not a raw memory offset. In contrast, the get method (https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/foreign/MemorySegment.html#get(java.lang.foreign.ValueLayout.OfInt,long)) reads from the exact memory offset you pass. For example: I thought I was reading at memory offset 42, but I actually read int-array index 42, which is memory offset 42*4.
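A small standalone demo of the difference (again a JDK 21 preview API, so it needs --enable-preview):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class GetVsGetAtIndex {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(64, 4); // 64 bytes, int-aligned
            segment.set(ValueLayout.JAVA_INT, 8, 111);  // int stored at byte offset 8
            segment.set(ValueLayout.JAVA_INT, 32, 222); // int stored at byte offset 32

            // get() treats 8 as a raw byte offset: reads bytes 8..11 -> 111.
            int byOffset = segment.get(ValueLayout.JAVA_INT, 8);
            // getAtIndex() treats 8 as an int-array index: reads bytes 32..35 -> 222.
            int byIndex = segment.getAtIndex(ValueLayout.JAVA_INT, 8);

            System.out.println(byOffset + " vs " + byIndex);
        }
    }
}
```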
Updates since 12th January
2024-01-13: Parallel Processing added
I’ve added parallel processing to my solution. It now runs in under 10 seconds, compared to the 4 minutes and 40 seconds of the reference implementation.
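The core of it is cutting the memory-mapped segment into one slice per worker, with each cut moved forward to the next newline so that no row is split across two slices. A rough sketch of that idea (illustrative names, not my actual code):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.ArrayList;
import java.util.List;

class Chunking {
    // Splits 'data' into roughly equal slices, each ending right after a '\n',
    // so that every row lies completely inside one slice.
    static List<MemorySegment> splitAtNewlines(MemorySegment data, int chunks) {
        List<MemorySegment> result = new ArrayList<>();
        long chunkSize = data.byteSize() / chunks;
        long start = 0;
        for (int i = 0; i < chunks && start < data.byteSize(); i++) {
            long end = (i == chunks - 1)
                    ? data.byteSize()
                    : Math.min(data.byteSize(), start + Math.max(1, chunkSize));
            // Extend the slice until it ends right after a newline (or at the end of the file).
            while (end < data.byteSize() && data.get(ValueLayout.JAVA_BYTE, end - 1) != '\n') {
                end++;
            }
            result.add(data.asSlice(start, end - start));
            start = end;
        }
        return result;
    }
}
```

Each slice can then be parsed by its own thread into a private hash table, and the tables are merged once at the end.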
2024-01-24: A stupid Hash Helps
I simplified the hash for the key to something mind-bogglingly stupid. Basically, if the key is 4 bytes or more, interpret its first bytes as an int and hash that together with the length. For smaller keys, do a more regular hash function. This dropped me from about 8+ seconds to around 5+ seconds. I also simplified the code to use more regular Java classes for the hash table entries. The performance didn’t suffer much: the extra memory indirection for the reference to that value object didn’t hurt as much as I had worried.
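In spirit, the hash looks something like this sketch (names are illustrative, and the exact mixing in my solution differs):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

class StupidHash {
    // Keys of 4 bytes or more: reinterpret the first 4 bytes as an int and mix in the length.
    // Shorter keys: fall back to a plain byte-by-byte hash.
    static int hash(MemorySegment data, long keyOffset, int keyLength) {
        if (keyLength >= 4) {
            int firstFourBytes = data.get(ValueLayout.JAVA_INT_UNALIGNED, keyOffset);
            return firstFourBytes ^ keyLength;
        }
        int h = keyLength;
        for (int i = 0; i < keyLength; i++) {
            h = 31 * h + data.get(ValueLayout.JAVA_BYTE, keyOffset + i);
        }
        return h;
    }
}
```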
2024-01-28: Disabling GC and Failed Experiments
On my machine, using the Epsilon GC (a.k.a. no GC) saved some more time and brought the run below 5 seconds in most cases. Disabling the GC is probably not suitable for 99% of all applications. However, for batch apps with a short runtime, or apps with extremely controlled allocations, Epsilon GC can be an option.
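For reference, turning it on is just a matter of JVM flags: -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC (plus a heap sized large enough, since nothing is ever reclaimed).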
I also experimented more with MemoryLayout and VarHandles. That ended in horrible performance; I didn’t investigate why exactly.
Further, I experimented with cramming the small keys directly into the hash table and doing a simplified equals check. That didn’t work out: the more complex code produced worse results.
The last thing I tried was some profiling. However, I failed to get fine-grained enough detail with assembly-level info. I was not familiar enough with the tooling and had to stop at that point.
2024-02-04: Different Run Results
Gunnar included my run in the 10K key set and the 32-core runs. These were added later as interesting data points.
Very interesting is the 32-core data: my solution is actually slower there than on 8 cores, at around 11 seconds. I guess I have some unintended contention somewhere.
A smaller surprise is that my solution isn’t fast on the 10K key set. My two guesses for why:
- My equality check is vectorized, but only for small keys. I was too lazy to make it complete for longer keys, and that dataset has longer keys (a rough sketch of the small-key check follows this list).
- My hash function is fast but horrible. With more keys, the collision costs kick in.
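For illustration, an equality check for keys that fit into a single vector might look roughly like this (a sketch with made-up names, not my actual code; bytes past the key length are masked out):

```java
import java.lang.foreign.MemorySegment;
import java.nio.ByteOrder;
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

class SmallKeyEquals {
    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Compares two keys of the same length that both fit into one vector register.
    static boolean equalSmallKeys(MemorySegment data, long offsetA, long offsetB, int length) {
        VectorMask<Byte> inKey = SPECIES.indexInRange(0, length);
        ByteVector a = ByteVector.fromMemorySegment(SPECIES, data, offsetA, ByteOrder.LITTLE_ENDIAN, inKey);
        ByteVector b = ByteVector.fromMemorySegment(SPECIES, data, offsetB, ByteOrder.LITTLE_ENDIAN, inKey);
        // Lanes outside the mask load as zero on both sides, so only the key bytes decide the result.
        return a.eq(b).allTrue();
    }
}
```

For longer keys this would need a loop over multiple vectors, which is exactly the part I skipped.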
Final words
So, where does my solution (Roman Stoffel, around the 4.4-second mark) land after all this low-level fiddling? It is more than 50 times faster than the original naive implementation, while the current fastest implementation is about 4 times faster than mine.
I had a great time experimenting with all these new APIs.