Redis Clone: Virtual Threads (Project Loom)
In the last post we profiled the naive Redis Clone. One of the things showing up in the profile was the flushing of the response.
In this post we try out Virtual Threads and Structured Concurrency from the JDK 19 preview and see if we can use them to improve on the flushing. This is certainly not a primary use case for these features, but let’s try it out anyway.
First, I installed a JDK 19 preview (build 19-ea+34-2229).
Then I ran the existing app with it, to make sure the performance is still in the same ballpark:
ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec   Avg. Latency   p50 Latency   p99 Latency   p99.9 Latency      KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets       342161.69          ---          ---        2.00145       1.47100      11.26300        59.13500   101542.08
Gets      3421575.73   1283516.83   2138058.90        1.99868       1.47100      11.26300        58.62300   457923.67
Waits           0.00          ---          ---            ---           ---           ---             ---         ---
Totals    3763737.42   1283516.83   2138058.90        1.99893       1.47100      11.26300        58.62300   559465.75
I also checked how many threads are used. The app uses a thread per connection, and the benchmark opens 256 connections (8 benchmark threads times 32 clients each), therefore it uses over 256 threads. The extra threads are the JIT, GC, main, and other JVM internal threads:
$ jps
> 1890 redis-experiment.jar
$ ls -l /proc/1890/task/ | wc -l
> 290
Virtual Threads
Java is on its way to getting virtual threads (also known as Project Loom). These are threads similar to Go’s goroutines: you can spin up tens of thousands, even millions of them. This lets Java keep its existing concurrency approach but scale it to much finer-grained operations.
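To illustrate the scale, here’s a minimal sketch (my own example, not code from the Redis clone): a million sleeping tasks, each on its own virtual thread, run fine where a million OS threads would exhaust the system.

import java.time.Duration;
import java.util.concurrent.Executors;

public class ManyThreads {
    public static void main(String[] args) {
        // Each submitted task runs on its own freshly created virtual thread.
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 1_000_000; i++) {
                executor.submit(() -> {
                    // Blocking is cheap: the virtual thread unmounts from its
                    // carrier OS thread while it sleeps.
                    Thread.sleep(Duration.ofSeconds(1));
                    return null;
                });
            }
        } // close() implicitly waits for all tasks to finish
    }
}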
The first thing I did was to replace the existing thread pool executor with a virtual thread executor. Otherwise, I kept the existing blocking code. Additionally, in JDK 19 you need to enable this preview feature with the --enable-preview flag.
// One virtual thread per connection instead of a pooled OS thread
var scheduler = Executors.newVirtualThreadPerTaskExecutor();
var socket = new ServerSocket();
socket.bind(new InetSocketAddress("0.0.0.0", 16379));
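For context, the accept loop then hands each connection to that executor, roughly like this (a sketch; handleConnection stands in for the existing per-connection command loop from the previous posts):

// Accept connections forever; each connection handler runs on its own virtual thread.
while (true) {
    var connection = socket.accept();
    scheduler.submit(() -> {
        try (connection) {
            handleConnection(connection);  // existing blocking read/write loop
        }
        return null;
    });
}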
Then I ran that version. And the performance is about the same:
ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec   Avg. Latency   p50 Latency   p99 Latency   p99.9 Latency      KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets       353590.98          ---          ---        1.96611       1.80700       4.19100        36.60700   104933.91
Gets      3535869.61   1356612.30   2179257.31        1.96678       1.80700       4.22300        36.60700   480863.88
Waits           0.00          ---          ---            ---           ---           ---             ---         ---
Totals    3889460.58   1356612.30   2179257.31        1.96672       1.80700       4.22300        36.60700   585797.79
However, there is a big difference: it only used around 56 OS threads. The whole idea is that you can use a thread for each unit of work (like a web request) and have tons of them, without creating too many OS-level threads.
$ jps
> 3429 redis-experiment-classic.jar
> 7059 Jps
$ ls -l /proc/3429/task/ | wc -l
> 56
Trying Delayed Flushing with Structured Concurrency Constructs
Since virtual threads are cheap compared to OS-based threads, you can try to use them for small concurrent tasks. So I experimented to see whether I could use virtual threads to overlap some IO operations. The main plan is this: instead of always flushing, I delay the flush and first try to read the next command from the client. Only if no incoming work arrives from the client within a short timeout are the results flushed. This way many pipelined client requests can be processed before flushing.
I used the experimental StructuredTaskScope.ShutdownOnFailure feature from JEP 428 for this. It lives in an incubator module, so the command line option --add-modules jdk.incubator.concurrent is needed as well.
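Putting the flags together, the app is then started roughly like this (the exact invocation is my assumption; the jar name is taken from the jps output above):

$ java --enable-preview --add-modules jdk.incubator.concurrent -jar redis-experiment.jar

With that in place, here is the delayed-flush version of the command loop: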
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    while (true) {
        args.clear();
        // Read the next command on a forked virtual thread
        var lineReading = scope.fork(reader::readLine);
        String line;
        try {
            // Try to get the next command within a short timeout.
            // If it arrives in time, do not flush yet: this way the writes
            // of multiple pipelined commands get batched up.
            line = lineReading.get(2, TimeUnit.MICROSECONDS);
        } catch (TimeoutException e) {
            // No command arrived in time, so flush the written data in the
            // background while we keep waiting for the next command.
            scope.fork(() -> { writer.flush(); return null; });
            line = lineReading.get();
        }
        // Existing code ...
        var reply = executeCommand(args);
        if (reply == null) {
            writer.write("$-1\r\n");
        } else {
            writer.write("$" + reply.length() + "\r\n" + reply + "\r\n");
        }
        // We flush later, when we repeat the loop
    }
    // Ensure all background tasks are done
    scope.join();
    // Surface any failures from the background threads
    scope.throwIfFailed();
}
And the performance is… catastrophically bad, roughly one order of magnitude slower:
ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec   Avg. Latency   p50 Latency   p99 Latency   p99.9 Latency      KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets        42444.92          ---          ---       16.39105      15.16700      49.91900       187.39100    12596.23
Gets       424402.63     25985.48    398417.16       16.44213      15.23100      50.17500       186.36700    23104.71
Waits           0.00          ---          ---            ---           ---           ---             ---         ---
Totals     466847.55     25985.48    398417.16       16.43749      15.23100      50.17500       186.36700    35700.94
I didn’t investigate why the performance is so bad. One hideous part of the solution is that it has to catch a TimeoutException just to detect the timeout, because the API offers no other way to wait with a deadline. That exception overhead could also be part of the performance issue.
So, I went back and tried a simpler approach: what if we just do the flush in the background? I removed the delayed flushing and only moved the flush onto a background virtual thread:
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    while (true) {
        args.clear();
        var line = reader.readLine();
        // Original code
        var reply = executeCommand(args);
        if (reply == null) {
            writer.write("$-1\r\n");
        } else {
            writer.write("$" + reply.length() + "\r\n" + reply + "\r\n");
        }
        // Flush on a background virtual thread instead of inline
        scope.fork(() -> {
            writer.flush();
            return null;
        });
    }
    scope.join();
    scope.throwIfFailed();
}
And the performance is… still worse! Not an order of magnitude this time, but clearly slower than flushing inline, with higher latencies across the board:
ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec   Avg. Latency   p50 Latency   p99 Latency   p99.9 Latency      KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets       325339.40          ---          ---        2.13675       1.89500       6.59100        39.67900    96549.78
Gets      3253351.21   1176211.82   2077139.38        2.14078       1.90300       6.59100        39.67900   424230.01
Waits           0.00          ---          ---            ---           ---           ---             ---         ---
Totals    3578690.61   1176211.82   2077139.38        2.14041       1.90300       6.59100        39.67900   520779.79
Again, I didn’t investigate what degrades the performance. I have two guesses:
- The classic Java IO has locks on its streams. Since we now flush in the background, we might create a lot of contention on a stream lock, making the performance worse.
- Since we flush in the background, we might flush data on another physical core with cold caches, thus creating more cache / memory traffic between the cores.
- Or it is something completely different ;)
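To get a feel for the first guess, here is a toy sketch (my own illustration, not code from the project, and not a proper benchmark) of the pattern: one loop keeps writing while every flush is forked to a virtual thread. BufferedWriter’s write() and flush() synchronize on the same internal lock object, so the background flush serializes against the very next write instead of running truly in parallel:

import java.io.BufferedWriter;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.util.concurrent.Executors;

public class FlushContention {
    public static void main(String[] args) throws Exception {
        // Discard all bytes; we only care about the locking behavior.
        var writer = new BufferedWriter(new OutputStreamWriter(OutputStream.nullOutputStream()));
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            long start = System.nanoTime();
            for (int i = 0; i < 1_000_000; i++) {
                writer.write("$5\r\nhello\r\n");
                // The forked flush takes the same lock as the next write() above.
                executor.submit(() -> { writer.flush(); return null; });
            }
            System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
        }
    }
}

Comparing a run of this against a variant that flushes inline would at least confirm or rule out the lock contention guess.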
Conclusion: for this tight IO loop, ‘just smearing’ virtual threads over it does not work.
Virtual Threads Still Help with More Connections
So, I rolled back to using virtual threads only for the connection handling and left the flushing as is.
However, virtual threads still shine when more connections come in. I ran the benchmark with more connections (1024 in total):
sudo docker run --name redisb --rm --network host memtier_benchmark \
  --server=$SERVER_IP --port=16379 \
  -t 8 -c 128 --test-time=30 --distinct-client-seed -d 256 --pipeline=30
With the classic thread pool, where each connection gets an OS-level thread, the latencies start to suffer:
ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec   Avg. Latency   p50 Latency   p99 Latency   p99.9 Latency      KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets       370192.21          ---          ---        7.35006       5.66300      40.44700       127.99900   109860.62
Gets      3701749.16   1466524.81   2235224.35        7.34796       5.66300      40.44700       126.97500   515125.68
Waits           0.00          ---          ---            ---           ---           ---             ---         ---
Totals    4071941.37   1466524.81   2235224.35        7.34815       5.66300      40.44700       126.97500   624986.29

# Amount of threads
$ ls -l /proc/5985/task/ | wc -l
> 1063
The virtual thread implementation has no trouble keeping up with more connections:
ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec   Avg. Latency   p50 Latency   p99 Latency   p99.9 Latency      KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets       345859.81          ---          ---        8.06748       7.32700      27.00700        62.71900   102639.55
Gets      3458420.58   1306051.09   2152369.49        8.06871       7.35900      27.00700        62.46300   465058.51
Waits           0.00          ---          ---            ---           ---           ---             ---         ---
Totals    3804280.39   1306051.09   2152369.49        8.06860       7.35900      27.00700        62.46300   567698.06

$ ls -l /proc/7112/task/ | wc -l
> 56
Next Steps
Well, I still want to improve on the IO as discussed in the benchmark post. So, I’ll go down the Java NIO route and do some non-blocking IO.
Stay tuned ;)