EE 361C · Multicore Algorithms · UT Austin · Spring 2026
Concurrent Hash Tables and Radix Trees
Under LLM-Derived Workloads
Nathan Lemma  ·  Joshua Chikosha  ·  Yonatan Teshome
OVERVIEW

When you're using a language model like ChatGPT, the server often recognizes the beginning of your prompt (system instructions, common question patterns, shared context) before you've even finished typing. AI servers cache that repeated text so they don't have to recompute it every time. The problem is that many threads access this cache simultaneously, and if the data structure underneath isn't right, the cache itself becomes the bottleneck.

We built 7 different data structures that all do the same job: fast lookups under heavy simultaneous access. Then we measured which is fastest. The answer turns out to depend heavily on what kind of traffic you throw at it.

KEY FINDINGS
The winner changes
On standard random-access benchmarks, one design (cuckoo hashing) wins easily. Switch to real ChatGPT conversation data, which has more writes mixed in, and a simpler design (fine-grained chaining) takes the lead.
13x faster
We took the exact same tree structure and only changed how threads take turns accessing it. One version makes every thread wait in a single line. The other lets threads navigate freely. Result: 13x more throughput at 8 threads.
ShareGPT Tests
A design that looks great on random keys can be the worst choice for real AI traffic. You can't pick the data structure and the thread strategy independently; they interact, and both have to match the workload.
THE STRUCTURES

The seven designs fall into two families:

Hash tables · 5 variants
Radix trees · 2 variants
PERFORMANCE

Two winners and how every variant stacks up across workloads.

Best hash vs. best tree
Cuckoo optimistic · hash winner
87.30 Mops/s
peak · uniform random keys · 95% reads · 16 threads
LLM trace throughput · 29.04 Mops/s
p99 latency · 1.36 µs
read path · seqlock, no lock taken
write path · per-bucket locks + rare global lock
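
To make "seqlock, no lock taken" concrete, here is a minimal sketch of the idea in C++. The Bucket layout and names are ours, not the benchmarked code: readers snapshot a version counter, read the slots, and retry if a writer bumped the version underneath them; writers make the counter odd while they mutate.

    #include <atomic>
    #include <cstdint>
    #include <optional>

    struct Bucket {
        std::atomic<uint64_t> version{0};  // even = stable, odd = write in progress
        uint64_t key[4];
        uint64_t val[4];
    };

    // Optimistic read: no lock taken. Retry if a writer raced us.
    std::optional<uint64_t> seqlock_find(Bucket& b, uint64_t k) {
        for (;;) {
            uint64_t v1 = b.version.load(std::memory_order_acquire);
            if (v1 & 1) continue;                     // writer active, spin
            std::optional<uint64_t> hit;
            for (int i = 0; i < 4; ++i)
                if (b.key[i] == k) hit = b.val[i];
            std::atomic_thread_fence(std::memory_order_acquire);
            uint64_t v2 = b.version.load(std::memory_order_relaxed);
            if (v1 == v2) return hit;                 // snapshot was consistent
        }
    }

    // Writer side, assumed to already hold the per-bucket lock:
    // bump to odd, mutate, bump back to even.
    void seqlock_write(Bucket& b, int slot, uint64_t k, uint64_t v) {
        b.version.fetch_add(1, std::memory_order_acq_rel);  // now odd
        b.key[slot] = k;
        b.val[slot] = v;
        b.version.fetch_add(1, std::memory_order_release);  // even again
    }

A production seqlock reads the slots through relaxed atomics to stay within the C++ memory model; we elide that here for brevity.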
Radix fine · tree winner
88.36 Mops/s
peak · high key sharing · 95% reads · 8 threads
thread scaling · 3.81x (1T → 8T)
vs. coarse @ 8T · 13.47x faster
read path · atomic walk + leaf lock
write path · atomic walk + CAS + leaf lock
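
Read "atomic walk + CAS + leaf lock" as: descend through atomic child pointers without locking, install missing children with a compare-and-swap, and take a lock only at the leaf you land on. A minimal insert sketch under that reading (the node layout is illustrative, not the benchmarked code):

    #include <atomic>
    #include <cstdint>
    #include <mutex>

    constexpr int FANOUT = 256;   // one byte of the key per level

    struct Node {
        std::atomic<Node*> child[FANOUT] = {};
        std::mutex leaf_lock;     // taken only at the terminal node
        uint64_t value = 0;
        bool present = false;
    };

    // Insert: lock-free descent; CAS installs missing children;
    // the leaf lock guards only the value update.
    void radix_insert(Node* root, const uint8_t* key, size_t len, uint64_t v) {
        Node* n = root;
        for (size_t i = 0; i < len; ++i) {
            Node* next = n->child[key[i]].load(std::memory_order_acquire);
            if (!next) {
                Node* fresh = new Node();
                // Another thread may install first; the CAS decides the winner.
                if (n->child[key[i]].compare_exchange_strong(
                        next, fresh, std::memory_order_acq_rel))
                    next = fresh;
                else
                    delete fresh;   // lost the race; walk through the winner's node
            }
            n = next;
        }
        std::lock_guard<std::mutex> g(n->leaf_lock);  // contention only at the leaf
        n->value = v;
        n->present = true;
    }

The losing CAS thread frees its node and keeps walking through the winner's, so two writers only ever contend when they end at the same leaf.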
Hash table throughput (Mops/s) · 16 threads

Picking a workload changes the winner. Random keys favor cuckoo's lock-free reads; real ChatGPT traces have enough writes that fine-grained chaining pulls ahead.
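
For reference, fine-grained chaining is the simplest of the five hash designs: an array of buckets, one mutex and one linked chain per bucket. A minimal sketch (ours, assuming 64-bit keys and values; cleanup omitted for brevity):

    #include <cstdint>
    #include <functional>
    #include <mutex>
    #include <vector>

    struct ChainNode {
        uint64_t key, val;
        ChainNode* next;
    };

    class FineChainedMap {
        struct Bucket {
            std::mutex lock;            // one lock per bucket: writers only
            ChainNode* head = nullptr;  // collide when they hash together
        };
        std::vector<Bucket> buckets_;
    public:
        explicit FineChainedMap(size_t n) : buckets_(n) {}

        void put(uint64_t k, uint64_t v) {
            Bucket& b = buckets_[std::hash<uint64_t>{}(k) % buckets_.size()];
            std::lock_guard<std::mutex> g(b.lock);
            for (ChainNode* p = b.head; p; p = p->next)
                if (p->key == k) { p->val = v; return; }
            b.head = new ChainNode{k, v, b.head};
        }

        bool get(uint64_t k, uint64_t& out) {
            Bucket& b = buckets_[std::hash<uint64_t>{}(k) % buckets_.size()];
            std::lock_guard<std::mutex> g(b.lock);  // reads also take the bucket lock
            for (ChainNode* p = b.head; p; p = p->next)
                if (p->key == k) { out = p->val; return true; }
            return false;
        }
    };

Writers that hash to different buckets never touch the same lock, which is why a trace with more writes favors this design over optimistic reads that must coordinate with concurrent relocation.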

Tail latency on the LLM trace: how slow is the slowest 1%?

Throughput can look close while tail latency diverges 40x. Two variants keep the 1-in-100 request under 2 µs; the other three stall 30–60 µs at the tail.

Radix tree throughput (Mops/s) · 8 threads · 95% reads

Same tree shape. Only the thread-coordination strategy differs. As keys share more prefixes (higher sharing), contention rises and the gap widens dramatically.

Does adding threads actually help? (1 thread → 8 threads)

Same tree, same data. Fine scales 3.8x as threads walk the tree independently. Coarse gets slower: threads pile up at the single lock. That's the entire 13x gap at 8 threads.

Fine: 21.5 → 81.9 Mops/s · 3.8x faster
Coarse: 21.5 → 6.1 Mops/s · 3.5x slower
Cuckoo optimistic is shown as a reference: it's the hash winner, included to frame the tree numbers against the best hash table.
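
For contrast with the fine-grained sketch above: the coarse variant wraps the same tree in one global mutex, so every operation, read or write, queues through a single lock. A sketch of what that wrapper looks like (illustrative names, not the benchmarked code):

    #include <mutex>

    // Hypothetical coarse wrapper: one mutex guards the whole tree.
    // Eight threads form an eight-way queue here, plus cache-line
    // ping-pong on the lock word itself; throughput falls below the
    // single-thread number.
    template <typename Tree>
    class CoarseLocked {
        Tree tree_;
        std::mutex global_;
    public:
        template <typename K, typename V>
        void put(const K& k, const V& v) {
            std::lock_guard<std::mutex> g(global_);
            tree_.put(k, v);
        }
        template <typename K, typename V>
        bool get(const K& k, V& out) {
            std::lock_guard<std::mutex> g(global_);  // even reads serialize
            return tree_.get(k, out);
        }
    };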