When you use a language model like ChatGPT, the server often recognizes the beginning of your prompt (system instructions, common question patterns, shared context) before you've even finished typing. AI servers cache that repeated text so they don't have to recompute it every time. The problem is that many threads access this cache at the same time, and if the data structure isn't right, the cache itself becomes a major bottleneck.
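To make the idea concrete, here is a minimal single-threaded sketch of a prefix cache, assuming keys are hashes of token prefixes and the cached value stands in for precomputed server state. The names (`PrefixCache`, `CachedState`) are illustrative, not the production design.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for the precomputed state a server would reuse
/// (e.g. attention state for an already-seen prompt prefix).
struct CachedState {
    tokens_covered: usize,
}

/// Minimal single-threaded prefix cache: longest cached token
/// prefix wins. The concurrency question comes next.
struct PrefixCache {
    entries: HashMap<u64, CachedState>,
}

impl PrefixCache {
    fn hash_prefix(tokens: &[u32]) -> u64 {
        let mut h = DefaultHasher::new();
        tokens.hash(&mut h);
        h.finish()
    }

    fn insert(&mut self, prefix: &[u32]) {
        self.entries.insert(
            Self::hash_prefix(prefix),
            CachedState { tokens_covered: prefix.len() },
        );
    }

    /// Walk from the full prompt down to shorter prefixes and
    /// return the longest one already computed.
    fn lookup(&self, tokens: &[u32]) -> Option<&CachedState> {
        (1..=tokens.len())
            .rev()
            .find_map(|n| self.entries.get(&Self::hash_prefix(&tokens[..n])))
    }
}

fn main() {
    let mut cache = PrefixCache { entries: HashMap::new() };
    cache.insert(&[1, 2, 3]); // the shared system-prompt prefix, cached once
    let hit = cache.lookup(&[1, 2, 3, 4, 5]).unwrap();
    println!("reusing state for {} tokens", hit.tokens_covered);
}
```

Everything below is about what happens when this lookup is no longer single-threaded.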
We built 7 different data structures that all do the same job: fast lookups under heavy concurrent access. We measured which is fastest, and the answer turns out to depend heavily on what kind of traffic you throw at it.
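The 7 variants are interchangeable behind one lookup interface, which is what makes an apples-to-apples comparison possible. Here is a hypothetical sketch of that shape with the simplest variant (one global lock) plugged in; the trait name, the 90/10 read/write mix, and the key range are my assumptions, not the measured workloads.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;

/// The contract every variant implements: concurrent reads and
/// writes against one shared instance.
trait ConcurrentMap: Send + Sync {
    fn get(&self, key: u64) -> Option<u64>;
    fn insert(&self, key: u64, value: u64);
}

/// The simplest variant: one global mutex around a plain map.
struct CoarseMap(Mutex<HashMap<u64, u64>>);

impl ConcurrentMap for CoarseMap {
    fn get(&self, key: u64) -> Option<u64> {
        self.0.lock().unwrap().get(&key).copied()
    }
    fn insert(&self, key: u64, value: u64) {
        self.0.lock().unwrap().insert(key, value);
    }
}

/// Hammer one shared map from `threads` workers with an assumed
/// 90/10 read/write mix and report total ops per second.
fn bench(map: Arc<dyn ConcurrentMap>, threads: usize, ops: usize) -> f64 {
    let start = Instant::now();
    let workers: Vec<_> = (0..threads)
        .map(|t| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                for i in 0..ops {
                    let key = ((t * ops + i) % 10_000) as u64;
                    if i % 10 == 0 {
                        map.insert(key, key); // ~10% writes
                    } else {
                        map.get(key); // ~90% reads
                    }
                }
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
    (threads * ops) as f64 / start.elapsed().as_secs_f64()
}

fn main() {
    let map = Arc::new(CoarseMap(Mutex::new(HashMap::new())));
    println!("{:.0} ops/sec", bench(map, 8, 1_000_000));
}
```

Swapping in a different variant means swapping the one line in `main`; the driver never changes.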
Each of the 7 designs is broken down below. First, the headline results: two winners, and how every variant stacks up across workloads.
Picking a workload changes the winner: random keys favor the cuckoo table's lock-free reads, while real ChatGPT traces carry enough writes that fine-grained chaining pulls ahead (sketched below).
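Fine-grained chaining fits on one screen: hash the key to a bucket and lock only that bucket's short chain, so two threads contend only when their keys collide on the same bucket. A minimal sketch, with illustrative names and an assumed fixed bucket count:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const BUCKETS: usize = 1024;

/// Fine-grained chaining: one short chain and one lock per
/// bucket. Writes scale with the bucket count because threads
/// only serialize when their keys hash to the same bucket.
struct ChainedMap {
    buckets: Vec<Mutex<Vec<(u64, u64)>>>,
}

impl ChainedMap {
    fn new() -> Self {
        Self {
            buckets: (0..BUCKETS).map(|_| Mutex::new(Vec::new())).collect(),
        }
    }

    fn bucket(&self, key: u64) -> &Mutex<Vec<(u64, u64)>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.buckets[(h.finish() as usize) % BUCKETS]
    }

    fn get(&self, key: u64) -> Option<u64> {
        let chain = self.bucket(key).lock().unwrap();
        chain.iter().find(|(k, _)| *k == key).map(|(_, v)| *v)
    }

    fn insert(&self, key: u64, value: u64) {
        let mut chain = self.bucket(key).lock().unwrap();
        match chain.iter_mut().find(|(k, _)| *k == key) {
            Some(entry) => entry.1 = value,
            None => chain.push((key, value)),
        }
    }
}

fn main() {
    let map = ChainedMap::new();
    map.insert(42, 7);
    assert_eq!(map.get(42), Some(7));
}
```

The cuckoo variant goes further and takes no lock at all on reads, which is why it wins when traffic is nearly read-only; every write here still pays for one bucket lock.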
Throughput can look close while tail latency diverges by 40x: two variants keep the 99th-percentile request under 2 µs, while the other three stall for 30–60 µs at the tail.
Same tree shape, same data; only the thread-coordination strategy differs, and as keys share more prefixes, contention rises and the gap widens dramatically. Fine-grained locking scales 3.8x as threads walk the tree independently, while the coarse version actually gets slower as threads pile up at its single lock. That's the entire 13x gap at 8 threads.
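Here is a sketch of the two coordination strategies on the same tree shape, assuming a byte-keyed prefix tree; the node layout and names are illustrative, not the benchmarked code.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Coarse: one lock around the whole trie. Every thread,
/// whatever subtree it wants, serializes at this mutex.
struct CoarseTree {
    root: Mutex<TrieNode>,
}

#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
}

impl CoarseTree {
    fn insert(&self, key: &[u8]) {
        // The lock is held for the entire walk: this is the pile-up.
        let mut guard = self.root.lock().unwrap();
        let mut node = &mut *guard;
        for &b in key {
            node = node.children.entry(b).or_default();
        }
    }
}

/// Fine: one lock per node. A thread locks only the node it is
/// currently expanding, so inserts into different subtrees
/// never touch the same mutex.
#[derive(Default)]
struct FineNode {
    children: Mutex<HashMap<u8, Arc<FineNode>>>,
}

struct FineTree {
    root: Arc<FineNode>,
}

impl FineTree {
    fn insert(&self, key: &[u8]) {
        let mut node = Arc::clone(&self.root);
        for &b in key {
            // Lock this node's child table just long enough to
            // grab (or create) the next hop, then release.
            let next = {
                let mut children = node.children.lock().unwrap();
                Arc::clone(children.entry(b).or_default())
            };
            node = next;
        }
    }
}

fn main() {
    let fine = FineTree { root: Arc::new(FineNode::default()) };
    fine.insert(b"shared prefix");
    let coarse = CoarseTree { root: Mutex::new(TrieNode::default()) };
    coarse.insert(b"shared prefix");
}
```

With the fine version, 8 threads inserting keys that diverge at the first byte never share a lock; with the coarse version they all queue at `root`. That's the 13x gap in miniature.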