When you use a language model like ChatGPT, the server often recognizes the beginning of your prompt (system instructions, common question patterns, shared context) before you've even finished typing. AI servers cache that repeated text so they don't have to recompute it every time. The problem is that many threads access this cache at the same time, and if the data structure isn't right, the cache itself becomes a major bottleneck.
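To make the idea concrete, here is a minimal single-threaded sketch of a prefix cache, assuming keys are hashes of token prefixes and the cached value stands in for precomputed server state. The names (`PrefixCache`, `CachedState`) are illustrative, not the production design.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for the precomputed state a server would reuse
/// (e.g. attention state for an already-seen prompt prefix).
struct CachedState {
    tokens_covered: usize,
}

/// Minimal single-threaded prefix cache: longest cached token
/// prefix wins. The concurrency question comes next.
struct PrefixCache {
    entries: HashMap<u64, CachedState>,
}

impl PrefixCache {
    fn hash_prefix(tokens: &[u32]) -> u64 {
        let mut h = DefaultHasher::new();
        tokens.hash(&mut h);
        h.finish()
    }

    fn insert(&mut self, prefix: &[u32]) {
        self.entries.insert(
            Self::hash_prefix(prefix),
            CachedState { tokens_covered: prefix.len() },
        );
    }

    /// Walk from the full prompt down to shorter prefixes and
    /// return the longest one already computed.
    fn lookup(&self, tokens: &[u32]) -> Option<&CachedState> {
        (1..=tokens.len())
            .rev()
            .find_map(|n| self.entries.get(&Self::hash_prefix(&tokens[..n])))
    }
}

fn main() {
    let mut cache = PrefixCache { entries: HashMap::new() };
    cache.insert(&[1, 2, 3]); // the shared system-prompt prefix, cached once
    let hit = cache.lookup(&[1, 2, 3, 4, 5]).unwrap();
    println!("reusing state for {} tokens", hit.tokens_covered);
}
```

Everything below is about what happens when this lookup is no longer single-threaded.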
We built 7 different data structures that all do the same job: fast lookups under heavy concurrent access. We measured which is fastest, and the answer turns out to depend heavily on what kind of traffic you throw at it.
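The 7 variants are interchangeable behind one lookup interface, which is what makes an apples-to-apples comparison possible. Here is a hypothetical sketch of that shape with the simplest variant (one global lock) plugged in; the trait name, the 90/10 read/write mix, and the key range are my assumptions, not the measured workloads.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;

/// The contract every variant implements: concurrent reads and
/// writes against one shared instance.
trait ConcurrentMap: Send + Sync {
    fn get(&self, key: u64) -> Option<u64>;
    fn insert(&self, key: u64, value: u64);
}

/// The simplest variant: one global mutex around a plain map.
struct CoarseMap(Mutex<HashMap<u64, u64>>);

impl ConcurrentMap for CoarseMap {
    fn get(&self, key: u64) -> Option<u64> {
        self.0.lock().unwrap().get(&key).copied()
    }
    fn insert(&self, key: u64, value: u64) {
        self.0.lock().unwrap().insert(key, value);
    }
}

/// Hammer one shared map from `threads` workers with an assumed
/// 90/10 read/write mix and report total ops per second.
fn bench(map: Arc<dyn ConcurrentMap>, threads: usize, ops: usize) -> f64 {
    let start = Instant::now();
    let workers: Vec<_> = (0..threads)
        .map(|t| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                for i in 0..ops {
                    let key = ((t * ops + i) % 10_000) as u64;
                    if i % 10 == 0 {
                        map.insert(key, key); // ~10% writes
                    } else {
                        map.get(key); // ~90% reads
                    }
                }
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
    (threads * ops) as f64 / start.elapsed().as_secs_f64()
}

fn main() {
    let map = Arc::new(CoarseMap(Mutex::new(HashMap::new())));
    println!("{:.0} ops/sec", bench(map, 8, 1_000_000));
}
```

Swapping in a different variant means swapping the one line in `main`; the driver never changes.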
Each of the 7 designs is broken down below. First, the headline results: two winners, and how every variant stacks up across workloads.
Picking a workload changes the winner: random keys favor the cuckoo table's lock-free reads, while real ChatGPT traces carry enough writes that fine-grained chaining pulls ahead (sketched below).
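Fine-grained chaining fits on one screen: hash the key to a bucket and lock only that bucket's short chain, so two threads contend only when their keys collide on the same bucket. A minimal sketch, with illustrative names and an assumed fixed bucket count:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const BUCKETS: usize = 1024;

/// Fine-grained chaining: one short chain and one lock per
/// bucket. Writes scale with the bucket count because threads
/// only serialize when their keys hash to the same bucket.
struct ChainedMap {
    buckets: Vec<Mutex<Vec<(u64, u64)>>>,
}

impl ChainedMap {
    fn new() -> Self {
        Self {
            buckets: (0..BUCKETS).map(|_| Mutex::new(Vec::new())).collect(),
        }
    }

    fn bucket(&self, key: u64) -> &Mutex<Vec<(u64, u64)>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.buckets[(h.finish() as usize) % BUCKETS]
    }

    fn get(&self, key: u64) -> Option<u64> {
        let chain = self.bucket(key).lock().unwrap();
        chain.iter().find(|(k, _)| *k == key).map(|(_, v)| *v)
    }

    fn insert(&self, key: u64, value: u64) {
        let mut chain = self.bucket(key).lock().unwrap();
        match chain.iter_mut().find(|(k, _)| *k == key) {
            Some(entry) => entry.1 = value,
            None => chain.push((key, value)),
        }
    }
}

fn main() {
    let map = ChainedMap::new();
    map.insert(42, 7);
    assert_eq!(map.get(42), Some(7));
}
```

The cuckoo variant goes further and takes no lock at all on reads, which is why it wins when traffic is nearly read-only; every write here still pays for one bucket lock.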
Throughput can look close while tail latency diverges by 40x: two variants keep the 99th-percentile request under 2 µs, while the other three stall for 30–60 µs at the tail.
Same tree shape, same data; only the thread-coordination strategy differs, and as keys share more prefixes, contention rises and the gap widens dramatically. Fine-grained locking scales 3.8x as threads walk the tree independently, while the coarse version actually gets slower as threads pile up at its single lock. That's the entire 13x gap at 8 threads.
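Here is a sketch of the two coordination strategies on the same tree shape, assuming a byte-keyed prefix tree; the node layout and names are illustrative, not the benchmarked code.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Coarse: one lock around the whole trie. Every thread,
/// whatever subtree it wants, serializes at this mutex.
struct CoarseTree {
    root: Mutex<TrieNode>,
}

#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
}

impl CoarseTree {
    fn insert(&self, key: &[u8]) {
        // The lock is held for the entire walk: this is the pile-up.
        let mut guard = self.root.lock().unwrap();
        let mut node = &mut *guard;
        for &b in key {
            node = node.children.entry(b).or_default();
        }
    }
}

/// Fine: one lock per node. A thread locks only the node it is
/// currently expanding, so inserts into different subtrees
/// never touch the same mutex.
#[derive(Default)]
struct FineNode {
    children: Mutex<HashMap<u8, Arc<FineNode>>>,
}

struct FineTree {
    root: Arc<FineNode>,
}

impl FineTree {
    fn insert(&self, key: &[u8]) {
        let mut node = Arc::clone(&self.root);
        for &b in key {
            // Lock this node's child table just long enough to
            // grab (or create) the next hop, then release.
            let next = {
                let mut children = node.children.lock().unwrap();
                Arc::clone(children.entry(b).or_default())
            };
            node = next;
        }
    }
}

fn main() {
    let fine = FineTree { root: Arc::new(FineNode::default()) };
    fine.insert(b"shared prefix");
    let coarse = CoarseTree { root: Mutex::new(TrieNode::default()) };
    coarse.insert(b"shared prefix");
}
```

With the fine version, 8 threads inserting keys that diverge at the first byte never share a lock; with the coarse version they all queue at `root`. That's the 13x gap in miniature.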