Chengshuo Dai

The deployment of Large Language Models (LLMs) in production environments is fundamentally constrained by the memory bandwidth and capacity of GPUs. During autoregressive generation, an LLM predicts the next token in a sequence based on all preceding tokens. To avoid recalculating the attention scores for the entire sequence at every step, inference engines utilize a Key-Value (KV) Cache. This cache stores the intermediate Key and Value tensors for all previously processed tokens, allowing the model to only compute the attention for the newly generated token. While this drastically reduces computational complexity from $O(N^2)$ to $O(N)$ per step, it shifts the bottleneck from compute to memory.

The KV Cache grows linearly with the sequence length and the batch size. For a large model processing a long document, the KV Cache can quickly consume tens of gigabytes of VRAM, often exceeding the memory required to store the model weights themselves. This severely limits the maximum batch size, leading to low GPU utilization and high inference costs. Furthermore, traditional deep learning frameworks allocate memory for the KV Cache contiguously. Because the final length of a generated sequence is unknown at the start of the request, these frameworks must pre-allocate the maximum possible memory for each request to prevent out-of-memory errors. This leads to massive internal memory fragmentation, where up to 60-80% of the allocated VRAM is wasted on reserved but unused space.

To solve this critical inefficiency, researchers developed PagedAttention, the core technology behind the highly popular vLLM inference engine. PagedAttention draws inspiration from the virtual memory and paging systems used in modern operating systems. Instead of allocating a single, contiguous block of memory for a request's KV Cache, PagedAttention divides the cache into fixed-size blocks (e.g., blocks that can store the KV tensors for 16 tokens). These blocks do not need to be contiguous in physical GPU memory.

When a request arrives, vLLM allocates a small number of physical blocks. As the model generates tokens and fills these blocks, vLLM dynamically allocates new physical blocks on demand, mapping them to the request's logical sequence via a block table. This eliminates internal fragmentation completely, as memory is only consumed when actually needed. Furthermore, PagedAttention enables efficient memory sharing. If multiple requests share the same prompt (e.g., in parallel sampling or beam search), they can share the same physical blocks for the prompt's KV Cache, only diverging and allocating new blocks when their generated outputs differ. This drastically increases the maximum batch size and throughput of LLM serving systems.

While PagedAttention optimizes memory allocation within a single batch, it does not address the redundancy across different requests over time. In many applications, such as multi-turn chatbots or coding assistants, users frequently send requests that share a large common prefix (e.g., a long system prompt, a retrieved document, or the previous conversation history). Recomputing the KV Cache for this shared prefix on every new request is highly inefficient.

To address this, researchers introduced RadixAttention, implemented in the SGLang framework. RadixAttention extends the concept of KV Cache sharing across the entire lifetime of the inference server. It maintains a global Radix Tree (a space-optimized trie data structure) in the GPU memory to manage the KV Cache of all processed requests. Each node in the tree represents a sequence of tokens, and the path from the root to a node corresponds to a specific prefix.

When a new request arrives, the server searches the Radix Tree for the longest matching prefix. If a match is found (e.g., the system prompt and the first three turns of a conversation), the server immediately reuses the cached KV tensors for that prefix, skipping the expensive prefill computation entirely. It only computes the attention for the new tokens appended to the prefix. When the GPU memory fills up, RadixAttention employs an LRU (Least Recently Used) eviction policy to remove the least frequently accessed nodes from the tree. By treating the GPU memory as a persistent, prefix-aware cache, RadixAttention significantly reduces latency and increases throughput for applications with high prompt overlap, representing the next frontier in LLM inference optimization.

References:

vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention - https://arxiv.org/abs/2309.06180
SGLang: Efficient Execution of Structured Language Model Programs - https://arxiv.org/abs/2312.07104