There are two different problems in large language models. The first is making them smarter. The second is making them fast enough to actually use. Most of the news covers the first problem. The second one has been getting solved just as aggressively, with much less fanfare.

In the last twelve months, the cost of running a model at a given level of capability has fallen dramatically. Not because of new chips. Because of better algorithms.

This is the story of where those improvements came from.

The Bottleneck Nobody Talks About

When a language model generates text, it does not re-read the entire conversation from scratch for every new token. It caches the results of previous computation in something called the KV cache, short for key-value cache.

The idea is simple. During attention, every token has a key vector and a value vector. When the model generates the 500th token in a response, it needs to compare the current query against all 499 previous keys and then pull the corresponding values. Those keys and values would be expensive to recompute every step, so the model stores them in GPU memory and reads them back.
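
As a rough illustration of the mechanism just described, here is a minimal single-head sketch in plain Python. The `KVCache` class and `attend` function are hypothetical names made up for this example, and a real runtime would do this per layer and per head on the GPU.

```python
import math

def attend(query, keys, values):
    """Single-head attention for one query over the cached keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

class KVCache:
    """Append-only store of the key and value vectors seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Store this token's key and value once, then attend over everything cached.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

cache = KVCache()
first = cache.step([1.0, 0.0], [1.0, 0.0], [2.0, 3.0])
second = cache.step([0.0, 0.0], [0.0, 1.0], [4.0, 5.0])
print(first, second)
```

Each call to `step` adds one key and one value and never recomputes the old ones, which is exactly why the cache keeps growing with every generated token.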

That cache grows linearly with context length. A 7 billion parameter model with a 128k context window can easily need 16 to 32 gigabytes just for the cache. For reference, a consumer GPU with 24GB of VRAM would have almost nothing left for the actual model weights.
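
The arithmetic behind numbers like these is easy to check. The sketch below assumes a hypothetical 7B-class configuration (32 layers, 8 or 16 key-value heads of width 128, a 16-bit cache); real models differ in these details, so treat the output as an order-of-magnitude estimate.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value=16):
    # Two tensors (K and V) per layer, one vector per KV head per token.
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return values_per_token * n_tokens * bits_per_value // 8

GIB = 1024 ** 3
context = 128 * 1024  # a 128k-token context window

# Hypothetical configs: same model shape, 8 versus 16 KV heads.
small = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=context)
large = kv_cache_bytes(n_layers=32, n_kv_heads=16, head_dim=128, n_tokens=context)
print(small / GIB, large / GIB)  # 16.0 32.0
```

Under these assumptions the cache costs 128 KiB to 256 KiB per token, and the total scales in a straight line with the number of tokens kept.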

This is why longer contexts, faster responses, and larger models running locally have all historically been constrained not by compute but by memory bandwidth and VRAM capacity. The KV cache is the wall.

Compressing the Cache: From Simple Quantization to Rotation Tricks

The obvious first attempt at fixing this was quantization. Instead of storing each KV value as a 16-bit float, store it in 4 bits or even 2 bits. The math works out to a 4x or 8x reduction in cache size. Simple.

The problem is that the KV cache does not compress well with standard quantization. A typical approach like rounding values to the nearest 4-bit integer works great for model weights, where the distribution of values is predictable and well-behaved. KV activations are spikier. Outliers in a few dimensions can dominate the rounding error and degrade output quality noticeably.
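
To see why a single outlier hurts so much, here is a small sketch of plain round-to-nearest quantization with one scale per vector. The data is made up for illustration, and the function names are hypothetical rather than taken from any library.

```python
def quantize_absmax(vec, bits):
    """Symmetric round-to-nearest quantization with one scale per vector."""
    levels = 2 ** (bits - 1) - 1          # 7 for 4-bit signed codes
    scale = max(abs(x) for x in vec) / levels or 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def error_on_small_values(vec, bits=4):
    # Measure the damage on everything except the last element.
    codes, scale = quantize_absmax(vec, bits)
    return mse(vec[:-1], dequantize(codes, scale)[:-1])

smooth = [0.1 * i for i in range(-8, 8)]   # well-behaved values
spiky = smooth[:-1] + [20.0]               # same values, one outlier

smooth_err = error_on_small_values(smooth)
spiky_err = error_on_small_values(spiky)
print(smooth_err, spiky_err)
```

In the spiky case the scale is set by the outlier, so every other value lands on code 0 and is lost entirely.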

The research community spent most of 2024 iterating on this. Papers like KIVI showed that quantizing keys per channel and values per token (rather than treating them identically) already improved things considerably. At 4-bit you could reduce cache memory by roughly 2.5x with minimal perplexity loss. At 2-bit things got messier but were increasingly viable for some use cases.
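
The per-channel versus per-token distinction can be sketched in a few lines. This is a toy in the spirit of that idea, not the KIVI implementation: it assumes min-max quantization, uses synthetic keys where one channel carries a large constant offset, and ignores grouping and any full-precision residual window.

```python
import random

def fake_quant(vec, bits=4):
    """Min-max quantize then dequantize, one scale and offset per vector."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0
    return [lo + round((x - lo) / scale) * scale for x in vec]

def quant_per_token(matrix, bits=4):
    # One scale per row, i.e. per token.
    return [fake_quant(row, bits) for row in matrix]

def quant_per_channel(matrix, bits=4):
    # One scale per column: transpose, quantize, transpose back.
    cols = [fake_quant(list(col), bits) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]

def mse(a, b):
    count = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / count

rng = random.Random(0)
# Synthetic keys: 64 tokens by 8 channels, with channel 0 as a persistent outlier.
keys = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(64)]
for row in keys:
    row[0] += 20.0

per_token_err = mse(keys, quant_per_token(keys))
per_channel_err = mse(keys, quant_per_channel(keys))
print(per_token_err, per_channel_err)
```

When the outlier lives in a fixed channel, giving each channel its own scale keeps that outlier from stretching the range of every token it appears in.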

Then Google published something more interesting.

Rotate First, Then Quantize

The key insight behind TurboQuant, and the rotation-based approaches that followed it, is that the outlier problem is not inherent to the data. It is a consequence of coordinate choice.

When you store a KV vector in standard basis coordinates, some dimensions carry most of the energy and some carry almost none. A low-bit quantizer has to stretch its handful of levels to cover the largest values, so the step size gets coarse and everything small is rounded badly.

The fix: apply a random rotation matrix to the vector before quantizing it. A rotation does not change the information content of a vector. It just redistributes the energy across all dimensions. After rotation, the values are spread more evenly, and a simple low-bit quantizer can do a much better job.

TurboQuant uses a Walsh-Hadamard Transform to do this rotation. It is computationally cheap, data-free (you do not need calibration data to pick a good rotation), and it works well enough that 3-bit cache storage becomes practical without obvious quality degradation.
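
Here is a minimal sketch of the rotate-then-quantize idea. It assumes the rotation is built from random sign flips followed by a normalized fast Walsh-Hadamard transform, which is a common way to get a cheap random rotation, and it pairs that with a plain 4-bit absmax quantizer. It is not TurboQuant's actual quantizer or kernel, and the vector is synthetic.

```python
import math
import random

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]

def rotate(vec, signs):
    # Random sign flips followed by the transform act as a cheap random rotation.
    return fwht([s * x for s, x in zip(signs, vec)])

def unrotate(vec, signs):
    # The normalized transform is its own inverse; then undo the sign flips.
    return [s * x for s, x in zip(signs, fwht(vec))]

def fake_quant(vec, bits):
    """Quantize then dequantize with a single absmax scale."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in vec) / levels or 1.0
    return [round(x / scale) * scale for x in vec]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
dim = 128
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
vec[5] = 50.0  # one outlier dimension

plain_err = mse(vec, fake_quant(vec, bits=4))
rotated = rotate(vec, signs)
rotated_err = mse(vec, unrotate(fake_quant(rotated, bits=4), signs))
print(max(abs(x) for x in vec), max(abs(x) for x in rotated))
print(plain_err, rotated_err)
```

The rotation needs no calibration data: the only state is the list of signs, which can be regenerated from a seed. Because the transform preserves length, any error introduced in rotated space is the same size once you rotate back.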

The community immediately started hacking on this. The bottleneck with TurboQuant in local runtimes turned out to be the dequantization cost. The Walsh-Hadamard Transform requires operating on groups of 128 elements together, which does not map cleanly onto modern GPU hardware. The math is correct but the kernel is slow.

Two simpler alternatives emerged from the open-source community:

PlanarQuant applies 2D rotations using Givens matrices. Instead of rotating a 128-dimensional vector as a whole, it pairs up dimensions and rotates each pair independently in a 2D plane. The rotation is still data-oblivious and still breaks up the outlier structure. The dequantization kernel becomes much simpler.
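
A pairwise rotation of this kind is only a few lines. The sketch below assumes the angles are fixed ahead of time (for example drawn once from a seed); how PlanarQuant actually picks them is not covered here, and the function names are hypothetical.

```python
import math

def rotate_pairs(vec, angles):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [c * x - s * y, s * x + c * y]
    return out

def unrotate_pairs(vec, angles):
    # The inverse of a rotation by theta is a rotation by -theta.
    return rotate_pairs(vec, [-t for t in angles])

vec = [3.0, 0.0, 0.0, 4.0]
angles = [math.pi / 4, math.pi / 4]
rotated = rotate_pairs(vec, angles)
restored = unrotate_pairs(rotated, angles)
print(rotated)
```

Each pair costs four multiplies and two adds and touches no other element, which is what makes the dequantization kernel so much simpler than a 128-wide transform.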

IsoQuant uses quaternion math to operate on groups of four dimensions at a time. Quaternion products are a staple of real-time 3D graphics, where they represent rotations, and each one is a small fixed pattern of multiply-adds, so the GPU can handle them efficiently.
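
As a sketch of what rotating four values at a time can look like, the code below treats each group of four as a quaternion and left-multiplies it by a fixed unit quaternion, which preserves the group's length. This is an assumption about the general shape of the technique, not IsoQuant's actual kernel, and the chosen quaternion is arbitrary.

```python
def quat_mul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return [
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    ]

def rotate_groups(vec, q):
    """Left-multiply every consecutive group of four values by unit quaternion q."""
    out = []
    for i in range(0, len(vec), 4):
        out += quat_mul(q, vec[i:i + 4])
    return out

def unrotate_groups(vec, q):
    # For a unit quaternion the inverse is the conjugate.
    w, x, y, z = q
    return rotate_groups(vec, (w, -x, -y, -z))

q = (0.5, 0.5, 0.5, 0.5)                      # unit length: 4 * 0.25 = 1
vec = [8.0, 0.0, 0.0, 0.0, 0.0, -6.0, 0.0, 0.0]
rotated = rotate_groups(vec, q)
restored = unrotate_groups(rotated, q)
print(rotated)  # [4.0, 4.0, 4.0, 4.0, 3.0, -3.0, -3.0, 3.0]
```

In this example a lone spike of 8 becomes four values of 4, so the largest magnitude the quantizer has to cover is halved while the information is fully recoverable.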

Both approaches trade a small amount of the theoretical compression quality of TurboQuant for a significant speedup in dequantization. Community benchmarks have shown that the decode speed after switching from TurboQuant to IsoQuant or PlanarQuant can improve by 9 to 30 times in local runtimes, while the VRAM savings remain essentially identical.

This work has been actively pushed into llama.cpp as a community effort, with multiple contributors building the CUDA, Metal, and CPU implementations alongside each other. Running something like Gemma 4 27B at 256k context on a single RTX 4090, which was impossible a year ago, is now a real use case.

The Asymmetry Nobody Expects

One detail that comes out of every serious benchmarking effort on KV compression is that keys and values are not equal.

Keys control attention patterns. They determine which parts of the context the model pays attention to. If you degrade key quality too aggressively, the model starts attending to the wrong tokens and the outputs drift in subtle but cumulative ways.

Values carry the actual content. They are the information that gets mixed into the output. Values are more forgiving of compression errors, because the attention weights that multiply them act as a natural smoothing function.

The practical outcome is that the best setups use asymmetric compression, with the keys given at least as many bits as the values. Keys at 4 bits and values at 3 bits, for example, or keys a step higher still for maximum quality preservation. This asymmetry is why tools like llama.cpp expose separate flags for the K and V cache types (--cache-type-k and --cache-type-v) rather than a single quality dial.
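
A deliberately contrived toy makes the asymmetry concrete. It applies the same absolute error once to the keys and once to the values of a two-token cache, with the key error pointed in the worst direction. The numbers are invented for illustration and say nothing about how often this happens in a real model.

```python
import math

def attend(query, keys, values):
    # Unscaled dot-product attention for a single query.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [20.0, 0.0]
keys = [[1.0, 0.0], [0.8, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
exact = attend(query, keys, values)

err = 0.2  # same absolute error, applied to keys or to values
noisy_keys = [[1.0 - err, 0.0], [0.8 + err, 0.0]]
noisy_values = [[1.0 - err, err], [err, 1.0 - err]]

key_damage = l2(exact, attend(query, noisy_keys, values))
value_damage = l2(exact, attend(query, keys, noisy_values))
print(key_damage, value_damage)
```

The key error is multiplied by the query and then pushed through the softmax, so it can flip which token wins. The value error is only ever averaged by the attention weights, so the output can never move further than the values themselves did.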

A Different Angle: Train to Generate Multiple Tokens at Once

While the memory compression work was happening, a separate line of research tackled inference from the other direction.

Standard language model training teaches the model to predict exactly one next token, over and over. The model produces a probability distribution, you sample from it, that token gets added to the context, and the process repeats. One token per forward pass. One forward pass at a time.

Meta's research group published a paper called "Better & Faster Large Language Models via Multi-token Prediction" that asked a direct question: what happens if you train the model to predict the next four tokens simultaneously?

The training architecture adds independent prediction heads on top of the shared model trunk. During training, each of the four heads has to get a different future token right: the first predicts the next token, the second the one after it, and so on. The heads do not influence each other's predictions, but they all share the same representation. Being forced to plan slightly ahead during training changes what the model learns about language structure.
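
The training objective can be sketched as a sum of ordinary cross-entropy losses, one per head. The version below is a simplification: it assumes each head is a single linear layer over the trunk's output, which real heads need not be, and all names and shapes are hypothetical.

```python
import math

def log_softmax(logits):
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def linear(weights, hidden):
    # weights holds one row of len(hidden) per vocabulary entry.
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def multi_token_loss(hidden, heads, future_tokens):
    """Sum of cross-entropies: head i is scored on the token i + 1 steps ahead."""
    loss = 0.0
    for head, target in zip(heads, future_tokens):
        loss -= log_softmax(linear(head, hidden))[target]
    return loss

hidden = [1.0, 0.0]                      # shared trunk output at one position
untrained = [[0.0, 0.0], [0.0, 0.0]]     # a head that predicts uniformly
confident = [[5.0, 0.0], [0.0, 0.0]]     # a head that favours token 0

future = [0, 1, 0, 1]                    # the next four tokens (vocabulary of 2)
uniform_loss = multi_token_loss(hidden, [untrained] * 4, future)
better_loss = multi_token_loss(hidden, [confident] + [untrained] * 3, future)
print(uniform_loss, better_loss)
```

All four heads read the same `hidden` vector, so the gradient from every future token flows back into the one shared trunk.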

At inference time, multi-token prediction enables speculative decoding in a natural way. The model generates a draft of several likely next tokens in one forward pass, and a verifier confirms or corrects them. When the draft is mostly right, which it often is for predictable continuations, you get 2 to 3 tokens for roughly the cost of one. Meta's paper reported up to a 3x inference speedup for models trained this way, particularly on code generation tasks where sequences are more locally predictable.
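
The confirm-or-correct step can be sketched as a greedy verification loop. Here `next_token` is a hypothetical stand-in for the verifier's greedy prediction. The sketch calls it once per position for readability, whereas a real system would score every draft position in a single batched forward pass.

```python
def verify_draft(context, draft, next_token):
    """Greedy verification: keep draft tokens while the verifier agrees.

    Returns the tokens to append, which is always at least one token.
    """
    accepted = []
    for proposed in draft:
        expected = next_token(context + accepted)
        if proposed != expected:
            accepted.append(expected)   # correct the first mismatch and stop
            return accepted
        accepted.append(proposed)
    # Every draft token matched, so the verifier contributes one extra token.
    accepted.append(next_token(context + accepted))
    return accepted

def toy_verifier(tokens):
    # Stand-in model: the "right" next token is the sequence length mod 5.
    return len(tokens) % 5

context = [0, 1, 2]
all_good = verify_draft(context, [3, 4, 0], toy_verifier)
one_wrong = verify_draft(context, [3, 9, 0], toy_verifier)
print(all_good, one_wrong)  # [3, 4, 0, 1] [3, 4]
```

The output is always identical to what the verifier would have produced on its own. The draft only changes how many tokens come out of each verification pass.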

What makes this significant beyond the numbers is that multi-token prediction did not stay a research curiosity. DeepSeek integrated it into their training pipeline. It showed up in the nanogpt speedrun as record 53. Llama 4 uses it. The technique crossed from paper to production across multiple major releases within about a year.

What This Means in Practice

A year ago, running a 70B model locally required either a multi-GPU setup worth several thousand euros or accepting severely quantized model weights that compromised quality noticeably. The KV cache alone for a 70B model at a modest context length could exhaust all available VRAM.

Today, with 3-bit to 4-bit rotation-based KV quantization and efficient asymmetric compression, the same GPU can hold a longer context with less quality loss. Combined with multi-token prediction cutting the number of forward passes needed per output, the practical throughput for running models locally has roughly doubled or more compared to the same hardware twelve months ago.

This is not being driven by a single lab or a single paper. It is a rolling accumulation of techniques, each one an incremental improvement. Some come from a community member on Reddit benchmarking KV cache flags, some from a Meta research paper, and some from engineers at companies you have never heard of working through the llama.cpp issue tracker.

The training side of AI gets measured in benchmark points and capability jumps. The inference side gets measured in tokens per second and gigabytes of VRAM saved. It is less photogenic, but for the people actually running these models day to day, those are the numbers that matter.

