Long Context Inference and Key-Value Caching: What It Is, and Why It Matters

Abstract: Why are modern large language models (LLMs) so expensive to run? We argue that a major reason is the rapid growth of context lengths, enabling use cases such as code synthesis from an entire repository, chain-of-thought reasoning, agentic workflows with many tools, and chat conversations with many turns. The main bottleneck for long context inference is the key-value (KV) cache, which in principle grows linearly with context length, embedding dimension, and number of layers. We review major directions in long context inference and KV caching, and argue that selective KV caching (also known as sparse attention) in particular is a key direction for decision-making research aiming to make LLM inference more affordable.
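
To make the scaling concrete, the following minimal Python sketch computes the KV cache size for an illustrative decoder-only model; the function name and the roughly Llama-2-7B-like dimensions are assumptions chosen for illustration, not figures from the talk.

    # Minimal sketch: KV cache memory grows linearly with context length,
    # number of layers, and per-layer key/value dimensions.

    def kv_cache_bytes(context_len, num_layers, num_kv_heads, head_dim,
                       bytes_per_elem=2, batch_size=1):
        """Total KV cache size in bytes.

        Keys and values are each of shape
        [batch, context_len, num_kv_heads, head_dim] per layer,
        hence the leading factor of 2.
        """
        return (2 * batch_size * context_len * num_layers
                * num_kv_heads * head_dim * bytes_per_elem)

    if __name__ == "__main__":
        # Illustrative, roughly Llama-2-7B-like dimensions:
        # 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes/element).
        for ctx in (4_096, 32_768, 131_072):
            gib = kv_cache_bytes(ctx, num_layers=32,
                                 num_kv_heads=32, head_dim=128) / 2**30
            print(f"context {ctx:>7,} tokens -> KV cache ~{gib:6.1f} GiB")

Under these assumptions the cache already reaches about 2 GiB at a 4K context and about 64 GiB at 128K for a single sequence, which is why selective KV caching matters.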

Bio: Matthias W. Seeger received a Ph.D. from Edinburgh in 2003, did postdocs at Berkeley (with Michael Jordan) and the MPI Tübingen (with Bernhard Schölkopf), led a research group in Saarbrücken, and was an assistant professor at EPF Lausanne. He joined Amazon in 2014, where he is currently a principal applied scientist. He received the ICML Test of Time Award in 2020.