What is a context window?


A context window is the maximum amount of text a language model can consider at once, including both the input you provide and the output it generates. Think of it as the model's working memory. Everything the model can "see" during a single request must fit within its context window.

What It Means and Why It Matters

────────────────────────────────────────

When you send a prompt to an AI model, the model does not have access to the internet, your file system, or any external data source in real time. It can only work with the text inside the context window. If information is not in the context, the model does not know about it (unless it was part of the training data).

This has practical implications. If you are building a chatbot, the entire conversation history must fit in the context window. If you are asking the model to analyze a document, that document must fit. If you are doing retrieval-augmented generation, the retrieved chunks plus your prompt plus the expected response all need to fit within the limit.

How Context Windows Are Measured

────────────────────────────────────────

Context windows are measured in [tokens], not words or characters. A token is roughly 4 characters or 0.75 words in English, but this varies by language and tokenizer. Code tends to use more tokens per line than prose. Non-English languages often use more tokens per word.

When a provider says their model has a 128K context window, they mean 128,000 tokens total, shared between your input and the model's output. If you send 100,000 tokens of input, the model has 28,000 tokens remaining for its response.
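A back-of-the-envelope sketch of this budgeting, using the rough 4-characters-per-token heuristic above. Real counts come from your provider's tokenizer, so treat these numbers as estimates:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token heuristic.

    Actual counts vary by tokenizer and language; use your provider's
    tokenizer for exact numbers.
    """
    return max(1, len(text) // 4)


def remaining_output_budget(prompt: str, context_window: int = 128_000) -> int:
    """Tokens left for the model's response after the input is counted."""
    return context_window - estimate_tokens(prompt)


prompt = "Summarize the attached report. " * 100
print(remaining_output_budget(prompt))
```

The 128,000-token default here matches the GPT-4 Turbo figure discussed below; swap in the window size of whatever model you actually call.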

Current Context Window Sizes

────────────────────────────────────────

Context windows have grown dramatically and vary significantly across providers:

[Anthropic Claude]: Claude 3.5 Sonnet and Claude 3 Opus support 200,000 tokens. This is enough for roughly 150,000 words, or a 500-page book.

[Google Gemini]: Gemini 1.5 Pro and Gemini 1.5 Flash both support up to 1 million tokens, and Google has demonstrated research prototypes handling up to 10 million. This is the largest generally available context window.

[OpenAI GPT-4]: GPT-4 Turbo and GPT-4o support 128,000 tokens. Earlier GPT-4 versions had 8,000 or 32,000 tokens.

[Mistral]: Mistral Large supports up to 128,000 tokens. Their smaller models vary between 32,000 and 128,000 tokens.

[Meta Llama]: Llama 3 models support up to 128,000 tokens in their extended context versions.

[Cohere Command]: Command R+ supports 128,000 tokens, with strong performance on long-context tasks.

What Happens When You Exceed the Limit

────────────────────────────────────────

If your input exceeds the context window, different things happen depending on the provider:

  • Most APIs return an error telling you the input is too long
  • Some providers silently truncate the input, dropping tokens from the beginning or end
  • In chat interfaces, older messages may be automatically dropped from the conversation

None of these outcomes are good. Errors break your application flow. Truncation can lose critical context. The best approach is to monitor your token usage and manage it proactively.
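One way to manage usage proactively is to trim the conversation yourself before sending the request, so that you, not the API, decide what gets dropped. A minimal sketch, reusing the rough characters-per-token estimate (`estimate` stands in for a real tokenizer):

```python
def trim_history(messages, budget, estimate=lambda t: max(1, len(t) // 4)):
    """Drop the oldest messages until the history fits the token budget.

    messages: list of strings, oldest first. Returns the trimmed list.
    """
    kept = list(messages)
    while kept and sum(estimate(m) for m in kept) > budget:
        kept.pop(0)  # discard the oldest message first
    return kept
```

Dropping from the front preserves the most recent turns, which usually matter most; a fancier version would summarize the dropped messages instead of discarding them outright.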

Context Window vs Long-Term Memory

────────────────────────────────────────

The context window is not memory in the way humans think about it. It resets with every API call. The model does not remember your previous conversations unless you explicitly include them in the context.

Some providers and frameworks are building [persistent memory] systems that store information across conversations. These typically work by summarizing past interactions and injecting relevant summaries into the context window, or by maintaining a vector database of past conversations and retrieving relevant snippets.

The distinction matters: the context window is the model's working memory for a single request. Long-term memory requires external systems.
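A minimal sketch of the summarize-and-inject pattern described above. Here `summarize` is a placeholder (real systems typically make another model call to compress the history), and a plain dict stands in for durable storage:

```python
# In-memory stand-in for a persistent store keyed by user.
memory_store: dict[str, str] = {}


def summarize(conversation: list[str]) -> str:
    # Placeholder: a real system would call a model to compress this.
    return " / ".join(conversation)[-500:]


def end_session(user_id: str, conversation: list[str]) -> None:
    """Persist a compressed record of the finished conversation."""
    memory_store[user_id] = summarize(conversation)


def build_prompt(user_id: str, new_message: str) -> str:
    """Inject any stored memory into the context for a new request."""
    memory = memory_store.get(user_id, "")
    prefix = f"Relevant history: {memory}\n\n" if memory else ""
    return prefix + new_message
```

The key point survives even in this toy version: the model itself remembers nothing between calls; continuity comes entirely from what you re-inject into the context window.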

Strategies for Managing Context

────────────────────────────────────────

When your data exceeds the context window, or when you want to use context efficiently, several strategies help:

[Retrieval-augmented generation (RAG)]: Instead of stuffing everything into the context, embed your documents in a vector database and retrieve only the relevant chunks for each query. This lets you work with unlimited data while using a fraction of the context window.

[Summarization]: For long conversations, periodically summarize older messages and replace the full history with the summary. This compresses the context while retaining the key information.

[Chunking]: Break large documents into smaller pieces and process them independently. Then combine the results. This works well for tasks like extracting data from each section of a long report.

[Sliding window]: Process a document in overlapping chunks, moving through it like a sliding window. Each chunk has some overlap with the previous one to maintain continuity.

[Hierarchical processing]: Summarize sections individually, then summarize the summaries. This lets you distill a book-length document into something that fits in a single context window.

[Prioritization]: Put the most important information at the beginning and end of the context. Research has shown that models pay more attention to information at these positions, a phenomenon sometimes called "lost in the middle."
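The chunking and sliding-window strategies above can be sketched in a few lines. Character-based splitting and the specific sizes are illustrative choices; production systems usually split on token or sentence boundaries instead:

```python
def sliding_chunks(text: str, chunk_size: int = 2000, overlap: int = 200):
    """Split text into overlapping chunks of at most chunk_size characters.

    Each chunk shares `overlap` characters with the previous one to
    maintain continuity across chunk boundaries.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Setting `overlap=0` gives plain chunking; feeding each chunk's summary back through the same pipeline gives the hierarchical approach.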

Cost Implications

────────────────────────────────────────

Larger context windows are more expensive. Providers charge per token for both input and output. Sending 100,000 tokens of context costs significantly more than sending 1,000 tokens, even if the question is the same.
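To make the arithmetic concrete, a sketch with hypothetical prices. Real per-token rates vary by provider and model, so check current pricing pages before relying on these numbers:

```python
# Hypothetical per-million-token prices, for illustration only.
PRICE_PER_M_INPUT = 3.00    # dollars per million input tokens
PRICE_PER_M_OUTPUT = 15.00  # dollars per million output tokens


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request under the example prices above."""
    return (input_tokens / 1_000_000 * PRICE_PER_M_INPUT
            + output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT)


# The input portion of the bill is 100x larger when you send 100,000
# tokens of context instead of 1,000 for the same 500-token answer.
print(request_cost(100_000, 500), request_cost(1_000, 500))
```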

There is also a latency impact. Longer contexts take longer to process. The model's attention mechanism scales quadratically with context length in traditional transformer architectures, though providers are implementing optimizations to reduce this.

This creates a practical tradeoff: you could dump everything into a 1-million-token context window, but it is often cheaper, faster, and more effective to use RAG to select the relevant 5,000 tokens.

The Trend Toward Longer Contexts

────────────────────────────────────────

Context windows have grown exponentially. GPT-3 launched with roughly 2,000 tokens, and GPT-3.5 offered 4,096. GPT-4 Turbo jumped to 128,000. Gemini reached 1 million. Providers are clearly pushing toward longer contexts.

This trend is driven by demand. Developers want to analyze entire codebases, process complete legal documents, and maintain long conversation histories. Longer contexts make these use cases possible without complex chunking strategies.

However, longer context windows do not eliminate the need for RAG and other context management techniques. Even with a million-token window, you still face cost, latency, and attention quality tradeoffs. The most effective approach often combines a generous context window with smart retrieval to get the right information in front of the model.

Understanding context windows is fundamental to building effective AI applications. It affects your architecture, your costs, your user experience, and the quality of your model's responses.
