Version: Next

Context Window

The context window is the total amount of text (measured in tokens) that a language model can process in a single request. It includes everything: your input, the model's previous responses, system instructions, and any other context. Think of it as the model's working memory, everything it can "see" when generating a response.

How a single request works

When you send a message to a model, your input is tokenized (broken into small chunks called tokens) and placed into the context window. The model processes these tokens and generates a response, which also consumes tokens from the context window.

A single input-output exchange

In this example, the input "some text" takes up 2 tokens and the model produces "some response" using another 2 tokens. The context window (shown on the right) tracks what the model has seen so far.

Why previous messages stay in the context window

Language models are stateless so they don't remember previous conversations on their own. To maintain a coherent multi-turn conversation, every previous message (both your inputs and the model's responses) must be sent back as part of the next request. This means the context window fills up with each exchange.

A multi-turn conversation showing accumulated tokens

Here, the second request includes the original input and response plus the new follow-up message. The context window now holds 11 tokens (2 + 2 + 3 + 4). As the conversation grows longer, more of the context window is consumed.

This is why long conversations can eventually hit the context limit, meaning the model runs out of space to hold previous exchanges plus a new response.

What happens as the context window fills up

As your conversation progresses and the context window approaches its limit, several factors change that can impact the quality and reliability of the model's responses. The following plot provides the claimed versus the effective context length of some models. As you can see, the effective context length is much smaller than the claimed one in all models studied.

claimed vs actual

Source: A Comprehensive Survey on Long Context Language Modeling(section 7.1.1 Table 9)

Context window in practice

Tools like Claude Code show you exactly how the context window is being used. The breakdown typically includes:

System prompt - instructions that define the model's behavior
System tools - tool definitions available to the model
Messages - the actual conversation history (inputs + outputs)
Free space - remaining capacity for new messages and responses

Claude Code context

Source: Output of /context command in Claude Code v2.1.52

When the context window fills up, the system must either compress older messages, drop them, or start a new conversation. Understanding this helps you write more efficient prompts and manage longer coding sessions.

Autocompaction

When the context window is nearly full, tools like Claude Code may autocompact the conversation (automatically summarizing or dropping older messages to free up space). This keeps the conversation going, but comes with trade-offs:

Lost detail - specific code snippets, file paths, or exact instructions from earlier in the conversation may be compressed into vague summaries
Broken references - if you refer back to something discussed earlier, the model may no longer have access to the original content
Repeated work - the model might re-read files or re-run commands it already performed, because the earlier results were compacted away

You can usually tell autocompaction has occurred when the model seems to "forget" something you discussed a few messages ago, or when it starts asking questions you've already answered.

Information ambiguity

A full context window means the model is juggling a large amount of information at once. This can lead to:

Conflicting instructions - earlier guidance may contradict later corrections, and the model has to choose which to follow. With more context, there are more opportunities for ambiguity
Diluted focus - with long contexts, important instructions must compete with lots of other content. This can make key requirements less salient unless they are repeated clearly
Stale context - code you discussed at the start of the session may have since been modified. The model might reference outdated versions of files that are still sitting in the context window

Degraded performance

Large contexts don't just risk information loss and ambiguity, they can reduce overall response quality:

Slower responses - processing more tokens takes more time and costs more per request
Lower accuracy - models perform worse on tasks that require finding specific information buried in long contexts (the "needle in a haystack" problem). Research such as Lost in the Middle (Liu et al.) and Context Rot (Chroma Research) demonstrates that performance degrades as context length increases, with information in the middle of long contexts being particularly hard for models to use
Reduced instruction following - with thousands of tokens of conversation history, the model is more likely to drift from your original intent or constraints

Practical tips

Start fresh sessions for new tasks rather than continuing a long conversation
Be explicit when referencing earlier context - re-state key details rather than saying "like I mentioned before"
Front-load important instructions - system prompts and key constraints should be at the start, where they're less likely to be compacted
Watch the context indicator - when you're past 50-60% usage, consider whether continuing the session or starting a new one makes more sense

When this guidance may not apply

If your task requires long, immutable source material (for example policy bundles), starting a new session can lose useful continuity
Some tools use retrieval and persistent memory features that reduce, but do not eliminate, context pressure
For short one-step edits, context optimization has minimal practical impact

Additional sources: Anthropic Messages API (stateless multi-turn behavior), OpenAI conversation state guide, Claude Code /compact behavior, Claude Code costs and context management

Context window limit for popular models

Different models offer different context window sizes and pricing. Larger context windows allow for longer conversations and more code to be analyzed at once, but typically cost more per million token.

Model	Context	Max Output	Input ($/M)	Output ($/M)	Reasoning	Tool Use
Claude Opus 4.6	200K	128K	$5.00	$25.00	Yes	Yes
Claude Sonnet 4.6	200K	64K	$3.00	$15.00	Yes	Yes
Claude Haiku 4.5 (latest)	200K	64K	$1.00	$5.00	Yes	Yes
GPT-5.2	400K	128K	$1.75	$14.00	Yes	Yes
GPT-5 Mini	400K	128K	$0.25	$2.00	Yes	Yes
o4-mini	200K	100K	$1.10	$4.40	Yes	Yes
Gemini 2.5 Pro	1.0M	66K	$1.25	$10.00	Yes	Yes
Gemini 2.5 Flash	1.0M	66K	$0.30	$2.50	Yes	Yes
Gemini 2.5 Flash Lite	1.0M	66K	$0.10	$0.40	Yes	Yes

Source: models.dev | Data fetched: 3/4/2026, 10:38:54 PM

How a single request works​

Why previous messages stay in the context window​

What happens as the context window fills up​

Context window in practice​

Autocompaction​

Information ambiguity​

Degraded performance​

Practical tips​

When this guidance may not apply​

Context window limit for popular models​