02/17/2026 | Press release | Archived content
A practical framework for context engineering in LLM systems
Most teams building with large language models share the same instinct: If the model makes mistakes, add more context.
Early on, this works. Accuracy improves. Hallucinations drop. Demos get better.
Then something strange happens. Latency spikes. Outputs become inconsistent. Debugging turns into guesswork. And despite adding more context, the model performs worse.
This post introduces a practical framework for understanding why this happens, and provides concrete guidelines for context engineering based on conversations with practitioners building real-world LLM systems.
In most AI systems, context is treated as an unpriced resource. System prompts, conversation history, retrieved documents, user memory, and tool outputs are all stuffed into a single context window on the assumption that the model will sort it out.
In reality, context has real costs:
The most reliable AI systems do not maximize context. They optimize it.
Context is everything the model sees before it produces an output, including:
The most common mistake teams make is treating all of this as interchangeable text. It is not.
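The C³ framework classifies context based on two properties: how often it changes and how much authority it carries. This yields three tiers: cold, warm, and hot.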
Cold context: stable, rarely changing, highest authority
Examples include:
Cold context defines the rules of the system. If it is wrong, the failure can be catastrophic.
Best Practices:
Anti-pattern: pasting an entire policy document into the system prompt.
Warm context: slowly changing, moderately authoritative
Examples include:
Warm context provides continuity, but it comes with risks:
Best Practices:
Anti-pattern: appending full conversation history indefinitely.
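One alternative to unbounded history is to keep only the most recent turns that fit a fixed budget. The sketch below is a minimal illustration, not the framework's prescribed method: `trim_history`, `estimate_tokens`, and the one-token-per-word estimate are all hypothetical stand-ins for whatever tokenizer and policy a real system uses.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    # A real system would use the model's own tokenizer.
    return len(text.split())


def trim_history(turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns whose combined estimated size fits the budget."""
    kept: list[str] = []
    used = 0
    # Walk backwards so the newest turns are considered first.
    for turn in reversed(turns):
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Dropping older turns outright is the simplest policy. Summarizing them instead preserves more continuity, at the price of an extra model call and a new place for errors to creep in.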
Hot context: ephemeral, volatile, lowest authority
Examples include:
Hot context changes on every request and has the highest entropy. It is also the most likely to be wrong.
Best Practices:
Anti-pattern: dumping every tool response back into the prompt.
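A hedged sketch of the opposite habit: filter a tool response down to the fields the model needs before it enters the prompt. The function name `compact_tool_output`, the field whitelist, and the 500-character cap are illustrative assumptions. The sketch also assumes tool responses are JSON objects, which will not hold for every tool.

```python
import json


def compact_tool_output(raw: str, keep_fields: list[str], max_chars: int = 500) -> str:
    """Keep only whitelisted top-level fields of a JSON tool response, then cap the length."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Not JSON: fall back to a plain length cap.
        return raw[:max_chars]
    if not isinstance(data, dict):
        return raw[:max_chars]
    slim = {k: data[k] for k in keep_fields if k in data}
    # Note: the final slice may cut the JSON mid-value if the result is still too long.
    return json.dumps(slim, sort_keys=True)[:max_chars]
```

For example, a weather tool that returns status, temperature, and a large debug payload could be reduced to the first two fields by passing `keep_fields=["status", "temp_c"]` (hypothetical field names).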
The key insight behind C³ is that context value grows sub-linearly, while context cost grows linearly or worse.
Early context additions dramatically improve accuracy. Past a certain point, returns diminish. Keep adding context beyond that, and accuracy often declines.
This happens for several reasons:
Different context types hit this inflection point at different times:
This relationship is the Context Cost Curve.
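A toy model can help build intuition for the shape of the curve. The functional forms here are assumptions chosen for illustration, not part of the framework: value grows logarithmically with token count (one example of sub-linear growth), cost grows linearly, and `value_scale` and `cost_per_token` are hypothetical parameters with no empirical basis.

```python
import math


def net_value(tokens: int, value_scale: float, cost_per_token: float) -> float:
    # Assumed shapes: logarithmic (sub-linear) value minus linear cost.
    return value_scale * math.log1p(tokens) - cost_per_token * tokens


def best_budget(value_scale: float, cost_per_token: float, max_tokens: int) -> int:
    """Return the token count in [0, max_tokens] that maximizes net value."""
    return max(
        range(max_tokens + 1),
        key=lambda n: net_value(n, value_scale, cost_per_token),
    )
```

Under these assumed forms, the optimum sits near `value_scale / cost_per_token - 1` tokens, so `best_budget(10.0, 0.01, 5000)` returns 999. Real systems will not follow a clean formula. The point is only that a finite best budget exists whenever value flattens while cost keeps rising.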
Not all context deserves equal weight. Reliable systems assemble context deliberately, from the highest authority to the lowest:
As you move down the stack:
A simple rule follows: lower layers should never override higher layers. Most hallucinations are violations of this rule.
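The rule above can be sketched as an assembly step that orders items by authority and, when two items address the same topic, keeps only the higher-authority one. This is a minimal sketch under stated assumptions: the `Tier` enum, the `ContextItem` type, and the idea of detecting conflicts through a shared `key` are all hypothetical simplifications. Real conflicts are rarely this easy to detect.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    COLD = 0  # highest authority
    WARM = 1
    HOT = 2   # lowest authority


@dataclass(frozen=True)
class ContextItem:
    tier: Tier
    key: str   # hypothetical topic label used to detect conflicts
    text: str


def assemble(items: list[ContextItem]) -> str:
    """Order items from highest to lowest authority; on a key conflict, keep the higher tier."""
    chosen: dict[str, ContextItem] = {}
    # sorted() is stable, so within a tier the original order is preserved.
    for item in sorted(items, key=lambda i: i.tier):
        # setdefault keeps the first (highest-authority) item seen for each key.
        chosen.setdefault(item.key, item)
    return "\n".join(f"[{i.tier.name}] {i.text}" for i in chosen.values())
```

With a cold item and a hot item that share the key `refund_policy`, only the cold item reaches the prompt, so a volatile tool result cannot silently replace a system rule.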
Teams that take context seriously do a few simple things consistently:
In practice, most reliability issues in LLM systems do not come from model choice or prompt wording. They come from treating context as an unlimited buffer rather than a constrained design surface.
Once teams become explicit about what information deserves to be present, how long it should live, and how much authority it carries, much of the complexity disappears. Context stops being a source of surprises and becomes something you can reason about, measure, and improve.
If you are building an AI product and finding that reliability, latency, or evaluation quality degrades as you scale, I would welcome the chance to compare notes. I spend a lot of time working with teams navigating these exact problems and am always interested in learning from what others are seeing in the wild. Please reach out!