If you've fixed the prompt ten times and the agent still keeps failing, the problem might not be the prompt at all.
In June 2025, Andrej Karpathy wrote on X that "context engineering is the delicate art and science of filling the context window with just the right information for the next step." Shopify CEO Tobi Lutke made the same case that month. Their point: "prompt engineering" is too narrow a term, because the real battle happens at the context layer.
Here's what everyone believes
When an AI agent gives a weird answer, most people start here: make the system prompt more specific, add few-shot examples, be more explicit about output format. And honestly, this works fine for single-turn tasks.
But when an agent needs to run multiple steps, use tools, remember previous conversations, and pull from internal data — real production work — prompt optimization becomes less and less meaningful. No matter how polished your prompt is, the agent can't know what it doesn't know.
But the data says the opposite
Firecrawl's analysis of recent research is specific: spreading prompt content across multiple conversation turns, instead of providing it all upfront, causes an average 39% performance drop. Databricks' testing of Llama 3.1 405B showed accuracy degrading once context exceeded 32,000 tokens.
What you put in the context window — and how — matters far more than how your prompt is phrased.
Elastic nailed the key distinction: "Prompt engineering takes the context window as given. Context engineering actively curates it."
Neo4j's 4 agent context failure modes:
| Failure Mode | What Happens | Symptom |
|---|---|---|
| Context Poisoning | Hallucinations stay in conversation history and keep getting referenced | Errors compound and worsen |
| Context Distraction | Model over-relies on conversation history over training | Ignores training knowledge, repeats wrong answers |
| Context Confusion | Irrelevant information influences responses | Off-topic content bleeds into answers |
| Context Clash | Contradictory information coexists in the context window | Inconsistent answers |
None of these can be fixed with better prompting. Every single one is a question of what goes into context, when, and how.
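To make one of these modes concrete, here is a minimal sketch of a Context Clash check. It assumes context items have already been reduced to simple subject/value facts; the `Fact` type, the `find_clashes` helper, and the sample data are all hypothetical and only illustrate the idea, not Neo4j's method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """Hypothetical shape for one piece of context."""
    subject: str  # what the fact is about, e.g. "refund_window"
    value: str    # the claimed value, e.g. "30 days"
    source: str   # where it came from, e.g. "policy_doc"


def find_clashes(facts: list[Fact]) -> dict[str, set[str]]:
    """Return every subject that carries more than one distinct value."""
    values_by_subject: dict[str, set[str]] = defaultdict(set)
    for fact in facts:
        values_by_subject[fact.subject].add(fact.value)
    return {s: v for s, v in values_by_subject.items() if len(v) > 1}


# Illustrative data: the chat history contradicts the policy document.
facts = [
    Fact("refund_window", "30 days", "policy_doc"),
    Fact("refund_window", "14 days", "chat_history"),
    Fact("support_email", "help@example.com", "policy_doc"),
]
clashes = find_clashes(facts)
```

A check like this would run before the model call, so the contradiction is resolved (or the weaker source dropped) while assembling context rather than left for the model to guess at.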
So what is context engineering?
One sentence: instead of optimizing how you ask the model, you're designing the environment where the model does its work.
Neo4j defines context engineering's scope as: retrieval pipeline design, memory strategy, tool schema and policy definition, task state tracking, reasoning history management. Direct comparison:
| | Prompt Engineering | Context Engineering |
|---|---|---|
| Core question | How should I phrase this? | What does the model need to know? |
| Scope | Single input text | Full information architecture |
| Best for | Single-turn tasks, basic classification | Multi-step agents, long-horizon workflows |
| When it fails | Rephrase the prompt | Redesign retrieval, memory, and tool structure |
| Scale | Personal use, prototyping | Production AI systems |
The key concept is Minimum Viable Context (MVC) — giving the model only the minimum high-signal information it needs. Too much dilutes attention; too little causes hallucination. Just enough.
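One way to picture MVC is as a selection problem: keep the highest-signal snippets that fit a token budget and drop the rest. The sketch below assumes relevance scores already exist (from a retriever or reranker) and uses a crude word count in place of a real tokenizer; `Snippet`, `estimate_tokens`, `trim_to_budget`, and the threshold values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    relevance: float  # assumed to come from your retriever or reranker


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_to_budget(
    snippets: list[Snippet], budget: int, min_relevance: float = 0.5
) -> list[Snippet]:
    """Greedily keep the most relevant snippets that fit in the token budget."""
    kept: list[Snippet] = []
    used = 0
    for s in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        if s.relevance < min_relevance:
            break  # everything after this point is lower-signal
        cost = estimate_tokens(s.text)
        if used + cost <= budget:
            kept.append(s)
            used += cost
    return kept


candidates = [
    Snippet("Refunds are accepted within 30 days of delivery.", 0.92),
    Snippet("Our office dog is named Biscuit.", 0.10),
    Snippet("Refund requests need the original order number.", 0.81),
    Snippet("Shipping is free on orders above 50 dollars in most regions.", 0.55),
]
selected = trim_to_budget(candidates, budget=16)
```

With a budget of 16 estimated tokens, only the two refund snippets survive: the shipping snippet is relevant enough but does not fit, and the office-dog snippet falls below the relevance floor.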
5 context elements for an ideal agent call
① User goal ② Most relevant retrieved results only ③ Required tool definitions ④ Applicable policies ⑤ Compressed memory summary — that's all you need.
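The five elements map naturally onto a small data structure. This is a minimal sketch, assuming a plain-text rendering with one labeled section per element; the `AgentContext` class, its section labels, and the sample values are hypothetical and not tied to any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    """Hypothetical container for the five elements of one agent call."""
    user_goal: str
    retrieved: list[str] = field(default_factory=list)   # most relevant results only
    tools: list[str] = field(default_factory=list)       # required tool definitions
    policies: list[str] = field(default_factory=list)    # applicable policies
    memory_summary: str = ""                              # compressed memory

    def render(self) -> str:
        """Render non-empty sections in a fixed order."""
        sections = [
            ("GOAL", [self.user_goal]),
            ("RETRIEVED", self.retrieved),
            ("TOOLS", self.tools),
            ("POLICIES", self.policies),
            ("MEMORY", [self.memory_summary] if self.memory_summary else []),
        ]
        blocks = []
        for title, items in sections:
            if items:  # skip empty sections instead of padding the window
                body = "\n".join(f"- {item}" for item in items)
                blocks.append(f"## {title}\n{body}")
        return "\n\n".join(blocks)


ctx = AgentContext(
    user_goal="Refund order 1042",
    retrieved=["Refunds are accepted within 30 days of delivery."],
    tools=["lookup_order(order_id)", "issue_refund(order_id, amount)"],
    policies=["Refunds over 500 dollars need human approval."],
    memory_summary="Customer reported a damaged item in the previous session.",
)
prompt = ctx.render()
```

Keeping the structure explicit makes it easy to see, per call, which of the five slots is empty or overloaded.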
GraphRAG: the foundation of context design
Traditional RAG fetches text chunks by vector similarity — isolated pieces that struggle with multi-hop reasoning and introduce noise.
GraphRAG stores entities and relationships in a structured graph. That lets it answer multi-hop questions like "if A affects B, what happens to C?", apply access controls at retrieval time, and trace the relationship path behind each conclusion. Agentic GraphRAG, which combines vector search with graph traversal, is becoming a core architecture for context engineering.
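The graph side of this can be sketched without a database. The example below assumes a tiny in-memory edge list where each edge carries the role needed to read it; the entity names, roles, and `find_path` helper are invented for illustration, and a real system would use a graph store plus vector search rather than this toy traversal.

```python
from collections import deque
from typing import Optional

# Hypothetical edges: (source, relation, target, role required to read the edge)
EDGES = [
    ("supplier_A", "supplies", "component_B", "public"),
    ("component_B", "used_in", "product_C", "public"),
    ("product_C", "has_margin", "margin_report", "finance"),
]

Path = list[tuple[str, str, str]]


def find_path(start: str, goal: str, roles: set[str]) -> Optional[Path]:
    """Breadth-first search over the edges the caller is allowed to read."""
    adjacency: dict[str, list[tuple[str, str]]] = {}
    for src, rel, dst, required_role in EDGES:
        if required_role in roles:  # access control applied at retrieval time
            adjacency.setdefault(src, []).append((rel, dst))

    queue: deque[tuple[str, Path]] = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path  # the relationship trail behind the answer
        for rel, nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None


public_path = find_path("supplier_A", "product_C", {"public"})
blocked = find_path("supplier_A", "margin_report", {"public"})
finance_path = find_path("supplier_A", "margin_report", {"public", "finance"})
```

The returned path doubles as the explanation: each hop is a (source, relation, target) triple, and a caller without the `finance` role never sees the margin edge at all.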
How to start right now
- Diagnose failure mode first
  Look at agent error logs and conversation history; patterns emerge. Figure out which of the four modes you're dealing with.
- Define core knowledge domains
  List what the agent absolutely must know to do its job. This becomes the skeleton of your knowledge graph.
- Audit your RAG pipeline
  If you're offering more than 30 tools simultaneously, filter down with RAG. Research shows trimming below 30 boosts tool selection accuracy by 3x.
- Separate memory layers
  Design short-term (current session), mid-term (user history), and long-term (domain knowledge) memory separately. Stacking all past conversations causes context distraction.
- Trim to Minimum Viable Context
  Deliberately reduce what enters your context window. If performance improves when you put less in, that was context overload.
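The tool-filtering step can be sketched as a ranking problem: score each tool description against the request and pass only the top matches to the model. The example below uses a toy word-overlap score as a stand-in for embedding-based retrieval; the `score` and `select_tools` helpers and the tool catalog are hypothetical.

```python
def score(query: str, description: str) -> int:
    """Toy relevance score: number of lowercase words shared by query and description."""
    return len(set(query.lower().split()) & set(description.lower().split()))


def select_tools(query: str, tools: dict[str, str], limit: int = 30) -> list[str]:
    """Return up to `limit` tool names, most relevant first, dropping zero-score tools."""
    ranked = sorted(tools, key=lambda name: score(query, tools[name]), reverse=True)
    return [name for name in ranked if score(query, tools[name]) > 0][:limit]


# Hypothetical tool catalog: name -> one-line description
TOOLS = {
    "issue_refund": "issue a refund for an order",
    "lookup_order": "look up an order by id",
    "send_newsletter": "send marketing newsletter to subscribers",
    "reset_password": "reset a user password",
}
chosen = select_tools("refund the customer for order 1042", TOOLS, limit=2)
```

In practice the scoring function would be a vector similarity search over tool descriptions, but the shape stays the same: rank, cut off, and expose only the short list for this call.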
This isn't saying prompt engineering is useless
Context engineering is a superset of prompt engineering, not a replacement. Prompt optimization is still very effective for single-turn tasks and rapid prototyping. Context engineering is what you need when agents handle complex multi-step work.
Dive Deeper
- Context Engineering vs. Prompt Engineering (Neo4j): The original article where the GraphRAG and MVC frameworks were first laid out. neo4j.com
- Context Engineering for AI Agents (Firecrawl): Deep dive on 32K token limits, 4 failure modes, and tool optimization data. firecrawl.dev
- Context Engineering vs. Prompt Engineering (Elastic): The "actively curates the context window" practical perspective. elastic.co
- Andrej Karpathy on Context Engineering: The original X post that kicked off this paradigm shift. x.com
- Why GraphRAG and MCP Are the New Standard: Why GraphRAG became the standard for agentic data architecture. hyperight.com
- Prompt vs. Context Engineering (FastCampus): Korean-language comparison guide for both concepts. media.fastcampus.co.kr




