If you've rewritten the prompt 10 times and the agent still keeps failing, the problem might not be the prompt at all.

In June 2025, Shopify CEO Tobi Lütke said on X that he preferred the term "context engineering" to "prompt engineering." Andrej Karpathy backed him within days: "context engineering is the delicate art and science of filling the context window with just the right information for the next step." Their point: "prompt engineering" undersells the job. The real battle happens at the context layer.

TL;DR
Prompt limits hit · 4 failure modes · Context engineering · GraphRAG · MVC · 5-step start

Here's what everyone believes

When an AI agent gives a weird answer, most people start here: make the system prompt more specific, add few-shot examples, be more explicit about output format. And honestly, this works fine for single-turn tasks.

But when an agent needs to run multiple steps, use tools, remember previous conversations, and pull from internal data — real production work — prompt optimization becomes less and less meaningful. No matter how polished your prompt is, the agent can't know what it doesn't know.

But the data says the opposite

Firecrawl's roundup of recent research is specific: spreading prompt content across multiple conversation turns, instead of providing it upfront, causes an average 39% performance drop. Databricks' testing of Llama 3.1 405B showed accuracy starting to degrade once context exceeded 32,000 tokens.

What you put in the context window — and how — matters far more than how your prompt is phrased.

Elastic nailed the key distinction: "Prompt engineering takes the context window as given. Context engineering actively curates it."

Neo4j's 4 agent context failure modes:

| Failure Mode | What Happens | Symptom |
|---|---|---|
| Context Poisoning | Hallucinations stay in conversation history and keep getting referenced | Errors compound and worsen |
| Context Distraction | Model over-relies on conversation history over training | Ignores training knowledge, repeats wrong answers |
| Context Confusion | Irrelevant information influences responses | Off-topic content bleeds into answers |
| Context Clash | Contradictory information coexists in the context window | Inconsistent answers |

None of these can be fixed with better prompting. Every single one is a question of what goes into context, when, and how.
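
To make "what goes into context" concrete, here is a minimal sketch of a guard against context clash. It assumes every fact headed for the context window is stored as a small record with the turn it arrived in. The `Fact` type, `find_clashes`, `keep_latest`, and the "latest turn wins" policy are all hypothetical choices for illustration, not something Neo4j prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: str
    turn: int  # conversation turn where the fact entered context (assumed bookkeeping)


def find_clashes(facts):
    """Return every (subject, attribute) key that has more than one distinct value."""
    values_by_key = {}
    for fact in facts:
        values_by_key.setdefault((fact.subject, fact.attribute), set()).add(fact.value)
    return {key: values for key, values in values_by_key.items() if len(values) > 1}


def keep_latest(facts):
    """Resolve clashes by keeping only the most recent value per key (an assumed policy)."""
    latest = {}
    for fact in facts:
        key = (fact.subject, fact.attribute)
        if key not in latest or fact.turn >= latest[key].turn:
            latest[key] = fact
    return sorted(latest.values(), key=lambda fact: fact.turn)


facts = [
    Fact("order-17", "status", "shipped", turn=1),
    Fact("order-17", "total", "40 USD", turn=2),
    Fact("order-17", "status", "cancelled", turn=4),
]
clashes = find_clashes(facts)     # flags the two conflicting status values
clean_facts = keep_latest(facts)  # only one status survives into context
```

The point of the sketch is where the fix lives: the contradiction is removed before the model ever sees it, which no amount of prompt rewording can do.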

So what is context engineering?

One sentence: instead of optimizing how you ask the model, you're designing the environment where the model does its work.

Neo4j defines context engineering's scope as retrieval pipeline design, memory strategy, tool schema and policy definition, task state tracking, and reasoning history management. A direct comparison:

| | Prompt Engineering | Context Engineering |
|---|---|---|
| Core question | How should I phrase this? | What does the model need to know? |
| Scope | Single input text | Full information architecture |
| Best for | Single-turn tasks, basic classification | Multi-step agents, long-horizon workflows |
| When it fails | Rephrase the prompt | Redesign retrieval, memory, and tool structure |
| Scale | Personal use, prototyping | Production AI systems |

The key concept is Minimum Viable Context (MVC) — giving the model only the minimum high-signal information it needs. Too much dilutes attention; too little causes hallucination. Just enough.
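
One way to picture MVC is as a budget: rank candidate snippets by relevance and stop adding once the budget is spent. The sketch below assumes you already have a relevance score per snippet, and it counts whitespace-separated words as a stand-in for a real tokenizer. `trim_to_budget` and its parameters are hypothetical names.

```python
def trim_to_budget(snippets, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily keep the highest-scoring snippets that still fit in the budget.

    snippets: list of (relevance_score, text) pairs.
    count_tokens: word count by default, a rough stand-in for a real tokenizer.
    """
    selected = []
    used = 0
    for _score, text in sorted(snippets, key=lambda pair: pair[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= max_tokens:
            selected.append(text)
            used += cost
    return selected, used


snippets = [
    (0.9, "refund policy allows returns within 30 days"),
    (0.2, "company picnic is scheduled for friday afternoon this year"),
    (0.7, "order 17 shipped on monday"),
]
selected, used = trim_to_budget(snippets, max_tokens=12)
```

With a 12-word budget, the two high-signal snippets fit and the picnic notice is left out. In a real system the score would come from your retriever and the budget from the model's context limits.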

5 context elements for an ideal agent call

① User goal ② Most relevant retrieved results only ③ Required tool definitions ④ Applicable policies ⑤ Compressed memory summary — that's all you need.
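
As a rough sketch, those five elements can be treated as one explicit structure that gets rendered into the model's input. The `AgentContext` class, its field names, and the section labels are assumptions made for illustration. The only thing taken from the list above is the five slots themselves.

```python
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    user_goal: str
    retrieved: list = field(default_factory=list)  # most relevant results only
    tools: list = field(default_factory=list)      # required tool definitions
    policies: list = field(default_factory=list)   # applicable policies
    memory_summary: str = ""                       # compressed memory summary

    def render(self):
        """Render non-empty sections as labeled blocks separated by blank lines."""
        sections = [
            ("USER GOAL", [self.user_goal]),
            ("RETRIEVED", self.retrieved),
            ("TOOLS", self.tools),
            ("POLICIES", self.policies),
            ("MEMORY SUMMARY", [self.memory_summary]),
        ]
        blocks = []
        for title, items in sections:
            items = [item for item in items if item]
            if items:
                blocks.append(title + ":\n" + "\n".join("- " + item for item in items))
        return "\n\n".join(blocks)


ctx = AgentContext(
    user_goal="Refund order 17",
    retrieved=["Order 17 shipped Monday"],
    tools=["issue_refund(order_id)"],
    policies=["Refunds over 100 USD need approval"],
    memory_summary="User prefers email",
)
text = ctx.render()
```

Keeping the slots explicit makes it easy to see when something outside the five is creeping into a call.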

GraphRAG: the foundation of context design

Traditional RAG fetches text chunks by vector similarity — isolated pieces that struggle with multi-hop reasoning and introduce noise.

GraphRAG stores entities and relationships in a structured graph. It answers complex questions like "if A affects B, what happens to C?", applies access controls at retrieval time, and traces the relationship path behind each conclusion. Agentic GraphRAG — combining vector search with graph traversal — is now the core architecture for context engineering.
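
Here is a toy sketch of the graph half of that idea: multi-hop traversal that records the relationship path behind each result and drops edges the caller is not allowed to read. The edge list, role names, and `trace_paths` function are hypothetical. A production setup would use a graph database and pair this with vector search, neither of which is shown here.

```python
from collections import deque

# Hypothetical toy graph: (source, relation, target, role required to read the edge)
EDGES = [
    ("Supplier A", "supplies", "Component B", "public"),
    ("Component B", "used_in", "Product C", "public"),
    ("Product C", "margin_tracked_in", "Finance Report", "finance"),
]


def trace_paths(edges, start, roles, max_hops=3):
    """Breadth-first traversal returning every readable relationship path from start."""
    adjacency = {}
    for source, relation, target, required_role in edges:
        if required_role in roles:  # access control applied at retrieval time
            adjacency.setdefault(source, []).append((relation, target))

    paths = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue
        visited = {start} | {target for _, _, target in path}
        for relation, target in adjacency.get(node, []):
            if target in visited:  # skip cycles
                continue
            new_path = path + [(node, relation, target)]
            paths.append(new_path)
            queue.append((target, new_path))
    return paths


public_paths = trace_paths(EDGES, "Supplier A", roles={"public"})
finance_paths = trace_paths(EDGES, "Supplier A", roles={"public", "finance"})
```

A caller with only the `public` role gets the two-hop path from Supplier A to Product C, while the finance edge is visible only to callers holding the `finance` role. Each returned path is the trace behind the answer.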

39%: average performance drop from distributing context across turns
32K: token threshold where accuracy degrades
3x: tool selection accuracy with under 30 tools

How to start right now

  1. Diagnose failure mode first
    Look at agent error logs and conversation history — patterns emerge. Figure out which of the four modes you're dealing with.
  2. Define core knowledge domains
    List what the agent absolutely must know to do its job. This becomes the skeleton of your knowledge graph.
  3. Audit your RAG pipeline
    If you're offering more than 30 tools simultaneously, filter down with RAG. Research shows trimming below 30 boosts tool selection accuracy by 3x.
  4. Separate memory layers
    Design short-term (current session), mid-term (user history), and long-term (domain knowledge) memory separately. Stacking all past conversations causes context distraction.
  5. Trim to Minimum Viable Context
    Deliberately reduce what enters your context window. If performance improves when you put less in, that was context overload.
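
Steps 3 and 4 can be sketched in a few lines. The tool filter below ranks tools by word overlap with the request, which is only a stand-in for the embedding similarity a real retriever would use. The memory class keeps the three layers in separate stores. `select_tools`, `LayeredMemory`, and every field name are hypothetical.

```python
from collections import deque


def select_tools(query, tools, limit=30):
    """Return up to `limit` tool names whose descriptions share words with the query."""
    query_words = set(query.lower().split())
    scored = []
    for name, description in tools.items():
        overlap = len(query_words & set(description.lower().split()))
        if overlap > 0:
            scored.append((overlap, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best overlap first, then by name
    return [name for _, name in scored[:limit]]


class LayeredMemory:
    def __init__(self, short_term_turns=3):
        self.short_term = deque(maxlen=short_term_turns)  # current session, old turns drop off
        self.mid_term = []   # user history, stored as short summaries
        self.long_term = {}  # domain knowledge, keyed by topic

    def add_turn(self, text):
        self.short_term.append(text)

    def context_for(self, topic):
        """Recent turns, user summaries, and only the domain knowledge for this topic."""
        parts = list(self.short_term) + self.mid_term
        if topic in self.long_term:
            parts.append(self.long_term[topic])
        return parts


tools = {
    "issue_refund": "refund a customer order",
    "get_weather": "look up the weather forecast",
    "track_order": "track shipping status of an order",
}
chosen = select_tools("refund my order", tools, limit=2)

memory = LayeredMemory(short_term_turns=2)
for turn in ["t1", "t2", "t3"]:
    memory.add_turn(turn)
memory.mid_term.append("Prefers email")
memory.long_term["refunds"] = "Refunds allowed within 30 days"
refund_context = memory.context_for("refunds")
```

The bounded `deque` is what keeps old turns from piling up, so the oldest turn falls out of the session layer instead of stacking into context.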

This isn't saying prompt engineering is useless

Context engineering is a superset of prompt engineering, not a replacement. Prompt optimization is still very effective for single-turn tasks and rapid prototyping. Context engineering is what you need when agents handle complex multi-step work.

Dive Deeper

Context Engineering vs. Prompt Engineering (Neo4j): The article that lays out the GraphRAG and MVC frameworks. neo4j.com

Context Engineering for AI Agents (Firecrawl): Deep dive on 32K token limits, 4 failure modes, and tool optimization data. firecrawl.dev

Context Engineering vs. Prompt Engineering (Elastic): The "actively curates the context window" practical perspective. elastic.co

Andrej Karpathy on Context Engineering: The X post that popularized the term. x.com

Why GraphRAG and MCP Are the New Standard: Why GraphRAG became the standard for agentic data architecture. hyperight.com

Prompt vs. Context Engineering (FastCampus): Korean-language comparison guide for both concepts. media.fastcampus.co.kr