If You've Fixed the Prompt 10 Times and the Agent Still Fails — Karpathy Said This a Year Ago


If you've fixed the prompt 10 times and the agent still keeps failing, it might not be the prompt at all.

In June 2025, Andrej Karpathy wrote on X that "context engineering is the delicate art and science of filling the context window with just the right information for the next step." He was endorsing Shopify CEO Tobi Lütke, who had argued days earlier that "context engineering" describes the core skill better than "prompt engineering." Their shared point: the real battle happens at the context layer.

TL;DR

Prompt limits hit → 4 failure modes → Context engineering → GraphRAG · MVC → 5-step start

Here's what everyone believes

When an AI agent gives a weird answer, most people start here: make the system prompt more specific, add few-shot examples, be more explicit about output format. And honestly, this works fine for single-turn tasks.

But when an agent needs to run multiple steps, use tools, remember previous conversations, and pull from internal data — real production work — prompt optimization becomes less and less meaningful. No matter how polished your prompt is, the agent can't know what it doesn't know.

But the data says the opposite

Firecrawl's analysis of recent research is specific: spreading prompt content across multiple conversation turns, instead of providing it upfront, causes an average 39% performance drop. Databricks' testing of Llama 3.1 405B showed accuracy degrading once context exceeded 32,000 tokens.

What you put in the context window — and how — matters far more than how your prompt is phrased.

Elastic nailed the key distinction: "Prompt engineering takes the context window as given. Context engineering actively curates it."

Neo4j's 4 agent context failure modes:

Failure Mode	What Happens	Symptom
Context Poisoning	Hallucinations stay in conversation history and keep getting referenced	Errors compound and worsen
Context Distraction	Model leans on accumulated conversation history instead of its training	Ignores training knowledge, repeats wrong answers
Context Confusion	Irrelevant information influences responses	Off-topic content bleeds into answers
Context Clash	Contradictory information coexists in the context window	Inconsistent answers

None of these can be fixed with better prompting. Every single one is a question of what goes into context, when, and how.
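
One way to make a failure mode like context clash visible is to check the context for contradictions before the model sees it. The sketch below is illustrative only: it assumes context items have already been normalized into a hypothetical `Fact` record (subject, attribute, value, source), which is not something the sources above prescribe.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    """Hypothetical normalized context item: one claim about one subject."""
    subject: str
    attribute: str
    value: str
    source: str


def find_clashes(facts: list[Fact]) -> dict[tuple[str, str], set[str]]:
    """Return every (subject, attribute) pair that carries conflicting values."""
    values = defaultdict(set)
    for fact in facts:
        values[(fact.subject, fact.attribute)].add(fact.value)
    # A pair with more than one distinct value is a candidate context clash.
    return {key: vals for key, vals in values.items() if len(vals) > 1}


# Made-up example data: a tool result and stale chat history disagree.
context = [
    Fact("order-17", "status", "shipped", "tool:orders_api"),
    Fact("order-17", "status", "pending", "chat_history"),
    Fact("order-17", "carrier", "DHL", "tool:orders_api"),
]

clashes = find_clashes(context)
print(clashes)
```

Once a clash is flagged, the fix is a context decision (drop the stale item, or prefer the tool result over chat history), not a prompt rewrite.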

So what is context engineering?

One sentence: instead of optimizing how you ask the model, you're designing the environment where the model does its work.

Neo4j defines context engineering's scope as: retrieval pipeline design, memory strategy, tool schema and policy definition, task state tracking, reasoning history management. Direct comparison:

	Prompt Engineering	Context Engineering
Core question	How should I phrase this?	What does the model need to know?
Scope	Single input text	Full information architecture
Best for	Single-turn tasks, basic classification	Multi-step agents, long-horizon workflows
When it fails	Rephrase the prompt	Redesign retrieval, memory, and tool structure
Scale	Personal use, prototyping	Production AI systems

The key concept is Minimum Viable Context (MVC) — giving the model only the minimum high-signal information it needs. Too much dilutes attention; too little causes hallucination. Just enough.
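
A minimal sketch of the MVC idea, under stated assumptions: each candidate snippet carries a hypothetical `relevance` score (from a retriever or reranker), token cost is approximated by a whitespace word count rather than a real tokenizer, and the `min_relevance` cutoff is an arbitrary illustrative value.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    relevance: float  # hypothetical retriever/reranker score in [0, 1]


def estimate_tokens(text: str) -> int:
    """Crude word count; a real system would use the model's tokenizer."""
    return len(text.split())


def minimum_viable_context(snippets, budget_tokens, min_relevance=0.5):
    """Greedily keep the highest-relevance snippets that fit the token budget."""
    kept, used = [], 0
    for s in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        if s.relevance < min_relevance:
            break  # everything after this point is lower-signal
        cost = estimate_tokens(s.text)
        if used + cost <= budget_tokens:
            kept.append(s)
            used += cost
    return kept


# Made-up example data.
snippets = [
    Snippet("Refunds are allowed within 30 days of delivery.", 0.92),
    Snippet("The company was founded in 2009 in Toronto.", 0.15),
    Snippet("Refund requests need the original order number.", 0.81),
    Snippet("Our office dog is named Biscuit.", 0.05),
]

selected = minimum_viable_context(snippets, budget_tokens=20)
for s in selected:
    print(s.text)
```

The relevance floor guards against "too much" (low-signal filler), while the budget keeps the context small even when many snippets score well.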

5 context elements for an ideal agent call

① User goal ② Most relevant retrieved results only ③ Required tool definitions ④ Applicable policies ⑤ Compressed memory summary — that's all you need.
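
The five elements above can be sketched as a small data structure that renders one agent call. The class name, field names, and section labels are assumptions made for illustration; the article does not define a concrete format.

```python
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    """Hypothetical container for the five context elements."""
    user_goal: str
    retrieved: list[str] = field(default_factory=list)
    tools: list[dict] = field(default_factory=list)
    policies: list[str] = field(default_factory=list)
    memory_summary: str = ""

    def render(self) -> str:
        """Render the non-empty sections as plain text for the model."""
        sections = [
            ("GOAL", [self.user_goal]),
            ("RETRIEVED", self.retrieved),
            ("TOOLS", [f"{t['name']}: {t['description']}" for t in self.tools]),
            ("POLICIES", self.policies),
            ("MEMORY", [self.memory_summary] if self.memory_summary else []),
        ]
        blocks = []
        for title, lines in sections:
            if lines:  # leave out empty sections instead of padding the window
                body = "\n".join(f"- {line}" for line in lines)
                blocks.append(f"## {title}\n{body}")
        return "\n\n".join(blocks)


# Made-up example data.
ctx = AgentContext(
    user_goal="Refund order 17",
    retrieved=["Refunds are allowed within 30 days of delivery."],
    tools=[{"name": "refund_order", "description": "issue a refund for an order"}],
    policies=["Refunds above $500 need manager approval."],
    memory_summary="User reported a damaged item yesterday.",
)

prompt = ctx.render()
print(prompt)
```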

GraphRAG: the foundation of context design

Traditional RAG fetches text chunks by vector similarity — isolated pieces that struggle with multi-hop reasoning and introduce noise.

GraphRAG stores entities and relationships in a structured graph. That lets it answer multi-hop questions like "if A affects B, what happens to C?", apply access controls at retrieval time, and trace the relationship path behind each conclusion. Agentic GraphRAG, which combines vector search with graph traversal, is emerging as a core architecture for context engineering.
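
To make the graph side concrete, here is a toy sketch of multi-hop traversal with retrieval-time access control and a traceable path. It uses a hand-written in-memory edge list rather than a real graph database, the entity names and roles are made up, and the vector-search half of Agentic GraphRAG is left out.

```python
from collections import deque

# Hypothetical toy graph: (source, relation, target, role required to read the edge)
EDGES = [
    ("Supplier A", "SUPPLIES", "Component B", "analyst"),
    ("Component B", "USED_IN", "Product C", "analyst"),
    ("Product C", "SOLD_TO", "Customer D", "sales"),
]


def find_path(start, goal, roles):
    """Breadth-first search that only follows edges the caller may read.

    Returns the list of (source, relation, target) hops, or None if the
    goal is unreachable with the given roles.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for src, rel, dst, required in EDGES:
            if src == node and required in roles and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(src, rel, dst)]))
    return None


path = find_path("Supplier A", "Product C", roles={"analyst"})
blocked = find_path("Supplier A", "Customer D", roles={"analyst"})
print(path)
print(blocked)
```

The returned hop list is what makes the answer explainable: it can be placed in the context alongside the conclusion, and edges the caller is not allowed to read never reach the model.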

39%: average performance drop when prompt content is spread across turns

32K: token threshold where accuracy starts to degrade

3x: tool selection accuracy gain with fewer than 30 tools

How to start right now

1. Diagnose the failure mode first: Look at agent error logs and conversation history until patterns emerge, then work out which of the four modes you're dealing with.
2. Define core knowledge domains: List what the agent absolutely must know to do its job. This becomes the skeleton of your knowledge graph.
3. Audit your RAG pipeline: If you're offering more than 30 tools simultaneously, use retrieval to filter them down to the relevant ones for each request. Research shows trimming below 30 boosts tool selection accuracy by 3x.
4. Separate memory layers: Design short-term (current session), mid-term (user history), and long-term (domain knowledge) memory separately. Stacking all past conversations causes context distraction.
5. Trim to Minimum Viable Context: Deliberately reduce what enters your context window. If performance improves when you put less in, the problem was context overload.
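
The tool-filtering step above can be sketched as follows. This is a simplified illustration: word overlap stands in for the embedding similarity a real retrieval step would use, the tool definitions are made up, and the default cap of 30 simply mirrors the threshold quoted in this article.

```python
def score(query: str, description: str) -> int:
    """Count query words that appear in the tool description.

    A stand-in for embedding similarity, used here to stay dependency-free.
    """
    query_words = set(query.lower().split())
    return len(query_words & set(description.lower().split()))


def shortlist_tools(query, tools, max_tools=30):
    """Keep only tools with some overlap, best matches first, capped at max_tools."""
    scored = [(score(query, t["description"]), t) for t in tools]
    relevant = [(s, t) for s, t in scored if s > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in relevant[:max_tools]]


# Made-up tool catalog.
tools = [
    {"name": "refund_order", "description": "issue a refund for an order"},
    {"name": "get_weather", "description": "look up the weather forecast for a city"},
    {"name": "track_order", "description": "track the shipping status of an order"},
]

shortlist = shortlist_tools("refund my order", tools, max_tools=2)
print([t["name"] for t in shortlist])
```

Only the shortlisted tool definitions are then placed in the context window for that call; the full catalog stays outside it.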

This isn't saying prompt engineering is useless

Context engineering is a superset of prompt engineering, not a replacement. Prompt optimization is still very effective for single-turn tasks and rapid prototyping. Context engineering is what you need when agents handle complex multi-step work.

Dive Deeper

Context Engineering vs. Prompt Engineering (Neo4j): The original article where the GraphRAG and MVC frameworks were first laid out. neo4j.com

Context Engineering for AI Agents (Firecrawl): Deep dive on 32K token limits, 4 failure modes, and tool optimization data. firecrawl.dev

Context Engineering vs. Prompt Engineering (Elastic): The "actively curates the context window" practical perspective. elastic.co

Andrej Karpathy on Context Engineering (X): The post that popularized the term. x.com

Why GraphRAG and MCP Are the New Standard (Hyperight): Why GraphRAG became the standard for agentic data architecture. hyperight.com

Prompt vs. Context Engineering (FastCampus): Korean-language comparison guide for both concepts. media.fastcampus.co.kr

FAQ

Do I need to learn context engineering now? I already write good prompts.

Not immediately, if you only use AI for single-turn tasks or basic chatbots. The signal that context engineering matters is when your agent runs multi-step workflows, or needs to reference internal data or real-time information. That's when you shift focus from prompt wording to context architecture.

Is context engineering the same as RAG?

RAG is one tool within context engineering, not the same thing. Context engineering is a broader discipline that includes RAG plus memory systems, tool schema design, policy filtering, and task state tracking. GraphRAG is an evolution of RAG that adds structured relationship data.

Is this relevant for a solo developer or small startup?

If you're using AI tools purely for personal productivity, not right now. But if you're building AI agents into a product or service, not understanding context design means you can't even diagnose why the agent misbehaves. For now, just understanding the concepts is enough.

Context windows keep getting longer. Does that make context engineering less important?

Actually the opposite. Longer windows let you put more in, but models still struggle to maintain focus over very long contexts. Databricks' testing of Llama 3.1 405B showed accuracy degrading past 32K tokens. Getting value from a longer window requires more precise context design, not less.

What tools do I need for GraphRAG?

Neo4j is one of the most commonly used graph databases, and Graphiti is an open-source framework for building knowledge graphs on top of a graph database such as Neo4j. Frameworks like LangChain and LlamaIndex offer graph-based retrieval integrations. If you're starting out, Graphiti is a good choice: it is open source, so there is no licensing cost.

Written by Rush

Tracking where business meets AI.
