A counterintuitive benchmark result that changed my approach
The MCPMark V2 benchmarks revealed something I did not expect.
When Claude moved from Sonnet 4.5 to Sonnet 4.6, backend token usage through Supabase's MCP server went up - from 11.6 million to 17.9 million tokens across 21 database tasks. The new model is smarter. It should be more efficient. And yet the bill was higher.
At first glance, this makes no sense. A more intelligent model should solve problems faster, use fewer iterations, make fewer mistakes. Greater reasoning capability should translate into lower token usage.
But that is not what happens - and understanding why is the key to cutting the wasted spend in your Claude Code sessions.
The real reason: the model is not the problem
The reason is subtle and has nothing to do with the model itself.
It has everything to do with how the backend exposes information to the agent.
When context is incomplete, a more capable model does not simply skip the gap. It explores it. It spends more tokens reasoning about the gap, runs more discovery queries, and retries more frequently. The missing context does not disappear with a better model. It gets more expensive.
This is a fundamental shift in how to think about AI cost optimization. It is not about choosing a cheaper model or shortening prompts. It is about what you get back from your tools.
Karpathy put it precisely: "context engineering is the delicate art and science of filling the context window with just the right information for the next step." Notice that he includes tools and state as part of that context - not just prompts.
Most people apply context engineering to prompts and RAG retrieval. The backend is also part of the context window. And it is the part almost nobody is optimizing.
Why Supabase's MCP server wastes tokens
Supabase is a great backend. But it was not designed to be operated by AI agents - and the MCP server added later inherits that limitation.
Three specific mechanisms cause the token bloat.
Problem 1: Documentation retrieval returns everything
When Claude Code needs to set up Google OAuth through Supabase, it invokes the search_docs MCP tool.
Supabase's implementation returns full GraphQL schema metadata on every call - containing 5-10x more tokens than the agent actually needs.
If the agent asked for OAuth setup instructions, it got the entire authentication docs - including sections on email/password, magic links, phone auth, SAML, and SSO.
This happens on every search_docs call - for database queries, storage configuration, and edge function deployment. Each call dumps the full metadata for that entire domain.
Across a session where the agent sets up auth, database, storage, and functions, the docs overhead alone can account for thousands of wasted tokens.
Problem 2: No visibility into backend state
When you use Supabase as a human developer, you open the dashboard and see everything at a glance - active auth providers, tables, RLS policies, configured storage buckets, deployed edge functions.
An agent cannot see the dashboard.
Supabase's MCP server does expose some state through individual tools like list_tables and execute_sql, but there is no way to ask "what does my entire backend look like right now?" and get one structured response.
So the agent pieces it together through multiple calls. Each call returns a partial view. Some information - like which auth providers are configured - is not available through MCP at all.
This fragmented discovery process costs tokens, and the agent often needs several attempts because the information comes back incomplete or in a format that requires further queries to interpret.
Problem 3: Errors without structured context
When something goes wrong, Supabase returns raw error messages - a 403 from an RLS denial, a 500 from a misconfigured edge function.
A human developer would look at it, check the Supabase dashboard, cross-reference with the logs, and fix the issue.
The agent does not have that path. It gets the error message, reasons about what might have caused it, and tries a fix. If the fix is wrong, it retries. Each retry re-sends the entire conversation history and compounds the token cost.
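A back-of-envelope model shows why retries compound rather than merely add up. The token figures below are assumptions for illustration, not measurements from the benchmark:

```typescript
// Why retry loops compound: every call re-sends the whole history,
// and the history grows with each retry. Numbers are assumed, not measured.
const baseHistory = 50_000; // tokens in the conversation before the first retry
const perRetry = 8_000;     // tokens each retry adds (logs, diffs, reasoning)

let history = baseHistory;
let total = 0;
for (let retry = 1; retry <= 8; retry++) {
  total += history; // the entire history is re-sent on this call
  history += perRetry; // and it is longer for the next one
}

// 8 retries cost 624,000 input tokens, not 8 x 50,000 = 400,000
console.log(total);
```

The cost grows roughly quadratically with the number of retries, which is why long debugging loops dominate session cost.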
These three mechanisms - doc overhead, fragmented state discovery, and error retry loops - compound fast. A model that reasons more extensively, like Sonnet 4.6, makes each exploration step more thorough and more expensive. That is why the token gap widened from Sonnet 4.5 to 4.6 - and it will likely widen further with each new model release.
What backend context engineering should look like
The fix is not switching to another model.
It is giving the agent a structured backend context so it does not have to explore and guess.
This is context engineering in Karpathy's sense, applied one level down: the tool results and backend state that flow into the context window matter as much as the prompt does. That layer is the one almost nobody is optimizing.
To see what this looks like in practice, consider the InsForge project (open source, 8k GitHub stars), which implements this approach. It provides the same primitives as Supabase - Postgres with pgvector, auth, storage, edge functions, realtime - but structures the information layer so agents can consume it efficiently.
The key architectural difference is how it delivers context to Claude Code. Three layers work together:
- Skills for static knowledge
- CLI for direct backend operations
- MCP for live state inspection
Each layer solves a different problem and reduces tokens for a different reason.
Layer 1: Skills - static knowledge with zero round-trips
The primary approach for knowledge is Skills. They load directly into the agent's context at session start, so the SDK patterns, code examples, and edge cases for every backend operation are available without any tool calls.
Skills also use progressive disclosure - only the metadata (name, description, around 70-150 tokens per skill) loads initially. The full skill content loads only when the agent determines it matches the current task. This means you can have 100+ skills installed without context bloat - which is not possible with MCP's all-or-nothing schema loading.
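A rough comparison makes the point. The per-skill metadata figure below sits in the 70-150 token range quoted above; the full-skill size is an assumption for illustration:

```typescript
// Progressive disclosure, back of the envelope. The ~120-token metadata
// figure is within the quoted 70-150 range; the full-skill size is assumed.
const installedSkills = 100;
const metadataTokens = 120;    // per skill, always resident
const fullSkillTokens = 5_000; // assumed size of one fully loaded skill

const idleCost = installedSkills * metadataTokens;      // 12,000 tokens at rest
const oneActive = idleCost + fullSkillTokens;           // 17,000 with one skill loaded
const allOrNothing = installedSkills * fullSkillTokens; // 500,000 if everything loads

console.log(idleCost, oneActive, allOrNothing);
```

Under these assumptions, 100 installed skills with one active cost less than 4% of what loading everything would.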
Four skills cover the full stack, each scoped to a specific domain:
- insforge - for frontend code that talks to the backend
- insforge-cli - for backend infrastructure management
- insforge-debug - for structured error diagnosis across common failures like auth errors, slow queries, edge function failures, RLS denials, deployment issues, and performance degradation
- insforge-integrations - for third-party auth providers (Clerk, Auth0, WorkOS, Kinde, Stytch)
Install all four with one command:
npx skills add insforge/insforge-skills

The key difference from Supabase: when you are writing frontend code, only insforge activates. When you are creating tables, only insforge-cli activates. When something breaks, only insforge-debug activates. The other three remain at metadata-only cost.
Supabase ships one broad skill that triggers on any task involving Supabase - covering databases, auth, edge functions, realtime, storage, vectors, cron, queues, client libraries, SSR integrations, CLI, MCP, schema changes, migrations, and Postgres extensions. When the Supabase skill activates, all its content loads because the trigger conditions cover almost the entire product surface.
Layer 2: CLI - direct execution with structured feedback
For actually executing backend operations - creating tables, running SQL, deploying functions, managing secrets - the InsForge CLI is the primary interface.
Every command supports --json for structured output, -y to skip confirmation prompts, and returns semantic exit codes so agents can detect auth failures, missing projects, or permission errors programmatically.
This is helpful because Claude Code can pipe CLI output through jq, grep, and awk in ways that would require multiple sequential MCP tool calls.
Benchmarks from Scalekit showed CLI + Skills achieving near-100% success rates with 10-35x better token efficiency than equivalent MCP setups for single-user workflows.
Here are some example operations the agent actually runs:
# Inspect backend state (run first to discover what is configured)
npx @insforge/cli metadata --json
# Database operations
npx @insforge/cli db query "CREATE TABLE posts (...)" --json
npx @insforge/cli db policies # inspect existing RLS policies
# Edge functions
npx @insforge/cli functions deploy my-handler
npx @insforge/cli functions invoke my-handler --data '{"action":"test"}' --json
# Storage
npx @insforge/cli storage create-bucket documents --json
npx @insforge/cli storage upload ./file.pdf --bucket documents
# Frontend deployment
npx @insforge/cli deployments env set VITE_INSFORGE_URL https://...
npx @insforge/cli deployments deploy ./dist --json
# Diagnostics
npx @insforge/cli diagnose db --check connections,locks,slow-queries

The agent parses the JSON and handles errors based on exit codes. No guessing. No ambiguous error messages. Every operation returns structured confirmation of what happened.
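The exit-code-plus-JSON pattern is easy to sketch. The helper below is ours, not part of the InsForge CLI, and the stand-in command just echoes JSON so the sketch is self-contained:

```typescript
// Run a command, branch on the exit code, parse JSON on success.
// runJson is a hypothetical helper; the demo command stands in for a real CLI.
import { spawnSync } from "node:child_process";

type CliResult = { ok: true; data: unknown } | { ok: false; code: number };

function runJson(cmd: string, args: string[]): CliResult {
  const res = spawnSync(cmd, args, { encoding: "utf8" });
  if (res.status !== 0) {
    // a semantic exit code tells the agent *what* failed - no log archaeology
    return { ok: false, code: res.status ?? -1 };
  }
  return { ok: true, data: JSON.parse(res.stdout) };
}

// Stand-in for something like `npx @insforge/cli storage list --json`:
const result = runJson("node", [
  "-e",
  'console.log(JSON.stringify({ buckets: ["documents"] }))',
]);
```

On failure the agent branches on the code; on success it reads structured fields instead of scraping log text.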
Layer 3: MCP - only for live state inspection
MCP is still useful, but for a narrower purpose - inspecting the current state of your backend when that state is changing.
InsForge's MCP server exposes a lightweight get_backend_metadata tool that returns a structured JSON with the full backend topology in a single call:
{
"auth": {
"providers": ["google", "github"],
"jwt_secret": "configured"
},
"tables": [
{"name": "users", "columns": ["id", "email", "created_at"], "rls": "enabled"},
{"name": "posts", "columns": ["id", "title", "body", "author_id"], "rls": "enabled"}
],
"storage": { "buckets": ["avatars", "documents"] },
"ai": { "models": [{"id": "gpt-4o", "capabilities": ["chat", "vision"]}] },
"hints": ["Use RPC for batch operations", "Storage accepts files up to 50MB"]
}

In one call and around 500 tokens, the agent knows the full backend topology. The hints field provides agent-specific guidance that reduces incorrect API usage.
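With that payload in context, configuration checks become a field read instead of a discovery query. The `meta` object below simply mirrors the shape of the example above:

```typescript
// Branch on backend state directly instead of probing for it.
// Shape and values copied from the metadata example above.
const meta = {
  auth: { providers: ["google", "github"] },
  storage: { buckets: ["avatars", "documents"] },
};

const hasGoogleOAuth = meta.auth.providers.includes("google");
const hasDocsBucket = meta.storage.buckets.includes("documents");
// No tool calls, no retries: the answer was already in the context window.
```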
The key design choice here is that MCP is used for state inspection - which changes as the agent works - not for documentation retrieval, which does not change. This inverts the typical usage pattern and is the main reason InsForge consumes far fewer tokens than Supabase on equivalent tasks.
Practical test: building DocuRAG on two backends
To make this concrete, Avi Chawla built the same application using Claude Code on both backends and recorded the full session.
The application is called DocuRAG. Users sign in via Google OAuth, upload PDFs, the system chunks and embeds the text (text-embedding-3-small, 1536 dimensions), stores the vectors in pgvector, and users ask natural-language questions answered via GPT-4o.
This touches nearly every backend primitive at once: user auth, file storage, a documents table, vector embeddings, embedding generation, chat completion, a retrieval edge function, and RLS to isolate each user's documents.
Supabase setup
Step 1: Create an account and a new project. Step 2: Connect the MCP server to Claude Code:
claude mcp add --scope project --transport http supabase \
"https://mcp.supabase.com/mcp?project_ref=<your-project-ref>"
claude /mcp

Step 3: Install Supabase's Agent Skills (marked as Optional in the official setup):
npx skills add supabase/agent-skills

This installs two skills: supabase - a broad catch-all covering database, auth, edge functions, realtime, storage, vectors, cron, queues, client libraries, and SSR integrations; and supabase-postgres-best-practices, covering Postgres performance optimization across 8 categories.
InsForge setup
Step 1: Create an account and a new project (you can also self-host locally using Docker Compose). Step 2: Install all four Skills:
npx skills add insforge/insforge-skills

This installs insforge (SDK patterns), insforge-cli (infrastructure commands), insforge-debug (failure diagnostics), and insforge-integrations (third-party auth providers). Step 3: Link the CLI to your project:
npx @insforge/cli link --project-id <project-id>

Each of the four skills is narrowly scoped to its own domain, and full skill content loads only for the one that matches the current task.
What happened in the Supabase session: 10.4M tokens, $9.21
The initial build went smoothly. The agent loaded the supabase skill, discovered the backend state via MCP tools, scaffolded the Next.js project, created the database schema, wrote two edge functions, and deployed everything. The build passed.
First problem: login did not work
When I tried to sign in with Google OAuth, the app threw an error. The agent had wired the authentication using the wrong Supabase client library for Next.js.
In Next.js, the OAuth callback runs on the server, but the agent used a client-side library that stores login state in the browser. The browser state is not available on the server, so the login flow broke.
The agent fixed this by switching to a different library (@supabase/ssr), rewriting how the app handles login sessions, and rebuilding.
Second problem: document upload took 8 rounds to fix
After the login was fixed, I tried uploading a document. The edge function returned an error. I reported it, the agent tried a fix, and the upload failed again with the same error. This cycle repeated 8 times:
- Agent tried adding auth headers manually - same error
- Redeployed with extra logging to see what was happening - same error
- Tried showing the real error message instead of the generic one - different error (now a network/CORS issue)
- Fixed the CORS issue - back to the original error
- Tried a different way of reading the user's login token - same error
- Tried yet another authentication approach - same error
After 8 failed attempts, the agent finally figured out what was going on: the 401 errors were happening at the platform's verify_jwt gate before the code even ran.
In plain terms: Supabase has a security layer that checks login tokens before the edge function code even starts. The new auth library installed by the agent - to fix the first problem - was sending a token format that this security layer did not recognize. So every request was being rejected at the door before the function code had a chance to run. That is why none of the code-level fixes worked.
The agent spent 8 rounds fixing code-level issues when the problem was upstream of the code entirely. The solution was simple: turn off the platform's automatic token checking and handle authentication inside the function code instead.
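Concretely, when deploying with the Supabase CLI this is a per-function setting in supabase/config.toml. This is a sketch - the function name here is hypothetical, and hosted-dashboard deployments configure the same toggle through the dashboard; check the current Supabase docs:

```toml
# supabase/config.toml - disable platform-level JWT verification for one
# function (hypothetical name); auth is then handled inside the function code
[functions.upload-handler]
verify_jwt = false
```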
But during this debugging process, the edge function was redeployed 8 times. Each redeployment, log check, and retry re-sent the entire growing conversation history, compounding the token cost.
Final Supabase session stats: 12 user messages (10 were error reports), 135 tool calls, 30+ MCP tool calls, 10.4 million tokens, cost $9.21.
What happened in the InsForge session: 3.7M tokens, $2.81
The InsForge session completed without any errors that required intervention.
The agent started by inspecting the backend state. Its first action was npx @insforge/cli metadata --json, which returned a structured overview of the project - including configured auth providers, existing tables, storage buckets, available AI models, and realtime channels.
This gave the agent a complete picture of what it was working with before it wrote any code. In the Supabase session, the agent needed multiple MCP calls to piece together a similar understanding - and even then missed critical details like the verify_jwt behavior.
The schema setup ran through 6 CLI commands, all of which succeeded. The agent enabled pgvector, created the documents and chunks tables (with a vector(1536) column), enabled Row Level Security on both, created the access policies, and set up the match_chunks similarity search function. Each command returned structured output confirming what happened, so the agent could verify each step before moving to the next.
The auth and edge function problems from the Supabase session did not occur here. The insforge skill included the correct client library patterns for Next.js, so the agent wired authentication correctly on the first attempt.
The two edge functions - embed-chunks and query-rag - both deployed and ran without errors because the model gateway for embeddings and chat completion was part of the same backend. The agent did not need to integrate OpenAI separately, manage a second API key, or deal with cross-service authentication.
Final InsForge session stats: 1 user message, 77 tool calls, 0 MCP tool calls, 3.7 million tokens, cost $2.81.
Comparative analysis
The Supabase session's token cost was driven by the error retry loop. Each of the 8 edge function redeployments re-sent the entire growing conversation history. The agent checked logs 6 times, redeployed functions 8 times, and tried 6 different authentication strategies before finding the root cause.
None of this was the agent's fault. The Supabase platform's verify_jwt gate was rejecting the token before the function code ran, and the logs did not distinguish between platform-level and code-level rejections. Without that signal, the agent kept attempting to fix the code.
The InsForge session avoided these problems because the skills loaded the correct auth patterns from the start, the CLI gave structured feedback on every operation, and the model gateway meant there was no second service to integrate. The agent did not hit a single error that required debugging.
The core insight: your backend is a context problem
This comparison highlights a problem that goes beyond Supabase specifically.
Most backends were designed for human developers who can read dashboards, interpret ambiguous errors, and mentally track state across multiple services. When an agent takes over that workflow, the assumptions break.
The agent cannot see the dashboard. It cannot tell where an error came from if the logs do not say. And every time it guesses wrong, the token cost compounds.
InsForge is built around a different set of assumptions: the backend exposes its state through structured metadata, the CLI gives the agent programmatic control with clear success and failure signals, and the skills encode the correct patterns so the agent does not have to discover them through trial and error. The model gateway keeps LLM operations inside the same backend, removing the cross-service integration issues that caused most of the Supabase session's debugging.
If your agent is spending tokens discovering how your backend works, guessing at configurations, and retrying operations because error messages do not tell it what went wrong - you are paying for missing context.
The fix is not a better model or a longer context window. It is giving the agent structured information about your backend before it starts writing code.
That is context engineering applied to the backend. Karpathy said it right: filling the context window with the right information is the core skill. The insight from this experiment is that your backend infrastructure is one of the biggest sources of that context - and most of us are not treating it that way.
InsForge is fully open source under Apache 2.0 and can be self-hosted via Docker. The code, skills, and CLI are available on GitHub: github.com/InsForge/InsForge