Local-Splitter: Cutting Cloud LLM Costs by Putting a Small Model in Front
Apr 15, 2026 · Tools · MCP, LLM Agents, Python, Ollama

Cloud LLM tokens are expensive. Not in the “my AWS bill is high” sense, but in the “I’m burning $0.015 per 1K output tokens and my coding agent sends 200+ requests per session” sense. Most of those requests don’t need a frontier model. “What does this function return?” doesn’t need Claude Opus. “Add a docstring here” doesn’t need GPT-5.
Local-Splitter is an open-source shim that sits between your coding agent and the cloud. A 3B parameter model running locally on Ollama triages every request: trivial ones get answered locally (zero cloud tokens), and complex ones get their prompts compressed before forwarding. The paper is now on arXiv: 2604.12301.
## Seven tactics, tested in combination
The core contribution isn’t any single tactic — it’s the measurement of how they compose. We implemented seven:
| # | Tactic | What it does |
|---|---|---|
| T1 | Route | Classify requests as TRIVIAL/COMPLEX. Trivials answered locally. |
| T2 | Compress | Shorten long prompts before they reach the cloud. |
| T3 | Sem-cache | Semantic similarity cache. Near-duplicates return cached responses. |
| T4 | Draft | Local model drafts; cloud reviews/patches instead of generating from scratch. |
| T5 | Diff | Extract minimal diff context for code edits. |
| T6 | Intent | Parse verbose prompts into structured intent fields. |
| T7 | Batch | Tag stable prefixes with cache_control for vendor-side discounts. |
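For T7 specifically, the work is just request-body metadata: a minimal sketch of tagging a stable system prefix, using Anthropic-style `cache_control` field names from the public Messages API (the helper itself is illustrative, not local-splitter's actual code):

```python
def tag_stable_prefix(system_text: str, user_text: str) -> dict:
    """Build an Anthropic-style request body whose system prompt is
    marked for vendor-side prompt caching (T7). Sketch only."""
    return {
        "system": [{
            "type": "text",
            "text": system_text,
            # Stable across requests, so the vendor can cache it at a discount.
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{"role": "user", "content": user_text}],
    }
```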
Every tactic fails open — if the local model is unreachable or returns garbage, the request passes through to the cloud unchanged.
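The route-plus-fallback shape can be sketched in a few lines. The function names here are illustrative, not local-splitter's API, but they show T1 combined with the fail-open rule:

```python
from concurrent.futures import ThreadPoolExecutor

def route(prompt, classify_locally, answer_locally, forward_to_cloud,
          timeout_s=2.0):
    """T1 routing with fail-open: any local-model failure (timeout,
    crash, unreachable Ollama) passes the request through unchanged."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        verdict = pool.submit(classify_locally, prompt).result(timeout=timeout_s)
    except Exception:
        return forward_to_cloud(prompt)  # fail open: raw prompt, no rewrite
    finally:
        pool.shutdown(wait=False)        # never block on a hung local call
    if verdict == "TRIVIAL":
        return answer_locally(prompt)    # zero cloud tokens
    return forward_to_cloud(prompt)
```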
## The numbers
We evaluated with llama3.2:3b (local triage) and gemma3:4b (simulated cloud), across four workload classes representative of real coding-agent sessions. Ten samples per workload, mean of two runs.
| Config | Edit-heavy | Explain | Chat | RAG | Average |
|---|---|---|---|---|---|
| T1 only | 29% | 69% | 59% | 38% | 49% |
| T1+T2 | 45% | 79% | 57% | 44% | 56% |
| T1+T2+T3 | 43% | 80% | 60% | 44% | 56% |
| T1+T2+T3+T4+T5 | 29% | 72% | 59% | 51% | 53% |
The headline: routing plus compression (T1+T2) gets you 56% average cloud token savings. Adding caching (T3) doesn’t change the average much, but compounds over time with repeated queries. The full five-tactic stack wins on RAG workloads (51%) where long retrieved contexts dominate.
## Quality cost
Token savings mean nothing if the answers are worse. We ran pairwise A/B quality evaluation with a judge model comparing tactic-processed responses against baseline (full cloud) responses.
The honest finding: the baseline wins roughly 3x more judge verdicts on explanation-heavy workloads. On edit and RAG workloads, quality differences are within noise. This is the expected tradeoff — a 3B model classifying “explain the visitor pattern” as TRIVIAL will produce a worse answer than Claude. But “what is 2+2” and “add a return type annotation here” are fine locally.
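The post doesn't show the evaluation harness, but a pairwise judge loop commonly swaps presentation order to cancel position bias; a sketch under that assumption, where `judge` stands in for a call to the judge model and only self-consistent verdicts count as wins:

```python
def pairwise_eval(pairs, judge):
    """Tally judge verdicts over (baseline, tactic) response pairs.
    Each pair is judged twice with A/B order swapped; a disagreement
    between the two passes is counted as a tie. Sketch only."""
    wins = {"baseline": 0, "tactic": 0, "tie": 0}
    for baseline, tactic in pairs:
        first = judge(baseline, tactic)   # baseline shown as A
        second = judge(tactic, baseline)  # order swapped
        if first == "A" and second == "B":
            wins["baseline"] += 1
        elif first == "B" and second == "A":
            wins["tactic"] += 1
        else:
            wins["tie"] += 1              # position-biased or inconsistent
    return wins
```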
The practical takeaway: start with T1+T2 (recommended preset), monitor quality on your specific workload, and only add more tactics if the savings justify it.
## Three ways to deploy
Local-splitter supports three deployment modes, from fully transparent to explicit:
### HTTP proxy (agent doesn’t know)
Point any OpenAI-compatible agent at http://127.0.0.1:7788/v1. The proxy speaks both OpenAI and Anthropic API formats with streaming.
```shell
uv run local-splitter serve-http --config config.yaml
OPENAI_API_BASE=http://127.0.0.1:7788/v1 codex
```
### MCP server (agent-aware)
Register as an MCP tool. The agent calls split.transform before sending prompts. No separate cloud backend is needed; the calling agent itself plays the role of the cloud model.
```json
{
  "mcpServers": {
    "local-splitter": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/local-splitter",
               "local-splitter", "serve-mcp", "--config", "config.yaml"]
    }
  }
}
```
### CLI transform (hook-based)
One-shot command for agent hooks. Pipe a prompt in, get JSON out.
```shell
echo "what is 2+2" | local-splitter transform -c config.yaml
# → {"action": "answer", "response": "4", "served_by": "local"}
```
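A hook consuming that JSON might dispatch like the sketch below. The `action`, `response`, and `served_by` fields come from the example above; the `prompt` field on forwarded results is an assumption:

```python
import json

def handle(result_json: str, send_to_cloud):
    """Dispatch on the transform verdict: 'answer' means the local model
    already produced a response; anything else forwards to the cloud.
    Sketch only; field names beyond the documented ones are assumed."""
    result = json.loads(result_json)
    if result.get("action") == "answer" and result.get("served_by") == "local":
        return result["response"]            # zero cloud tokens
    return send_to_cloud(result.get("prompt", ""))
```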
## What I learned building this
A few things that didn’t make it into the paper:
Fail-open is non-negotiable. Early prototypes would hang if Ollama was slow to respond. Every tactic now has a timeout and defaults to passing the request through unchanged. The user should never notice the splitter failing — only succeeding.
Prompt compression is underrated. T2 alone accounts for most of the savings in the recommended config. System prompts and conversation history are full of redundancy that a small model can strip without losing semantic content. The cloud model doesn’t care about your five paragraphs of “you are a helpful assistant.”
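As a crude deterministic stand-in for the idea (the real T2 asks the local model to rewrite the prompt), simply dropping middle-of-conversation turns illustrates where the redundancy lives:

```python
def compress_history(messages, keep_last=4):
    """Keep the system prompt and the most recent turns; drop the middle.
    A toy stand-in for T2, which uses a local model to rewrite prompts."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]
```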
Semantic caching needs careful tuning. T3 with default similarity thresholds produces false positives on code-related queries — “explain merge sort” and “explain quick sort” are close in embedding space but very different questions. We ship conservative thresholds (0.95+) by default.
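A toy version makes the false-positive problem concrete. With bag-of-words counts standing in for a real embedding model, “explain merge sort” and “explain quick sort” score about 0.67 cosine similarity, so a loose threshold like 0.6 would cross-serve them while 0.95 keeps them apart (the class below is a sketch, not local-splitter's cache):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real cache would use a model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemCache:
    def __init__(self, threshold=0.95):    # conservative default, per the post
        self.threshold = threshold
        self.entries = []                  # (embedding, response) pairs

    def get(self, prompt):
        e = embed(prompt)
        for cached_e, response in self.entries:
            if cosine(e, cached_e) >= self.threshold:
                return response
        return None                        # miss: go through the tactic chain

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))
```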
## Links
- GitHub: github.com/jayluxferro/local-splitter
- Paper: Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads — arXiv:2604.12301
- Related: Resilient Write — the durable write layer used during development
MIT licensed. Requires Python 3.12+, uv, and Ollama.