Cloud LLM tokens are expensive. Not in the “my AWS bill is high” sense — in the “I’m burning $0.015 per 1K output tokens and my coding agent sends 200+ requests per session” sense. Most of those requests don’t need a frontier model. “What does this function return?” doesn’t need Claude Opus. “Add a docstring here” doesn’t need GPT-5.

Local-Splitter is an open-source shim that sits between your coding agent and the cloud. A 3B-parameter model running locally on Ollama triages every request: trivial ones get answered locally (zero cloud tokens), and complex ones get their prompts compressed before forwarding. The paper is now on arXiv: 2604.12301.

## Seven tactics, tested in combination

The core contribution isn’t any single tactic — it’s the measurement of how they compose. We implemented seven:

| # | Tactic | What it does |
|----|-----------|--------------|
| T1 | Route | Classify requests as TRIVIAL/COMPLEX. Trivial requests are answered locally. |
| T2 | Compress | Shorten long prompts before they reach the cloud. |
| T3 | Sem-cache | Semantic similarity cache. Near-duplicates return cached responses. |
| T4 | Draft | Local model drafts; cloud reviews/patches instead of generating from scratch. |
| T5 | Diff | Extract minimal diff context for code edits. |
| T6 | Intent | Parse verbose prompts into structured intent fields. |
| T7 | Batch | Tag stable prefixes with cache_control for vendor-side discounts. |
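The control flow of T1 can be sketched in a few lines. This is a simplified illustration, not the shim's actual interface: the `classify`, `answer_local`, and `forward_cloud` callables are hypothetical stand-ins for the local-model and cloud calls.

```python
from typing import Callable

def route(prompt: str,
          classify: Callable[[str], str],
          answer_local: Callable[[str], str],
          forward_cloud: Callable[[str], str]) -> str:
    """Hypothetical T1 triage: a small local model labels the request;
    trivial ones are served locally, everything else goes to the cloud."""
    try:
        label = classify(prompt).strip().upper()
    except Exception:
        label = "COMPLEX"  # fail open: any triage error means pass-through
    if label == "TRIVIAL":
        return answer_local(prompt)
    return forward_cloud(prompt)

# Stub callables, just to show the dispatch:
print(route("what is 2+2",
            classify=lambda p: "TRIVIAL",
            answer_local=lambda p: "4 (local)",
            forward_cloud=lambda p: "cloud"))  # → 4 (local)
```

Note the defensive default: if the classifier errors out, the request is treated as COMPLEX and forwarded untouched, which is the fail-open behavior described below.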

Every tactic fails open — if the local model is unreachable or returns garbage, the request passes through to the cloud unchanged.
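A minimal sketch of that fail-open pattern, assuming a hypothetical `transform` callable and a single fixed deadline (the shim's real timeouts are per-tactic):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fail_open(transform, request, timeout_s=2.0):
    """Run a tactic with a deadline; on timeout or error, return the
    request unchanged so it passes through to the cloud untouched."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(transform, request).result(timeout=timeout_s)
    except Exception:
        return request  # fail open: timeout or crash both pass through
    finally:
        pool.shutdown(wait=False)  # don't block on a hung local model

# A slow tactic times out and the original request survives:
slow = lambda req: (time.sleep(1), "compressed")[1]
print(fail_open(slow, "original prompt", timeout_s=0.1))  # → original prompt
```

The `shutdown(wait=False)` matters: blocking on a stuck worker at cleanup would reintroduce exactly the hang the deadline was meant to prevent.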

## The numbers

We evaluated with llama3.2:3b (local triage) and gemma3:4b (simulated cloud), across four workload classes representative of real coding-agent sessions. Ten samples per workload, mean of two runs.

| Config | Edit-heavy | Explain | Chat | RAG | Average |
|--------|-----------|---------|------|-----|---------|
| T1 only | 29% | 69% | 59% | 38% | 49% |
| T1+T2 | 45% | 79% | 57% | 44% | 56% |
| T1+T2+T3 | 43% | 80% | 60% | 44% | 56% |
| T1+T2+T3+T4+T5 | 29% | 72% | 59% | 51% | 53% |

The headline: routing plus compression (T1+T2) gets you 56% average cloud token savings. Adding caching (T3) doesn’t change the average much, but compounds over time with repeated queries. The full five-tactic stack wins on RAG workloads (51%) where long retrieved contexts dominate.

## Quality cost

Token savings mean nothing if the answers are worse. We ran pairwise A/B quality evaluation with a judge model comparing tactic-processed responses against baseline (full cloud) responses.

The honest finding: the baseline wins roughly 3x more judge verdicts on explanation-heavy workloads. On edit and RAG workloads, quality differences are within noise. This is the expected tradeoff — a 3B model classifying “explain the visitor pattern” as TRIVIAL will produce a worse answer than Claude. But “what is 2+2” and “add a return type annotation here” are fine locally.

The practical takeaway: start with T1+T2 (recommended preset), monitor quality on your specific workload, and only add more tactics if the savings justify it.

## Three ways to deploy

Local-Splitter supports three deployment modes, from fully transparent to explicit:

### HTTP proxy (agent doesn’t know)

Point any OpenAI-compatible agent at http://127.0.0.1:7788/v1. The proxy speaks both OpenAI and Anthropic API formats with streaming.

```shell
uv run local-splitter serve-http --config config.yaml
OPENAI_API_BASE=http://127.0.0.1:7788/v1 codex
```

### MCP server (agent-aware)

Register as an MCP tool. The agent calls split.transform before sending prompts. No cloud backend needed — the agent IS the cloud model.

```json
{
  "mcpServers": {
    "local-splitter": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/local-splitter",
               "local-splitter", "serve-mcp", "--config", "config.yaml"]
    }
  }
}
```

### CLI transform (hook-based)

One-shot command for agent hooks. Pipe a prompt in, get JSON out.

```shell
echo "what is 2+2" | local-splitter transform -c config.yaml
# → {"action": "answer", "response": "4", "served_by": "local"}
```

## What I learned building this

A few things that didn’t make it into the paper:

Fail-open is non-negotiable. Early prototypes would hang if Ollama was slow to respond. Every tactic now has a timeout and defaults to passing the request through unchanged. The user should never notice the splitter failing — only succeeding.

Prompt compression is underrated. T2 alone accounts for most of the savings in the recommended config. System prompts and conversation history are full of redundancy that a small model can strip without losing semantic content. The cloud model doesn’t care about your five paragraphs of “you are a helpful assistant.”
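A toy version of the idea — this is not the shim's actual T2, which uses the local model, but it shows how much redundancy falls to even a dumb pass that normalizes whitespace and drops verbatim-duplicate paragraphs:

```python
import re

def compress_prompt(text: str) -> str:
    """Naive compression sketch: collapse runs of whitespace and drop
    paragraphs that repeat verbatim (case- and spacing-insensitive).
    A real T2 would summarize with the local model instead."""
    seen, kept = set(), []
    for para in re.split(r"\n\s*\n", text):
        normalized = re.sub(r"\s+", " ", para).strip()
        key = normalized.lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(normalized)
    return "\n\n".join(kept)

prompt = "You are a helpful assistant.\n\nYou are   a helpful assistant.\n\nAnswer concisely."
print(compress_prompt(prompt))  # → You are a helpful assistant.\n\nAnswer concisely.
```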

Semantic caching needs careful tuning. T3 with default similarity thresholds produces false positives on code-related queries — “explain merge sort” and “explain quick sort” are close in embedding space but very different questions. We ship conservative thresholds (0.95+) by default.
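The failure mode is easy to reproduce with cosine similarity on toy vectors. The embedding values below are made up to mimic two near-neighbor queries; real embeddings are higher-dimensional, but the threshold arithmetic is the same:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: "explain merge sort" vs "explain quick sort"
# land close together because the queries share most of their tokens.
merge_sort = [0.80, 0.60]
quick_sort = [0.95, 0.31]

sim = cosine(merge_sort, quick_sort)
print(sim >= 0.90)  # True  — a loose threshold returns the wrong cached answer
print(sim >= 0.95)  # False — the conservative default treats it as a miss
```

With a 0.90 threshold the second query would be served merge sort's cached response; at 0.95+ it correctly misses and goes through the normal pipeline.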

MIT licensed. Requires Python 3.12+, uv, and Ollama.