Every time a coding agent sends a prompt to a cloud LLM, the full content of that prompt — your code, your credentials, your customer names, your internal project codenames — lands on someone else’s server. It may be logged, retained for training, produced in response to subpoena, or exfiltrated in a breach. TLS protects the wire. Nothing protects the content.

We built LLM-Redactor to measure exactly how much leaks and what you can do about it. The paper evaluates eight techniques on a common benchmark. This post is the practitioner’s summary.

The eight options

We identified eight distinct approaches to privacy-preserving LLM requests, spanning “never leave the device” to “homomorphic encryption”:

| Option | What it does | Works today? |
|---|---|---|
| A Local-only | Run inference on a local model (Ollama). Nothing leaves. | Yes |
| B Redact | NER + regex finds PII/secrets, replaces with typed placeholders, restores on response. | Yes |
| C Rephrase | Local model rewrites the prompt to strip implicit identity. | Yes |
| D TEE | Forward to a Trusted Execution Environment (Nitro Enclave). Hardware attestation. | Partial |
| E Split inference | Run first layers locally, send only activations (not tokens). | Research |
| F FHE | Fully homomorphic encryption. Cloud computes on ciphertext. | Research |
| G MPC | Secret-share input across non-colluding servers. | Research |
| H DP noise | Calibrated word-level noise. Good for statistical workloads. | Yes (niche) |

What we measured

We built a benchmark of 1,300 synthetic prompts across four workload classes:

  • WL1 (PII): 500 samples with names, emails, SSNs, addresses
  • WL2 (Secrets): 300 samples with API keys, AWS creds, passwords, PEM keys
  • WL3 (Implicit identity): 200 samples like “the CFO whose wife works at the competitor”
  • WL4 (Code): 300 samples with internal function names, schemas, project codenames

Each sample has ground-truth annotations (4,014 total). We measure how many survive the pipeline — the combined leak rate (exact verbatim match + partial substring match).
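The combined leak rate is straightforward to sketch. A hypothetical reconstruction (not the paper's exact scoring code; the `min_len` substring threshold is an assumption): an annotation counts as an exact leak if it appears verbatim in the pipeline output, and as a partial leak if any sufficiently long substring of it does.

```python
def leak_rate(annotations, output, min_len=6):
    """Sketch of a combined leak metric: exact verbatim matches
    plus partial substring matches against the pipeline output."""
    exact = partial = 0
    for secret in annotations:
        if secret in output:
            exact += 1  # the full annotated value survived verbatim
        elif any(secret[i:i + min_len] in output
                 for i in range(len(secret) - min_len + 1)):
            partial += 1  # a fragment of it survived
    combined = (exact + partial) / len(annotations)
    return exact, partial, combined
```

A fully redacted output scores 0.0; a truncated API key that still shows its prefix scores as a partial leak.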

The headline numbers

| Option | PII | Secrets | Implicit | Code |
|---|---|---|---|---|
| Baseline (no protection) | 100% | 100% | 100% | 100% |
| B (NER + regex) | 15.3% | 31.8% | 95.0% | 58.5% |
| B+C (redact + rephrase) | 13.9% | 31.6% | 94.1% | 55.8% |
| A (local routing) | 6.3% | 24.2% | 46.8% | 59.9% |
| A+B+C (recommended) | 0.6% | 6.4% | 43.6% | 31.3% |

The combination A+B+C — route locally when possible, redact and rephrase the rest — achieves 0.6% combined leak on PII with zero exact leaks across 500 samples. That’s the best practical result in the entire matrix.
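A+B+C is a routing policy more than an algorithm. A minimal sketch of its shape, where every helper (`detect`, `redact`, `rephrase`, `local_llm`, `cloud_llm`) is a hypothetical stand-in for the real component, and the actual decision rule in the paper is richer:

```python
def route(prompt, detect, redact, rephrase, local_llm, cloud_llm):
    """Sketch of the A+B+C policy: answer locally when the prompt
    carries what redaction can't fix, otherwise redact and rephrase
    before sending to the cloud, then restore placeholders."""
    findings = detect(prompt)
    if any(f["kind"] in ("secret", "implicit_identity") for f in findings):
        return local_llm(prompt)           # Option A: never leaves the device
    scrubbed, mapping = redact(prompt)     # Option B: typed placeholders
    scrubbed = rephrase(scrubbed)          # Option C: strip implicit cues
    reply = cloud_llm(scrubbed)
    for placeholder, original in mapping.items():
        reply = reply.replace(placeholder, original)
    return reply
```

The local route handles the categories that content-level transforms cannot protect; everything else pays only the redaction cost.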

The implicit identity problem

The hardest finding: implicit identity survives everything we throw at it. Phrases like “the CFO of Acme Corp whose wife works at the competitor” have no PII token to redact. Even the rephrase model preserves the structural relationships because they are the content of the prompt.

We built a semantic leak metric — a local judge model that reads the redacted text and determines if the original person/org is still identifiable. Results:

| Option | Semantic leak rate |
|---|---|
| Baseline | 95% |
| B (NER) | 95% |
| B+C (rephrase) | 100% |

Content-level transformations can remove tokens but not meaning. For implicit identity, your options are: (a) never send it to the cloud (Option A), (b) use a TEE (Option D), or (c) accept the residual risk.

Using it

LLM-Redactor works as an HTTP proxy (transparent to the agent) or an MCP server (explicit tools). Four modes:

1. Transparent proxy

uv run llm-redactor serve --port 7789
export OPENAI_API_BASE=http://localhost:7789/v1
# Your agent now routes through the redactor automatically

2. MCP tools

Add to your MCP config and the agent gets redact.scrub, redact.restore, redact.detect, and llm.chat tools:

{
  "mcpServers": {
    "llm-redactor": {
      "command": "uv",
      "args": [
        "--directory", "/path/to/llm-redactor",
        "run", "llm-redactor", "mcp",
        "--config", "/path/to/llm-redactor/examples/max-privacy.yaml"
      ]
    }
  }
}

The llm.chat tool is a drop-in: the agent calls it instead of the LLM directly, and scrub/restore happens internally. The agent never sees placeholders.
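The round-trip inside llm.chat pairs a scrub step, which records a placeholder map, with a restore step applied to the model's reply. A minimal sketch using a single hypothetical email pattern (the real detector covers 35+ families, described below):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub(text):
    """Replace each distinct email with a typed placeholder; return
    the scrubbed text plus the map needed to restore the reply."""
    mapping = {}
    def sub(match):
        value = match.group(0)
        if value not in mapping.values():
            mapping[f"[EMAIL_{len(mapping) + 1}]"] = value
        # reuse the placeholder already assigned to this value
        return next(k for k, v in mapping.items() if v == value)
    return EMAIL.sub(sub, text), mapping

def restore(reply, mapping):
    """Swap placeholders in the model's reply back to real values."""
    for placeholder, value in mapping.items():
        reply = reply.replace(placeholder, value)
    return reply
```

Because the same value always maps to the same placeholder, the cloud model can still reason about repeated entities without ever seeing them.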

3. Pre-tool hook

A Claude Code hook that warns when sensitive content is about to leave through any tool call — a safety net alongside any of the other modes.

4. Belt and suspenders

Run the proxy AND load the MCP tools. The proxy catches everything silently, while the MCP tools let the agent inspect what was caught.

The detector

35+ regex pattern families covering:

  • PII: email, phone (US + intl), SSN, IPv4/v6, credit cards
  • Cloud keys: AWS (access/secret/session), GCP, Azure
  • Vendor API keys: OpenAI, Anthropic, GitHub, GitLab, Slack, Stripe, Twilio, SendGrid, npm, PyPI
  • Generic: passwords, JWTs, bearer/basic auth, PEM/SSH/PGP keys, connection strings

Plus Presidio NER for person/org/location names, with a false-positive suppression layer (drug names, abbreviations, generic words) and an optional LLM validation pass that sends each NER span to a local model for a KEEP/DROP verdict.
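A few of these families can be sketched as typed patterns. These regexes are illustrative, not the project's actual rule set (real key formats vary by vendor and over time):

```python
import re

# Illustrative pattern families; the real detector ships 35+ of these.
PATTERNS = {
    "EMAIL":   re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":     re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "AWS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),       # access key ID shape
    "JWT":     re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+\b"),  # base64url header
}

def detect(text):
    """Return (kind, matched_value) findings for every pattern hit."""
    findings = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            findings.append((kind, m.group(0)))
    return findings
```

The kind label is what makes typed placeholders like `[AWS_KEY_1]` possible, which in turn lets restoration be exact on the way back.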

What it costs

Redaction actually reduces token count by 4–12% (placeholders are shorter than emails and API keys). The latency overhead:

| Option | Median latency |
|---|---|
| B (NER) | ~20ms |
| B+H (DP noise) | ~20ms |
| B+C (rephrase) | ~2s |
| A (local routing) | ~2s |

For a utility evaluation, we ran a judge-model A/B comparison (n=50 per workload). The baseline response is preferred ~78% of the time: redaction does cost some answer quality, but for privacy-first workloads that cost buys the leak reductions above.

The paper

The full evaluation with all eight options, four workloads, semantic leak analysis, epsilon sensitivity for DP noise, cross-family judge bias controls, and a decision rule for practitioners:

arXiv:2604.12064

Code, benchmarks, and configs: github.com/jayluxferro/llm-redactor


This is part of the LLM Agents series. Previously: Resilient Write.