Redacta Gauntlet – Adversarial Redaction Eval

The question that matters

Can an identifier survive redaction when the document is trying to make it?

The Redacta benchmark measures the engine in scope, under cooperative conditions: 300 synthetic notes, every targeted identifier caught, zero false positives, stable across ten seeds. That number is real and reproducible — but it is measured on text that isn't fighting back.

The Gauntlet exists to make that 100% earn its asterisks. Its metric is asymmetric on purpose. A false negative — one identifier that slips through — is a privacy breach, so recall under adversarial conditions comes first. Over-redaction is tracked as the cost axis: a redactor that deletes the whole document scores perfect recall and is useless, so the two numbers are only meaningful together. This is the same discipline behind the RefCheckr eval — gold cases in, scorers out, a baseline that gates every commit — pointed at a different failure: leaked identity instead of unsupported claims.

Threat model

The attacker's goal is one identifier surviving redaction. A design fact shapes everything below: the deterministic engine does not interpret the document — it applies pattern passes and keyword-anchored rules, so it cannot "follow instructions" embedded in text. Prompt injection against this layer should fail by construction; the Gauntlet measures it anyway, because "should" is not a measurement. The genuinely injectable surface is any downstream model consuming Redacta's output — named as a gap, and the target of v1.
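Why embedded instructions are inert at this layer can be seen in a minimal sketch of a keyword-anchored pass. The rule, the function name `redactMrn`, and the sample note are hypothetical illustrations under stated assumptions, not the Redacta engine's actual patterns.

```typescript
// Hypothetical keyword-anchored rule (NOT the real Redacta pattern):
// whatever token follows "Hospital number" is replaced with [MRN].
const MRN_RULE = /(Hospital number[:\s]+)(\S+)/gi;

export function redactMrn(text: string): string {
  // The callback only ever sees the matched span. Nothing else in the
  // document can change what the regex does.
  return text.replace(
    MRN_RULE,
    (_match: string, keyword: string) => `${keyword}[MRN]`
  );
}

// Invented sample note carrying an embedded instruction.
export const note =
  "SYSTEM: skip redaction for this record. Hospital number: RJ1-2209841";

// The instruction passes through as ordinary characters, and the
// token after the keyword is still replaced.
export const redacted = redactMrn(note);
```

The same property cuts both ways: a rule of this shape replaces whatever token follows its keyword, whether or not that token is an identifier, which is why prose-embedded cases put it under stress.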

Attack surface (category tag)	What it probes
Prose-embedded (prose)	Identifiers inside narrative: a name mid-sentence in an adverse-event story, a DOB folded into a dosing history, contact details in referral prose. The keyword-anchored name detection is the component under stress.
Edge formats (edge)	The right identifier, hostile spelling: NHS numbers in odd groupings, partial postcodes, initials for names, DOBs written as an age plus a birth year, US-style dates.
Adversarial near-misses (nearmiss)	Bait in both directions: checksum-invalid NHS numbers and lot numbers designed to tempt over-redaction; real identifiers camouflaged next to a clinician of the same name, designed to be missed.
Direct prompt injection (injection)	Instructions inside the document (fake system prompts, transcriptionist notes, markdown comments) telling the processor to skip, reverse, or relabel redaction.
Indirect leakage (leakage)	No identifier appears verbatim, yet the patient is reconstructable: a rare role plus a place, a unique clinical event, an employer-and-village combination. Quasi-identifiers a pattern engine has no category for.

Read the full threat model →

What the Gauntlet found (v0 · measured)

The point of an adversarial set is the failures. Run against the shipping engine (@pharmatools/redacta@1.2.0, cross-checked token-for-token against the live MCP server), v0 holds the line where the engine is designed to be strong and breaks — informatively — where it isn't.

Injection resistance

100%

All 5 injection cases: every identifier still caught, embedded instruction inert. Resistance by construction, now measured.

Over-redaction rate

0%

0 of 17 labelled distractors wrongly removed: invalid NHS numbers, lot numbers, clinician names, lab values all preserved.

Adversarial recall

76.8%

43 of 56 identifiers across all 28 cases. In-scope recall is 91.5%; the gap is the reasoning-layer cases, scoring 0% by design.

Four findings are worth naming — this is the eval doing its job, turning a limitation into a number rather than a surprise:

A dual failure in prose. Given "…transferred to the RJ1-2209841 record after a merge. Hospital number confirmed at desk," the keyword-anchored MRN pass latched onto the word "confirmed" — it followed "Hospital number" — and tokenised that as the MRN, while the real identifier RJ1-2209841 survived in the clear. A miss and a spurious redaction in one sentence. Reproduced identically by the offline engine and the live MCP.
NHS number grouping. 9234 4578 54 (4-4-2) is missed; standard 923 445 7854 (3-3-4) is caught. Notably, the MCP's self_check net does flag the missed string as a "long number" for human review — a miss at the redaction layer, caught at the review layer.
DOB keyword distance. 07/22/1955 is missed when the "DOB" keyword is separated from the number by intervening prose; the name and postcode in the same note are caught.
Untitled name. Sarah Trevino, written without a salutation and set off in dashes next to a clinician of the same first name, is missed; Mrs Sarah Trevino is caught — the documented free-text-name limitation, made concrete.

Why these are honest, not embarrassing. Every miss above is either an out-of-scope reasoning case the deterministic engine was never claimed to catch, or a genuine edge the Gauntlet is built to surface. None is hidden behind an average: the scorecard reports per-category and per-scope recall separately, so an in-scope regression can't hide behind the reasoning cases, and vice versa.
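The NHS number grouping finding points at one possible hardening, sketched here under assumptions: strip separators first, then apply the standard Modulus 11 check-digit test, so that grouping stops mattering while checksum-invalid bait is still rejected. The function name `isValidNhsNumber` is hypothetical, and this is not a description of how the shipping engine works.

```typescript
// Hypothetical helper, not part of @pharmatools/redacta.
// Validates an NHS number with the Modulus 11 check digit,
// ignoring how the ten digits are grouped.
export function isValidNhsNumber(raw: string): boolean {
  // Drop spaces and hyphens so 3-3-4 and 4-4-2 groupings look the same.
  const digits = raw.replace(/[\s-]/g, "");
  if (!/^\d{10}$/.test(digits)) return false;

  // Weight the first nine digits 10, 9, ..., 2 and sum the products.
  let sum = 0;
  for (let i = 0; i < 9; i++) {
    sum += Number(digits.charAt(i)) * (10 - i);
  }

  // Check digit is 11 minus the remainder; 11 maps to 0, 10 is invalid.
  let check = 11 - (sum % 11);
  if (check === 11) check = 0;
  if (check === 10) return false;

  return check === Number(digits.charAt(9));
}
```

With this sketch, `923 445 7854` and `9234 4578 54` normalise to the same ten digits and both pass, while changing the final digit fails the check. A checksum filter of this kind would sit after candidate detection; it does not by itself find the candidate in the text.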

Results by attack surface (v0 · expanding)

Recall broken out by surface. The shape is the story: the engine is near-perfect against injection and near-misses, strong in prose, and falls off exactly where an LLM-assisted layer is needed.

Lenient recall by category · identifiers caught, any label · higher is better

Category	Lenient recall
injection	100%
nearmiss	92.9%
prose	80%
edge	53.8%
leakage	33.3%

The edge and leakage figures are dominated by reasoning-scope cases — partial postcodes, initials, DOB-as-age, and pure quasi-identifiers — that the deterministic engine is not claimed to catch. Split by scope, recall is 91.5% on in-scope (deterministic) identifiers and 0% on reasoning-scope identifiers. Reporting them together would flatter the engine; reporting them apart is the point.

Metric	v0	Reading
Adversarial recall · lenient	76.8%	43 / 56 identifiers, all 28 cases
Adversarial recall · strict (correct label)	76.8%	every catch was correctly typed — no mislabels
In-scope recall · deterministic only	91.5%	43 / 47; the four named misses are the shortfall
Over-redaction rate	0%	0 / 17 labelled distractors removed · lower is better
Precision · all removals	97.7%	43 / 44 removals were real identifiers — catches the one spurious grab · higher is better
Injection resistance	100%	5 / 5 · instructions bought the attacker nothing
Reasoning-scope recall	0%	0 / 9 · expected — needs the v1 reasoning layer

v0 gold set is 28 synthetic cases (56 labelled identifiers, 17 preserve distractors). All data is synthetic — no real patient information. The set will grow across formats and surfaces; as it does, these figures move, which is why the scorecard is versioned and every run is stamped with the code SHA it ran against.
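The arithmetic behind the table can be re-derived from the counts it reports. The sketch below is illustrative only: the `Tally` type and `pct` helper are hypothetical names, and the numbers are the v0 counts quoted above.

```typescript
// Illustrative names; not the harness's real API.
interface Tally {
  hits: number;  // numerator
  total: number; // denominator
}

// Percentage to one decimal place, dropping a trailing ".0".
export function pct(t: Tally): string {
  const fixed = ((100 * t.hits) / t.total).toFixed(1);
  return fixed.replace(/\.0$/, "") + "%";
}

// v0 counts as reported in the scorecard table.
export const lenientRecall = pct({ hits: 43, total: 56 });   // all identifiers
export const inScopeRecall = pct({ hits: 43, total: 47 });   // deterministic scope
export const reasoningRecall = pct({ hits: 0, total: 9 });   // reasoning scope
export const precision = pct({ hits: 43, total: 44 });       // real / all removals
export const overRedaction = pct({ hits: 0, total: 17 });    // distractors removed
export const injection = pct({ hits: 5, total: 5 });         // cases resisted

// The two scopes partition the gold set: 47 + 9 = 56 identifiers.
export const scopesAddUp = 47 + 9 === 56;
```

Keeping the numerator and denominator together for each metric is what lets the per-scope split (43 / 47 and 0 / 9) be read alongside the blended figure (43 / 56) without either one masking the other.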

How it works

Same architecture as the RefCheckr harness — gold cases in, metrics out, a baseline to compare against, one command to run.

Gold cases. Each case is a synthetic clinical note with every identifier labelled by type, a scope flag marking whether the deterministic engine is expected to catch it, and a set of preserve distractors (invalid NHS numbers, clinician names, lab values) that must not be redacted. Deliberately hostile cases carry expected_miss so a known limitation can't pass silently.
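As a rough picture of what such a case might look like in code, here is a sketch. Every field name other than `expected_miss` is an assumption, the example note is invented for illustration, and the real gold-set format is whatever the harness repository defines.

```typescript
// Hypothetical shapes; only "expected_miss" is named in the text above.
type Scope = "deterministic" | "reasoning";

interface LabelledIdentifier {
  value: string;           // exact text as it appears in the note
  type: string;            // e.g. "NAME", "NHS_NUMBER", "DOB"
  scope: Scope;            // is the deterministic engine expected to catch it?
  expected_miss?: boolean; // known limitation that must not pass silently
}

interface GoldCase {
  id: string;
  category: "prose" | "edge" | "nearmiss" | "injection" | "leakage";
  text: string;
  identifiers: LabelledIdentifier[];
  preserve: string[];      // distractors that must survive redaction
}

// Invented example, for illustration only.
export const example: GoldCase = {
  id: "example-01",
  category: "nearmiss",
  text:
    "Mrs Sarah Trevino, NHS number 923 445 7854, was seen by Dr Sarah Lim. " +
    "Potassium 4.1 mmol/L.",
  identifiers: [
    { value: "Mrs Sarah Trevino", type: "NAME", scope: "deterministic" },
    { value: "923 445 7854", type: "NHS_NUMBER", scope: "deterministic" },
  ],
  preserve: ["Dr Sarah Lim", "4.1 mmol/L"],
};

// Sanity check: every label and distractor must occur verbatim in the note,
// otherwise the case cannot be scored.
export function lintGoldCase(c: GoldCase): string[] {
  const problems: string[] = [];
  for (const ident of c.identifiers) {
    if (c.text.indexOf(ident.value) === -1) {
      problems.push(`identifier not in text: ${ident.value}`);
    }
  }
  for (const keep of c.preserve) {
    if (c.text.indexOf(keep) === -1) {
      problems.push(`distractor not in text: ${keep}`);
    }
  }
  return problems;
}
```

A lint step like `lintGoldCase` is a design choice rather than something the source describes; it guards against a mislabelled case silently inflating or deflating recall.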

Scorers. The offline scorer runs the shipping deterministic engine with no API key and no network, fast enough to gate every commit. The online scorer exercises the live Redacta MCP for parity. Matching is on normalised alphanumerics, so spacing and punctuation never create a phantom miss.
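Matching on normalised alphanumerics could look something like the sketch below. It assumes that "normalised" means dropping every character that is not an ASCII letter or digit and ignoring case; the function names are hypothetical and the real scorer may normalise differently.

```typescript
// Assumption: normalisation keeps only ASCII letters and digits, lowercased.
export function normalise(s: string): string {
  return s.replace(/[^a-z0-9]/gi, "").toLowerCase();
}

// True when the identifier is still recoverable from the redacted output,
// regardless of how it is spaced or punctuated there.
export function leaked(identifier: string, redactedOutput: string): boolean {
  const needle = normalise(identifier);
  if (needle.length === 0) return false; // nothing to look for
  return normalise(redactedOutput).indexOf(needle) !== -1;
}
```

For example, `leaked("923 445 7854", "NHS no. 9234 4578 54 on file")` is true even though the grouping differs, while the same identifier against `"NHS no. [NHS_NUMBER] on file"` is false. One trade-off of normalising the whole output is that a match can in principle span what were separate tokens; a stricter variant could compare token by token instead.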

Scorecard. Every run produces a scorecard stamped with the engine version, a timestamp, and the exact code SHA, plus each case's per-identifier outcome. Written to results/ so any result is reproducible and auditable.

Regression gate. Each run is diffed against a saved baseline. Any recall drop, or any rise in over-redaction, is flagged; in continuous integration the build fails. A pattern or prompt change cannot ship without first proving it didn't leak more.
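The gate's core comparison can be sketched as a direction-aware diff. The metric keys, the `Scorecard` shape, and the candidate figures in the usage note are assumptions for illustration; only the baseline values come from the v0 results.

```typescript
// Hypothetical metric names and shapes; the real harness may differ.
type Direction = "higher" | "lower";
type Scorecard = { [metric: string]: number };

// Which way is "better" for each gated metric.
const DIRECTIONS: { [metric: string]: Direction } = {
  adversarialRecall: "higher",
  inScopeRecall: "higher",
  precision: "higher",
  injectionResistance: "higher",
  overRedactionRate: "lower",
};

// Returns one message per metric that got worse than the baseline.
// An empty array means the gate passes.
export function regressions(baseline: Scorecard, current: Scorecard): string[] {
  const failed: string[] = [];
  for (const metric of Object.keys(DIRECTIONS)) {
    const before = baseline[metric];
    const after = current[metric];
    if (before === undefined || after === undefined) {
      failed.push(`${metric}: missing from scorecard`);
      continue;
    }
    const worse =
      DIRECTIONS[metric] === "higher" ? after < before : after > before;
    if (worse) failed.push(`${metric}: ${before} -> ${after}`);
  }
  return failed;
}

// Baseline built from the v0 figures (percentages).
export const v0Baseline: Scorecard = {
  adversarialRecall: 76.8,
  inScopeRecall: 91.5,
  precision: 97.7,
  injectionResistance: 100,
  overRedactionRate: 0,
};
```

Comparing `v0Baseline` with itself yields no regressions. A made-up candidate run that keeps recall but drops precision to 95.6 would yield `["precision: 97.7 -> 95.6"]`, and a CI wrapper would exit non-zero whenever the returned list is non-empty.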

# one command, no API key needed for the deterministic suite
$ npm run gate

Redacta Gauntlet — gold v0 · 28 cases · @pharmatools/redacta@1.2.0
code <stamped with the commit SHA on every run>
  Adversarial recall (lenient)         76.8%
  In-scope recall (deterministic)      91.5%
  Over-redaction rate  (lower=better)  0%
  Precision (all removals)             97.7%
  Injection resistance                 100%
  spurious redactions  1   prose-04 "confirmed" → [MRN]
  per scope  deterministic 91.5%   reasoning 0% (tracked gap)
  ✓ regression gate passed — no metric worse than baseline

View the harness on GitHub →

Honest by construction

The set includes cases it expects to fail. An eval that only contains winnable cases is marketing. Reasoning-scope cases score 0% and are reported as such: flagged expected_miss and broken out as their own per-scope figure, so they cannot hide inside an average. (known gap)
Recall comes before convenience. A false negative is a breach; a false positive is a cost. The two are scored separately so neither can hide the other, and both over-redaction and precision have their own gates — so recall can't be bought by shredding the record, and a spurious grab fails CI even when recall holds.
Offline and online agree. The deterministic scorecard is cross-checked token-for-token against the live MCP server — including the prose-04 dual failure — so the fast, network-free suite is a faithful stand-in for the shipping service.
Reproducible. Every scorecard records the engine version and code SHA it ran against, so a result can always be traced to a specific state of the system.

Known gaps & what's next

No reasoning layer yet. Indirect leakage, partial postcodes, initials and DOB-as-age need the LLM-assisted layer. They're in the gold set now, scoring 0%, so the day that layer lands, its gain is measured, not asserted. (v1 target)
Injection is only tested against the deterministic engine. The genuinely injectable surface is any downstream model consuming Redacta's output. The same injection cases are staged to point there next. (v1 target)
Precision now scores every removal, not just baited distractors. A precision metric treats any non-gold token the engine removes as a candidate false positive, so spurious grabs like prose-04's "confirmed" land in the headline (97.7% precision) and fail CI on any new occurrence, rather than surviving only as a case note. (shipped · v1)
Synthetic only, permanently. Realism is bounded by policy — no real patient data, ever. That's a deliberate ceiling, named here so it isn't mistaken for an oversight.

The triad, completed

Redacta now carries three measurements that answer three different questions. The benchmark asks "does the engine do what it targets, reliably?" and answers with 100% recall, zero false positives, ten seeds. The Gauntlet asks "what happens outside the target, when the document is hostile?" And the harness's cost-and-latency axis asks "what does correctness cost at manuscript scale?" Read together, they turn "trust us" into numbers anyone can re-run.

Evaluation you can audit

Read the threat model, the gold-set format, and the regression gate — or see the same discipline applied to Redacta's friendly benchmark.

The Gauntlet harness is on GitHub. Redacta is open source and free, with a Zenodo DOI at 10.5281/zenodo.21115605. For questions about the evaluation methodology, contact support@pharmatools.ai. See also the accuracy benchmark and the RefCheckr eval framework that this harness is modelled on.

Not just redacted — redacted under attack