Redacta vs Presidio — PII Redaction Accuracy Benchmark

How accurate is Redacta?

A reproducible benchmark of Redacta's deterministic detection engine — the layer that ships in the iPhone app, the MCP server, the CLI and the libraries. It measures how reliably the engine redacts the identifiers it targets, and how well it avoids redacting what it shouldn't.

300 synthetic notes · 1,713 identifiers · vs Presidio · 10 seeds

Caught everything in scope. Redacted nothing it shouldn't.

RECALL

100%

1,713 / 1,713 in-scope identifiers caught.

PRECISION

100%

Every redaction was a real identifier — no false positives.

PRESERVED

100%

1,329 / 1,329 distractors (clinician names, dates, dosages) correctly kept.

The corpus is 300 synthetic UK clinical notes with 1,713 gold-labelled identifiers, plus 1,329 "preserve" distractors designed to tempt a naive redactor. It is run against the shipping engine in clinical mode. All data is synthetic — no real patient information is used.

Stable across seeds. Re-run on 10 independent corpora — 17,227 identifiers in all — the result doesn't budge: 100% recall, 0 false positives, 100% preservation, every time. Zero variance. That's what "deterministic" buys you.

Redacta vs. Microsoft Presidio

A number means more next to a baseline. We ran Microsoft Presidio (presidio-analyzer 2.2.362, default recognisers, en_core_web_lg 3.7), the most widely used open-source PII detector, over the identical corpus and scored it by the identical rule. The gap isn't really about finding identifiers. It's about not shredding the record around them.

Metric | Redacta | Presidio (default)
Identifiers found · any label, higher is better | 100% | 89.7%
Strict recall · found with the correct label, higher is better | 100% | 77.3%
Clinical context preserved · distractors kept, higher is better | 100% | 32.8%
False positives · things wrongly redacted, lower is better | 0 | 1,643

Presidio finds most identifiers — but it also flags 1,643 things that aren't personal data: every clinician's name, every appointment date. It keeps barely a third of the context a clinical record needs to stay readable. And it has no recogniser for UK National Insurance numbers (32% found) or postcodes (38%) — identifiers Redacta is purpose-built for. On the same text, Redacta made zero false positives.

This is the trade a general-purpose detector makes, and why a tuned, deterministic engine matters for clinical text: over-redaction isn't free — it destroys the meaning you were trying to preserve.

Detection by identifier type

Identifier | Tested | Recall
NHS number · Modulus-11 validated | 300 | 100%
Patient name · title / salutation / label | 300 | 100%
Date of birth · keyword anchored | 247 | 100%
Relative / next-of-kin name | 154 | 100%
Phone number · UK / US | 175 | 100%
Postcode · UK | 170 | 100%
MRN / hospital number | 130 | 100%
National Insurance number | 105 | 100%
Email | 132 | 100%

NHS numbers were tested in spaced, unspaced and dashed forms; dates of birth were introduced by varied keywords (Date of birth, DOB:, D.O.B., Born on).
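The Modulus-11 check behind the NHS number row can be sketched as below. This is a minimal illustration of the published check-digit algorithm, not Redacta's actual source; the function names `normaliseNhsNumber` and `isValidNhsNumber` are hypothetical.

```javascript
// Strip the separators used in the spaced and dashed forms.
function normaliseNhsNumber(raw) {
  return raw.replace(/[\s-]/g, "");
}

// Modulus-11 check: weight the first nine digits 10 down to 2,
// then compare the derived check digit with the tenth digit.
function isValidNhsNumber(raw) {
  const digits = normaliseNhsNumber(raw);
  if (!/^\d{10}$/.test(digits)) return false;

  let sum = 0;
  for (let i = 0; i < 9; i++) {
    sum += Number(digits[i]) * (10 - i);
  }

  let check = 11 - (sum % 11);
  if (check === 11) check = 0;     // a remainder of 0 maps to check digit 0
  if (check === 10) return false;  // no valid check digit exists for this prefix
  return check === Number(digits[9]);
}

console.log(isValidNhsNumber("943 476 5919")); // spaced form
console.log(isValidNhsNumber("943-476-5919")); // dashed form
console.log(isValidNhsNumber("9434765918"));   // last digit altered
```

Validating the checksum before redacting is what lets a detector leave an arbitrary 10-digit reference number alone.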

Under the any-label rule, both engines find 100% of NHS numbers, patient and relative names, dates of birth, phone numbers and emails. They split on the two UK-specific identifiers for which Presidio has no recogniser:

Identifier found | Redacta | Presidio (default)
National Insurance number | 100% | 32%
UK postcode | 100% | 38%
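To make the National Insurance gap concrete, here is a rough sketch of what a dedicated recogniser has to check. It follows the format rules HMRC publishes (two prefix letters with some letters excluded, six digits, a suffix of A to D, and a handful of unallocated prefixes). The names `NI_PATTERN`, `UNALLOCATED_PREFIXES` and `isPlausibleNiNumber` are hypothetical, and Redacta's own rules may differ.

```javascript
// First letter may not be D, F, I, Q, U or V; the second additionally may not be O.
const NI_PATTERN = /^(?![DFIQUV])[A-Z](?![DFIOQUV])[A-Z]\d{6}[A-D]$/;

// Prefixes that are never allocated.
const UNALLOCATED_PREFIXES = new Set(["BG", "GB", "KN", "NK", "NT", "TN", "ZZ"]);

function isPlausibleNiNumber(raw) {
  // Accept spaced or dashed input such as "AB 12 34 56 C".
  const compact = raw.toUpperCase().replace(/[\s-]/g, "");
  if (!NI_PATTERN.test(compact)) return false;
  return !UNALLOCATED_PREFIXES.has(compact.slice(0, 2));
}

console.log(isPlausibleNiNumber("AB 12 34 56 C")); // well-formed
console.log(isPlausibleNiNumber("QQ 12 34 56 C")); // Q is not a permitted prefix letter
console.log(isPlausibleNiNumber("GB123456A"));     // unallocated prefix
```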

What it must not redact

A redactor that flags everything is useless — it destroys the clinical record. Each note is seeded with false-positive bait. The engine correctly preserved all 1,329:

KEPT

Invalid identifiers

NHS numbers that fail the Modulus-11 checksum, and NI numbers with invalid prefixes — proof it isn't redacting "any 10 digits".

KEPT

Clinical context

Appointment dates (no DOB keyword) and clinician names — only the patient is removed, so the record keeps its meaning.

KEPT

Numbers & doses

Lab values and dosages like 200 mg and Ferritin 23 ug/L are left untouched.
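The keyword anchoring that separates a date of birth from an appointment date can be sketched roughly as follows. This is an illustration under stated assumptions: the pattern covers only the four keywords listed earlier and dd/mm/yyyy dates, and `DOB_PATTERN`, `redactDob` and the `[DOB]` placeholder are hypothetical names rather than Redacta's real ones.

```javascript
// A date is redacted only when one of the DOB keywords introduces it.
const DOB_PATTERN =
  /\b(date of birth|d\.?o\.?b\.?|born on)(\s*[:\-]?\s*)(\d{1,2}\/\d{1,2}\/\d{4})/gi;

function redactDob(text) {
  // Keep the keyword and separator, replace only the date itself.
  return text.replace(DOB_PATTERN, (_match, keyword, separator) => {
    return `${keyword}${separator}[DOB]`;
  });
}

const note =
  "DOB: 04/07/1961. Review on 12/03/2025 with Dr Patel. Ferritin 23 ug/L.";
console.log(redactDob(note));
// prints "DOB: [DOB]. Review on 12/03/2025 with Dr Patel. Ferritin 23 ug/L."
```

The appointment date, the clinician name and the lab value pass through untouched because nothing anchors them as patient identifiers.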

Small enough to run where the data is

Accuracy is only half the story. Redacta is a 15 KB deterministic engine — no model to download, no server to call. That's what lets it run inside the iPhone app and the browser, so the text never leaves the device. A model-based detector has to load hundreds of megabytes and, in practice, runs server-side — which means sending the unredacted text away to redact it.

FOOTPRINT

~15 KB

The whole engine — about 5 KB gzipped. Presidio's en_core_web_lg model is ~560 MB: roughly 37,000× larger.

SPEED

0.04 ms

Per note — about 25,000 notes a second on one core. Pure regex and checksums, so there's no model load and no cold start.

WHERE IT RUNS

On-device

In the app and the browser, with zero network calls. Presidio needs a Python runtime and the model in memory, so in practice it runs on a server and the text has to leave the device.

Engine size, to scale (0 – 560 MB): Redacta 15 KB, Presidio 560 MB

At that scale Redacta's bar is a hairline on the left: 15 KB is a rounding error next to a 560 MB model. Speed was measured with Node on a laptop-class core; on-device, the app runs the engine in JavaScriptCore. The Presidio figure is the footprint of the en_core_web_lg model, run as shipped.
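A per-note timing like the one quoted could be taken with a harness along these lines. This is only a sketch: `timePerNote` is a hypothetical helper, and `standInRedact` is a one-regex placeholder rather than the real engine, so the figures it prints say nothing about Redacta itself.

```javascript
// Average wall-clock milliseconds per note over several passes.
function timePerNote(redact, notes, repeats = 50) {
  // Warm-up pass so one-off JIT compilation is not counted.
  for (const note of notes) redact(note);

  const start = performance.now();
  for (let r = 0; r < repeats; r++) {
    for (const note of notes) redact(note);
  }
  const elapsedMs = performance.now() - start;
  return elapsedMs / (repeats * notes.length);
}

// Placeholder redactor (assumption): masks anything shaped like an NHS number.
const standInRedact = (text) =>
  text.replace(/\b\d{3}[ -]?\d{3}[ -]?\d{4}\b/g, "[NHS]");

const sampleNotes = Array.from(
  { length: 300 },
  (_, i) => `Note ${i}: NHS number 943 476 5919, Ferritin 23 ug/L.`
);

const msPerNote = timePerNote(standInRedact, sampleNotes);
console.log(`${msPerNote.toFixed(4)} ms per note`);
console.log(`${Math.round(1000 / msPerNote)} notes per second on one core`);
```

The conversion is the same one used above: 0.04 ms per note is 1000 / 0.04 = 25,000 notes a second.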

How the benchmark works

01

Generate

A seeded generator builds 300 synthetic notes with programmatically-valid identifiers (e.g. checksum-valid NHS numbers) and records the gold label for each.

02

Run

Each note goes through the shipping engine in clinical mode — the same code as the iPhone app — and the whole thing repeats across 10 independent seeds.

03

Score

Recall, precision and preservation are computed from the returned token map, comparing on alphanumerics so spacing and dashes don't affect scoring.
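The scoring step can be sketched as below. The shape of the token map (placeholder mapped to original text), the exact-match comparison and the names `alnum` and `scoreNote` are assumptions for illustration; the real scorer in the repository may differ.

```javascript
// Compare on alphanumerics only, so spacing and dashes don't affect scoring.
const alnum = (s) => s.toLowerCase().replace(/[^a-z0-9]/g, "");

// gold: strings that must be redacted; distractors: strings that must be kept;
// tokenMap: assumed shape { placeholder: originalText } returned by the engine.
function scoreNote(gold, distractors, tokenMap) {
  const redacted = new Set(Object.values(tokenMap).map(alnum));
  const goldSet = new Set(gold.map(alnum));

  const caught = gold.filter((g) => redacted.has(alnum(g))).length;
  const truePositives = [...redacted].filter((r) => goldSet.has(r)).length;
  const kept = distractors.filter((d) => !redacted.has(alnum(d))).length;

  return {
    recall: caught / gold.length,
    precision: redacted.size === 0 ? 1 : truePositives / redacted.size,
    preservation: kept / distractors.length,
  };
}

// An over-eager redactor: both identifiers caught, plus one appointment date.
const result = scoreNote(
  ["943 476 5919", "Mr John Smith"],
  ["Dr Patel", "12/03/2025", "200 mg"],
  { "[NHS_1]": "9434765919", "[NAME_1]": "Mr John Smith", "[DATE_1]": "12/03/2025" }
);
console.log(result); // recall 1, precision 2/3, preservation 2/3
```

In this example the wrongly redacted appointment date costs both precision and preservation while recall stays perfect, which is the pattern the Presidio comparison shows at scale.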

It's fully reproducible — fixed seed, one command. The corpus is exported to corpus.json so the Presidio baseline scores on byte-identical input with the same rule (presidio_baseline.py).

$ node benchmark/benchmark.mjs
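What a fixed seed buys can be shown with a small seeded generator. This sketch uses the widely circulated mulberry32 PRNG as a stand-in, since the generator actually used by benchmark.mjs is not shown here; `generateCorpus`, the note template and the `NHS_NUMBER` label are likewise hypothetical.

```javascript
// mulberry32: a small deterministic PRNG (stand-in, not necessarily the one used).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Modulus-11 check value for nine digits; 10 means no valid check digit exists.
function checkDigit(nine) {
  let sum = 0;
  for (let i = 0; i < 9; i++) sum += Number(nine[i]) * (10 - i);
  const check = 11 - (sum % 11);
  return check === 11 ? 0 : check;
}

// Draw nine digits, retrying whenever the prefix has no valid check digit.
function generateNhsNumber(rand) {
  for (;;) {
    let nine = "";
    for (let i = 0; i < 9; i++) nine += Math.floor(rand() * 10);
    const check = checkDigit(nine);
    if (check !== 10) return nine + check;
  }
}

// Each note carries its text and the gold label recorded at generation time.
function generateCorpus(seed, count) {
  const rand = mulberry32(seed);
  const notes = [];
  for (let i = 0; i < count; i++) {
    const nhs = generateNhsNumber(rand);
    notes.push({
      text: `NHS number: ${nhs}. Seen in clinic today.`,
      gold: [{ label: "NHS_NUMBER", value: nhs }],
    });
  }
  return notes;
}

const runA = JSON.stringify(generateCorpus(42, 5));
const runB = JSON.stringify(generateCorpus(42, 5));
console.log(runA === runB); // same seed, same corpus
```

Because the generator and the engine are both deterministic, a rerun with the same seed should reproduce the corpus and the scores exactly.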

Limitations — please read

These numbers describe the engine within its deterministic scope. They are high because the engine is regex- and checksum-based and near-exhaustive on the patterns it targets — the benchmark verifies that and probes the boundaries. The real limits:

Free-text names are out of scope. A name with no anchoring title, salutation, label or relationship word — "Patricia returned today…" — is not caught (0 of 50 in this run). This is by design for an on-device deterministic engine: the app prompts you to review, and the agent-skill layer adds AI reasoning for these cases.

It over-redacts in ambiguous cases — the safe direction. A 10-digit reference number formatted like a phone number, or "Sister [Name]" (a UK nurse title the engine reads as a relative), can be redacted even when it isn't personal data. Erring toward redaction is the safer error.

The corpus is Redacta's home turf. It's UK clinical text, the domain Redacta is tuned for, so a perfect score is expected — that's why the Presidio comparison matters more than the absolute number. Presidio is general-purpose and was run as shipped (default recognisers); a clinical-specific configuration would close some of the gap, though not the UK-identifier coverage.

Synthetic data is tidier than reality. Real letters carry OCR noise and unusual layouts, so real-world recall — especially for names — will be lower. Redacta is a strong first line of defence, not a guarantee: always review the output before sharing text.

Redaction you can verify

Try Redacta on your own text — free on the App Store — or read the open-source engine and benchmark on GitHub.