RefCheckr Eval Framework — Benchmarking Evidence-Grounded AI

The question that matters

Can the model prove its answer from the source material?

Most AI evaluation asks whether a model produces a plausible answer. For a tool that medical writers rely on to verify clinical claims, plausibility is not enough. The failure modes that matter are subtle: a claim marked not supported because the supporting text was truncated; a slightly reworded claim verified in place of the original; a prompt change that quietly lowers verdict accuracy. None of these are visible without measurement.

The RefCheckr Eval Framework exists to turn those failure modes into numbers we can track over time.

What it measures

The harness runs the verification pipeline against benchmark datasets of claims and supporting documents, then scores it on five metric families:

Metric	What it captures
Citation accuracy	For each claim, are the right citation markers detected and mapped to the right reference? Scored by exact set-match and Jaccard overlap.
Claim extraction	Does the splitter pull the genuinely verifiable claims out of a manuscript section — and leave background, aims, and transitions out? Scored by precision, recall, and F1.
Factual support	Does each claim, checked against its cited reference, receive the correct verdict on the six-point support scale? Scored by exact accuracy and a more forgiving off-by-one (adjacency) accuracy.
Hallucination rate	Are supporting passages genuinely present in the source, and is every extracted claim verbatim from the manuscript? Anything that fails this check is counted as an infidelity.
Consistency	Run the same claim repeatedly — does the verdict hold? Stability is measured across repeats so non-determinism is quantified, not assumed away.
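The scoring rules named in the table can be sketched in a few small functions. This is an illustrative sketch, not RefCheckr's implementation: every function name is hypothetical, verdicts are modelled as ordinal positions 0 to 5 because the six labels are not listed here, and the stability definition (share of repeats agreeing with the most common verdict) is one plausible reading of "stability across repeats".

```typescript
// Hypothetical helpers; names and signatures are assumptions for illustration.

/** Jaccard overlap |A ∩ B| / |A ∪ B|; two empty sets count as a perfect match. */
function jaccard<T>(predicted: Set<T>, gold: Set<T>): number {
  if (predicted.size === 0 && gold.size === 0) return 1;
  let shared = 0;
  predicted.forEach((item) => {
    if (gold.has(item)) shared++;
  });
  return shared / (predicted.size + gold.size - shared);
}

/** Exact set-match: identical membership, ignoring order. */
function exactSetMatch<T>(predicted: Set<T>, gold: Set<T>): boolean {
  return predicted.size === gold.size && jaccard(predicted, gold) === 1;
}

/** Precision, recall and F1 for extracted claims, compared as exact strings. */
function precisionRecallF1(extracted: string[], gold: string[]) {
  const goldSet = new Set(gold);
  const extractedSet = new Set(extracted);
  let truePositives = 0;
  extractedSet.forEach((claim) => {
    if (goldSet.has(claim)) truePositives++;
  });
  const precision = extractedSet.size === 0 ? 0 : truePositives / extractedSet.size;
  const recall = goldSet.size === 0 ? 0 : truePositives / goldSet.size;
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

/** Exact and off-by-one (adjacency) accuracy over ordinal verdict positions. */
function verdictAccuracy(predicted: number[], gold: number[]) {
  if (predicted.length !== gold.length || gold.length === 0) {
    throw new Error("predicted and gold must be non-empty and the same length");
  }
  let exact = 0;
  let adjacent = 0;
  predicted.forEach((p, i) => {
    const distance = Math.abs(p - gold[i]);
    if (distance === 0) exact++;
    if (distance <= 1) adjacent++; // an exact hit also counts as adjacent
  });
  return { exact: exact / gold.length, adjacent: adjacent / gold.length };
}

/** Assumed stability measure: share of repeats that match the modal verdict. */
function stability(verdicts: number[]): number {
  if (verdicts.length === 0) return 1;
  const counts = new Map<number, number>();
  let modeCount = 0;
  verdicts.forEach((v) => {
    const next = (counts.get(v) ?? 0) + 1;
    counts.set(v, next);
    if (next > modeCount) modeCount = next;
  });
  return modeCount / verdicts.length;
}
```

Under these definitions, a claim whose detected markers are {1, 2} against a gold set of {2, 3} fails the exact set-match but still earns a Jaccard score of 1/3, which is why the two scores are worth reporting side by side.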

How it works

The framework is built like a test suite for an AI system — gold cases in, metrics out, a baseline to compare against.

Gold cases. Each case is a hand-labelled manuscript section: the source text, the claims a reviewer should extract (with and without citation markers), the sentences that should not be extracted, and, for verdict scoring, reference snippets with known-correct verdicts.
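A gold case as described above might be represented along the following lines. The schema is not published here, so every field name, the `validateGoldCase` helper, and the example text are hypothetical and invented purely for illustration. The helper shows one sanity check a labelled case could be put through before it is used for scoring.

```typescript
// All names below are assumptions; the text describes what a gold case holds,
// not how it is stored.
interface ReferenceSnippet {
  referenceId: string; // which cited reference the snippet comes from
  text: string; // passage taken from that reference
  expectedVerdict: string; // known-correct verdict label (placeholder values here)
}

interface GoldClaim {
  text: string; // claim exactly as it appears in the source text
  citationMarkers: string[]; // e.g. ["1"]; empty for claims without markers
  snippets: ReferenceSnippet[];
}

interface GoldCase {
  id: string;
  sourceText: string;
  claims: GoldClaim[];
  nonClaims: string[]; // background, aims, transitions that must not be extracted
}

/** Returns labelling problems; an empty array means the case is self-consistent. */
function validateGoldCase(goldCase: GoldCase): string[] {
  const problems: string[] = [];
  goldCase.claims.forEach((claim) => {
    if (goldCase.sourceText.indexOf(claim.text) === -1) {
      problems.push(`claim is not verbatim in source: "${claim.text}"`);
    }
  });
  goldCase.nonClaims.forEach((sentence) => {
    if (goldCase.sourceText.indexOf(sentence) === -1) {
      problems.push(`non-claim is not verbatim in source: "${sentence}"`);
    }
    if (goldCase.claims.some((claim) => claim.text === sentence)) {
      problems.push(`sentence is labelled both claim and non-claim: "${sentence}"`);
    }
  });
  return problems;
}

// Invented example text, not real benchmark data.
const exampleCase: GoldCase = {
  id: "example-001",
  sourceText:
    "Background on the condition is given here. " +
    "Treatment A lowered symptom scores [1]. " +
    "This section outlines our aims.",
  claims: [
    {
      text: "Treatment A lowered symptom scores [1].",
      citationMarkers: ["1"],
      snippets: [
        {
          referenceId: "1",
          text: "Symptom scores were lower with Treatment A.",
          expectedVerdict: "supported", // placeholder label
        },
      ],
    },
  ],
  nonClaims: ["This section outlines our aims."],
};
```

Calling `validateGoldCase(exampleCase)` returns an empty list; a case whose claim text had been paraphrased during labelling would be reported before it could distort a score.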

Scorers. Independent scorers run each metric family. Offline scorers test the deterministic logic (citation parsing) with no API cost, which makes them fast enough to run on every commit. Online scorers exercise the live endpoints for the model-dependent metrics.
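The offline/online split can be pictured as a flag on each scorer plus a planning step that decides what runs. This is a minimal sketch under assumptions: the `Scorer` shape and `planRun` are hypothetical, the stub `run` functions return no metrics, and the scorer names are borrowed from the sample run output in this section.

```typescript
// Hypothetical scorer shape; the real interface is not described in the text.
type ScorerMode = "offline" | "online";

interface Scorer {
  name: string;
  mode: ScorerMode;
  run: () => Promise<Record<string, number>>; // metric name -> value
}

interface RunPlan {
  toRun: string[]; // scorers that execute in this invocation
  skipped: string[]; // online scorers held back because --online was not passed
}

/** Offline scorers always run; online scorers run only when explicitly enabled. */
function planRun(scorers: Scorer[], onlineEnabled: boolean): RunPlan {
  const plan: RunPlan = { toRun: [], skipped: [] };
  for (const scorer of scorers) {
    if (scorer.mode === "online" && !onlineEnabled) {
      plan.skipped.push(scorer.name);
    } else {
      plan.toRun.push(scorer.name);
    }
  }
  return plan;
}

// Stub scorers: the run functions are placeholders that return no metrics.
const scorers: Scorer[] = [
  { name: "citation-detection", mode: "offline", run: async () => ({}) },
  { name: "claim-extraction", mode: "online", run: async () => ({}) },
  { name: "verdict-accuracy", mode: "online", run: async () => ({}) },
];
```

With `onlineEnabled` set to false, only `citation-detection` is planned and the other two are listed as skipped, so the default run needs no API key.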

Scorecard. Every run produces a scorecard stamped with the exact code version, and a snapshot is written to disk so any result is reproducible and auditable.

Regression gate. Each run is diffed against a saved baseline. Any metric that drops is flagged, and in continuous integration a flagged drop fails the build. A prompt or model change cannot ship without first proving it didn't make the system less reliable.
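The scorecard and the gate fit together roughly as follows. This is a sketch under stated assumptions, not the framework's code: the `Scorecard` shape, the metric names, and the `tolerance` parameter are all hypothetical, and every metric is assumed to be "higher is better" (a rate such as hallucination rate would need to be stored as its complement).

```typescript
// Hypothetical shapes; the actual scorecard format is not shown in the text.
interface Scorecard {
  codeVersion: string; // e.g. a commit hash identifying the code that was run
  metrics: Record<string, number>; // assumed: higher is better for every entry
}

interface Regression {
  metric: string;
  baseline: number;
  current: number | undefined; // undefined when the metric disappeared
}

/** Lists every baseline metric that dropped, or went missing, in the current run. */
function findRegressions(
  baseline: Scorecard,
  current: Scorecard,
  tolerance = 0 // assumed knob for absorbing run-to-run noise
): Regression[] {
  const regressions: Regression[] = [];
  Object.keys(baseline.metrics).forEach((metric) => {
    const before = baseline.metrics[metric];
    const after: number | undefined = current.metrics[metric];
    if (after === undefined || after < before - tolerance) {
      regressions.push({ metric, baseline: before, current: after });
    }
  });
  return regressions;
}

// Invented numbers for illustration only.
const savedBaseline: Scorecard = {
  codeVersion: "baseline-commit",
  metrics: { "citation.exactMatch": 1.0, "verdict.exact": 0.9 },
};
const latestRun: Scorecard = {
  codeVersion: "candidate-commit",
  metrics: { "citation.exactMatch": 1.0, "verdict.exact": 0.85 },
};
const flagged = findRegressions(savedBaseline, latestRun);
```

Here `flagged` holds a single entry for `verdict.exact`. A CI wrapper would exit with a non-zero status whenever the list is non-empty, and because each scorecard carries its `codeVersion`, the report can name both the baseline and the candidate build.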

# one command, no API key needed for the deterministic suite
$ npm run eval

RefCheckr evals — gold cases loaded
  ✓ citation-detection     PASS
      per-claim exact-match     100.0%
      supported-style accuracy  100.0%
      known-gap styles          author-year  (tracked target)
  ⊘ claim-extraction       online scorer (pass --online)
  ⊘ verdict-accuracy       online scorer (pass --online)

Honest by construction

Known gaps are tracked, not hidden. Where the system doesn't yet handle a case (for example, author-year citations rather than numbered ones), the harness reports it as a named target rather than letting it pass silently.
Fidelity is a first-class metric. It isn't enough for a verdict to look right; the framework checks that quoted evidence actually exists in the source and that extracted claims are verbatim from the manuscript.
Reproducible. Every scorecard records the code version it ran against, so a result can always be traced back to a specific state of the system.
Growing benchmark. The gold set is expanding across citation styles and all six verdict types; metrics become more meaningful as coverage grows.
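The fidelity check described in this list comes down to asking whether a quoted string really occurs in its source. The sketch below is one way to express that. The helper names are hypothetical, and collapsing whitespace before comparison is an assumption made so that line wrapping alone does not count as a failure; it is not a documented RefCheckr rule.

```typescript
// Hypothetical helpers; the normalisation rule is an assumption.
function normaliseWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

/** True when the quote occurs in the source, ignoring differences in whitespace. */
function isVerbatim(quote: string, source: string): boolean {
  const needle = normaliseWhitespace(quote);
  return needle.length > 0 && normaliseWhitespace(source).indexOf(needle) !== -1;
}

/** Share of quotes that cannot be found in the source (0 when there are none). */
function infidelityRate(quotes: string[], source: string): number {
  if (quotes.length === 0) return 0;
  const failures = quotes.filter((quote) => !isVerbatim(quote, source)).length;
  return failures / quotes.length;
}
```

The same check applies in both directions mentioned above: supporting passages are tested against the reference text, and extracted claims against the manuscript. A passage that inserts even one word the source does not contain fails the check.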

Why it matters beyond RefCheckr

Teams shipping AI need repeatable, quantitative ways to compare models, validate improvements, and ensure systems stay trustworthy over time. For evidence-grounded AI — systems that must ground answers in source documents — that measurement layer is still largely missing.

The RefCheckr Eval Framework is purpose-built for RefCheckr, but its structure — gold sets, scorers for citation and factual support, fidelity and hallucination checks, and a regression gate — generalises to any system that has to prove its answers. It is the seed of a standalone toolkit for benchmarking evidence-grounded AI: helping developers build systems that are not only intelligent, but verifiably correct.

For questions about RefCheckr's evaluation methodology, contact support@pharmatools.ai. See also how RefCheckr verifies a claim in the AI Architecture overview, and how data is handled on the Security & Data Handling page.
