Eval Framework — RefCheckr by PharmaTools.AI
RefCheckr Eval Framework

Not just plausible: provable

As AI moves from experimentation to production, evaluation is becoming as fundamental as testing is in traditional software. The RefCheckr Eval Framework measures how reliably an evidence-grounded system uses its sources — and gates every prompt, model, and workflow change against a baseline so reliability can't quietly regress.

The question that matters

Can the model prove its answer from the source material?

Most AI evaluation asks whether a model produces a plausible answer. For a tool that medical writers rely on to verify clinical claims, plausibility is not enough. The failure modes that matter are subtle: a claim marked not supported because the supporting text was truncated; a slightly reworded claim verified in place of the original; a prompt change that quietly lowers verdict accuracy. None of these are visible without measurement.

The RefCheckr Eval Framework exists to make those failure modes into numbers we can track over time.

What it measures

The harness runs the verification pipeline against benchmark datasets of claims and supporting documents, then scores it on five families of metric:

| Metric | What it captures |
| --- | --- |
| Citation accuracy | For each claim, are the right citation markers detected and mapped to the right reference? Scored by exact set-match and Jaccard overlap. |
| Claim extraction | Does the splitter pull the genuinely verifiable claims out of a manuscript section and leave background, aims, and transitions out? Scored by precision, recall, and F1. |
| Factual support | Does each claim, checked against its cited reference, receive the correct verdict on the six-point support scale? Scored by exact accuracy and a more forgiving off-by-one (adjacency) accuracy. |
| Hallucination rate | Are supporting passages genuinely present in the source, and is every extracted claim verbatim from the manuscript? Anything that fails this check is counted as an infidelity. |
| Consistency | Run the same claim repeatedly: does the verdict hold? Stability is measured across repeats so non-determinism is quantified, not assumed away. |
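The scoring rules named in the table can be sketched in a few small functions. This is a minimal illustration, not RefCheckr's actual code: every function name is hypothetical, verdicts are assumed to be encoded as ordinal positions 0 to 5 on the six-point scale, and stability is assumed to mean the share of repeats that agree with the most common verdict.

```typescript
// Hypothetical scorers for illustration; names and encodings are assumptions.

/** Number of items present in both sets. */
function countShared<T>(a: Set<T>, b: Set<T>): number {
  let shared = 0;
  a.forEach((item) => {
    if (b.has(item)) shared++;
  });
  return shared;
}

/** Exact set-match: true only when both sets hold exactly the same members. */
function exactSetMatch<T>(predicted: Set<T>, gold: Set<T>): boolean {
  return predicted.size === gold.size && countShared(predicted, gold) === gold.size;
}

/** Jaccard overlap |A ∩ B| / |A ∪ B|. Two empty sets count as a perfect match (assumption). */
function jaccard<T>(predicted: Set<T>, gold: Set<T>): number {
  if (predicted.size === 0 && gold.size === 0) return 1;
  const shared = countShared(predicted, gold);
  return shared / (predicted.size + gold.size - shared);
}

interface PrecisionRecallF1 {
  precision: number;
  recall: number;
  f1: number;
}

/** Claim extraction: compare extracted claims with the gold claims. */
function precisionRecallF1(extracted: Set<string>, gold: Set<string>): PrecisionRecallF1 {
  const truePositives = countShared(extracted, gold);
  const precision = extracted.size === 0 ? 0 : truePositives / extracted.size;
  const recall = gold.size === 0 ? 0 : truePositives / gold.size;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

/** Factual support: exact accuracy and off-by-one (adjacency) accuracy. */
function verdictAccuracy(predicted: number[], gold: number[]): { exact: number; adjacent: number } {
  if (predicted.length !== gold.length || gold.length === 0) {
    throw new Error("predicted and gold must be non-empty and the same length");
  }
  let exact = 0;
  let adjacent = 0;
  for (let i = 0; i < gold.length; i++) {
    const distance = Math.abs(predicted[i] - gold[i]);
    if (distance === 0) exact++;
    if (distance <= 1) adjacent++; // exact matches also count as adjacent
  }
  return { exact: exact / gold.length, adjacent: adjacent / gold.length };
}

/** Consistency: share of repeated runs that agree with the most common verdict. */
function verdictStability(repeats: number[]): number {
  if (repeats.length === 0) throw new Error("at least one repeat is required");
  const counts = new Map<number, number>();
  for (const verdict of repeats) {
    counts.set(verdict, (counts.get(verdict) ?? 0) + 1);
  }
  let mostCommon = 0;
  counts.forEach((count) => {
    if (count > mostCommon) mostCommon = count;
  });
  return mostCommon / repeats.length;
}
```

Reporting adjacency alongside exact accuracy separates near-misses on the ordinal scale from verdicts that land far from the gold label, which a single accuracy figure would blur together.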

How it works

The framework is built like a test suite for an AI system — gold cases in, metrics out, a baseline to compare against.

1. Gold cases. Each case is a hand-labelled manuscript section: the source text, the claims a reviewer should extract (with and without citation markers), the sentences that should not be extracted, and, for verdict scoring, reference snippets with known-correct verdicts.
2. Scorers. Independent scorers run each metric family. Offline scorers test the deterministic logic (citation parsing) with no API cost, fast enough for every commit. Online scorers exercise the live endpoints for the model-dependent metrics.
3. Scorecard. Every run produces a scorecard stamped with the exact code version, and a snapshot is written to disk so any result is reproducible and auditable.
4. Regression gate. Each run is diffed against a saved baseline. Any metric that drops is flagged; in continuous integration the build fails. A prompt or model change cannot ship without first proving it didn't make the system less reliable.

    # one command, no API key needed for the deterministic suite
    $ npm run eval

    RefCheckr evals — gold cases loaded
    citation-detection           PASS
      per-claim exact-match      100.0%
      supported-style accuracy   100.0%
      known-gap styles           author-year (tracked target)
    ⊘ claim-extraction           online scorer (pass --online)
    ⊘ verdict-accuracy           online scorer (pass --online)
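The four steps above can be sketched as data shapes plus a small comparison function. This is a minimal sketch under stated assumptions: the interface and field names (`GoldCase`, `Scorecard`, `findRegressions`, `tolerance`) are hypothetical, every metric is assumed to be "higher is better", and the real harness may store and compare results differently.

```typescript
// Hypothetical shapes for illustration; field names are assumptions.

interface ReferenceSnippet {
  text: string;            // excerpt from the cited reference
  expectedVerdict: number; // known-correct verdict, assumed encoded as 0..5
}

interface GoldCase {
  id: string;
  sourceText: string; // the hand-labelled manuscript section
  expectedClaims: { text: string; citationMarkers: string[] }[]; // markers may be empty
  nonClaims: string[]; // sentences that should not be extracted
  referenceSnippets: ReferenceSnippet[];
}

interface Scorecard {
  codeVersion: string;             // for example, a commit hash
  metrics: Record<string, number>; // metric name -> score, higher is better
}

interface Regression {
  metric: string;
  baseline: number;
  current: number; // NaN when the metric is missing from the current run
}

/** Diff a run against the saved baseline and list every metric that dropped. */
function findRegressions(baseline: Scorecard, current: Scorecard, tolerance = 0): Regression[] {
  const regressions: Regression[] = [];
  for (const metric of Object.keys(baseline.metrics)) {
    const before = baseline.metrics[metric];
    const after = current.metrics[metric];
    // A metric that disappeared is treated as a regression rather than skipped.
    if (after === undefined || after < before - tolerance) {
      regressions.push({ metric, baseline: before, current: after ?? NaN });
    }
  }
  return regressions;
}

// Example: one metric drops between the baseline and the current run.
const baselineCard: Scorecard = {
  codeVersion: "baseline",
  metrics: { "citation-exact-match": 1.0, "verdict-accuracy": 0.9 },
};
const currentCard: Scorecard = {
  codeVersion: "candidate",
  metrics: { "citation-exact-match": 1.0, "verdict-accuracy": 0.85 },
};
const flagged = findRegressions(baselineCard, currentCard);
console.log(flagged.map((r) => r.metric));
```

In continuous integration, a wrapper script could exit with a non-zero status whenever the returned list is non-empty, which is one way to make the build fail on any drop.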

Honest by construction

  • Known gaps are tracked, not hidden. Where the system doesn't yet handle a case (for example, author-year citations rather than numbered ones), the harness reports it as a named target rather than letting it pass silently.
  • Fidelity is a first-class metric. It isn't enough for a verdict to look right; the framework checks that quoted evidence actually exists in the source and that extracted claims are verbatim from the manuscript.
  • Reproducible. Every scorecard records the code version it ran against, so a result can always be traced back to a specific state of the system.
  • Growing benchmark. The gold set is expanding across citation styles and all six verdict types; metrics become more meaningful as coverage grows.
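The fidelity check described in the list above can be illustrated with a simple substring test. This is a sketch only: the helper names are hypothetical, and collapsing whitespace before comparison is an assumption about what "verbatim" should tolerate (line wrapping, for instance), not a description of RefCheckr's matching rules.

```typescript
// Hypothetical fidelity helpers; the whitespace normalisation is an assumption.

/** Collapse runs of whitespace so line wrapping does not break a match. */
function normaliseWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

/** True when the passage appears word for word in the source. */
function isVerbatim(passage: string, source: string): boolean {
  const needle = normaliseWhitespace(passage);
  return needle.length > 0 && normaliseWhitespace(source).indexOf(needle) !== -1;
}

/** Share of passages that cannot be found in the source (each one is an infidelity). */
function infidelityRate(passages: string[], source: string): number {
  if (passages.length === 0) return 0;
  const failures = passages.filter((passage) => !isVerbatim(passage, source)).length;
  return failures / passages.length;
}

// Example: the second quote changes a number, so it is counted as an infidelity.
const sourceText = "Drug X reduced mortality by 12%\nin adults over 65.";
const quotes = ["reduced mortality by 12% in adults", "reduced mortality by 21%"];
console.log(infidelityRate(quotes, sourceText));
```

The same check applies in both directions named in the text: supporting passages are tested against the reference, and extracted claims are tested against the manuscript.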

Why it matters beyond RefCheckr

Teams shipping AI need repeatable, quantitative ways to compare models, validate improvements, and ensure systems stay trustworthy over time. For evidence-grounded AI — systems that must ground answers in source documents — that measurement layer is still largely missing.

The RefCheckr Eval Framework is purpose-built for RefCheckr, but its structure — gold sets, scorers for citation and factual support, fidelity and hallucination checks, and a regression gate — generalises to any system that has to prove its answers. It is the seed of a standalone toolkit for benchmarking evidence-grounded AI: helping developers build systems that are not only intelligent, but verifiably correct.

For questions about RefCheckr's evaluation methodology, contact support@pharmatools.ai. See also how RefCheckr verifies a claim in the AI Architecture overview, and how data is handled on the Security & Data Handling page.