OpenGATE

Philosophy

Can the model prove its answer from the source material?

Modern AI evaluation often measures whether an answer sounds plausible. OpenGATE instead measures whether an answer is faithfully grounded in evidence. The failure modes that matter in production are subtle: a supporting passage that doesn't exist in the source; a slightly reworded claim verified in place of the original; a prompt change that quietly lowers accuracy. None of these are visible without measurement.

Evidence matters more than plausibility. That distinction is decisive wherever an ungrounded answer carries real cost:

Healthcare: Clinical claims verified against literature and labels

Scientific publishing: Citation-backed statements in manuscripts

Regulatory: Submissions grounded in source documentation

Legal: Arguments traceable to statute and precedent

Finance: Analysis grounded in filings and data

Enterprise RAG: Answers grounded in retrieved documents

Not a benchmark. Infrastructure.

OpenGATE is not another generic LLM benchmark. It is an engineering framework for continuously measuring and improving evidence-grounded AI systems: benchmark datasets, regression testing, prompt evaluation, model comparison, pipeline comparison, CI/CD quality gates, and quantitative metrics. The goal is that every model, prompt, or workflow change can be automatically evaluated before deployment — the same discipline a test suite brings to conventional software.

OpenGATE

↓ powers ↓

RefCheckr: evidence verification · first implementation

Patiently AI: faithfulness evaluation · planned

Redacta: planned

future systems: RAG · document QA · enterprise search

What it measures

OpenGATE runs a system's pipeline against benchmark datasets of claims and supporting documents, then scores it on five families of metric:

Metric family	What it captures
Citation accuracy	Are the right citation markers detected and mapped to the right reference? Scored by exact set-match and Jaccard overlap.
Claim extraction	Does the system pull the genuinely verifiable claims out of a document — and leave background and transitions out? Scored by precision, recall, and F1.
Factual support	Does each claim, checked against its cited source, receive the correct verdict on a graded support scale? Scored by exact and adjacency accuracy.
Hallucination rate	Are supporting passages genuinely present in the source, and are extracted claims verbatim? Anything that fails is counted as an infidelity.
Consistency	Run the same input repeatedly — does the verdict hold? Non-determinism is quantified, not assumed away.
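
The scoring rules named in the table can be sketched in a few lines of TypeScript. This is a minimal sketch, not OpenGATE's scorer code. The function names and the four-step verdict scale are hypothetical. Two definitions are also assumptions: adjacency accuracy is taken to mean "within one step of the gold verdict on the scale", and consistency is taken to mean "the share of repeated runs that agree with the most common verdict".

```typescript
// Illustrative only: names, the verdict scale, and the adjacency and
// consistency definitions are assumptions, not OpenGATE's actual API.

/** Jaccard overlap |A ∩ B| / |A ∪ B|; two empty sets count as a perfect match. */
function jaccard(predicted: string[], gold: string[]): number {
  const a = new Set(predicted);
  const b = new Set(gold);
  let intersection = 0;
  a.forEach((x) => { if (b.has(x)) intersection += 1; });
  const union = a.size + b.size - intersection;
  return union === 0 ? 1 : intersection / union;
}

/** Exact set-match: same members, order and duplicates ignored. */
function exactSetMatch(predicted: string[], gold: string[]): boolean {
  return jaccard(predicted, gold) === 1;
}

/** Precision, recall, and F1 for extracted claims against gold claims. */
function precisionRecallF1(predicted: string[], gold: string[]) {
  const p = new Set(predicted);
  const g = new Set(gold);
  let truePositives = 0;
  p.forEach((x) => { if (g.has(x)) truePositives += 1; });
  const precision = p.size === 0 ? 0 : truePositives / p.size;
  const recall = g.size === 0 ? 0 : truePositives / g.size;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Hypothetical graded support scale, ordered from weakest to strongest.
const SCALE = ["unsupported", "weak", "partial", "supported"] as const;
type Verdict = (typeof SCALE)[number];

/** Exact accuracy and adjacency accuracy (within one step on the scale). */
function verdictAccuracy(predicted: Verdict[], gold: Verdict[]) {
  if (predicted.length !== gold.length || gold.length === 0) {
    throw new Error("predicted and gold must be non-empty and the same length");
  }
  let exact = 0;
  let adjacent = 0;
  gold.forEach((g, i) => {
    const distance = Math.abs(SCALE.indexOf(predicted[i]) - SCALE.indexOf(g));
    if (distance === 0) exact += 1;
    if (distance <= 1) adjacent += 1;
  });
  return { exact: exact / gold.length, adjacent: adjacent / gold.length };
}

/** Share of repeated runs that agree with the most common verdict. */
function consistency(verdicts: string[]): number {
  if (verdicts.length === 0) throw new Error("need at least one run");
  const counts = new Map<string, number>();
  let modal = 0;
  for (const v of verdicts) {
    const n = (counts.get(v) ?? 0) + 1;
    counts.set(v, n);
    modal = Math.max(modal, n);
  }
  return modal / verdicts.length;
}
```

Reporting exact and adjacent accuracy side by side separates near misses on the graded scale from verdicts that are far from the gold label.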

How it works

The framework is built like a test suite for an AI system — gold cases in, metrics out, a baseline to compare against.

Gold cases: Hand-labelled benchmark cases, each containing the source text, the claims that should be extracted, the sentences that should not be, and reference snippets with known-correct verdicts.
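
One way to picture a gold case is as a typed record with a small sanity check. The sketch below is hypothetical: the field names, the `validateGoldCase` helper, and the example text are all invented for illustration, and the actual gold-set format is the one defined in the repository.

```typescript
// Hypothetical shape for one gold case; every field name is an assumption.
interface ReferenceSnippet {
  referenceId: string;     // which cited reference the snippet comes from
  claim: string;           // the claim being checked
  snippet: string;         // reference text the claim is checked against
  expectedVerdict: string; // known-correct verdict on the support scale
}

interface GoldCase {
  id: string;
  sourceText: string;
  expectedClaims: string[]; // sentences the system should extract
  nonClaims: string[];      // background and transitions it should leave out
  references: ReferenceSnippet[];
}

/** Returns a list of problems; an empty list means the case is well-formed. */
function validateGoldCase(c: GoldCase): string[] {
  const problems: string[] = [];
  const claims = new Set(c.expectedClaims);
  const nonClaims = new Set(c.nonClaims);
  for (const sentence of [...c.expectedClaims, ...c.nonClaims]) {
    if (!c.sourceText.includes(sentence)) {
      problems.push(`not found verbatim in source: "${sentence}"`);
    }
  }
  for (const claim of c.expectedClaims) {
    if (nonClaims.has(claim)) problems.push(`labelled as both claim and non-claim: "${claim}"`);
  }
  for (const ref of c.references) {
    if (!claims.has(ref.claim)) problems.push(`reference ${ref.referenceId} targets an unlisted claim`);
  }
  return problems;
}

// Invented placeholder content, only to show how the fields fit together.
const exampleCase: GoldCase = {
  id: "example-001",
  sourceText:
    "This section summarises the main findings. Compound A lowered the event rate in the treated group [1].",
  expectedClaims: ["Compound A lowered the event rate in the treated group [1]."],
  nonClaims: ["This section summarises the main findings."],
  references: [
    {
      referenceId: "1",
      claim: "Compound A lowered the event rate in the treated group [1].",
      snippet: "Event rates were lower with Compound A than with placebo.",
      expectedVerdict: "supported",
    },
  ],
};
```

Validating cases at load time keeps labelling mistakes in the gold set from being counted as model errors.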

Scorers: Independent scorers per metric family. Offline scorers test deterministic logic with no API cost, so they are fast enough to run on every commit. Online scorers exercise live endpoints for the model-dependent metrics.
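
As an illustration of what an offline scorer can look like, the sketch below checks whether a quoted supporting passage really occurs in its source, using nothing but string operations. It is an assumption-laden sketch rather than OpenGATE's scorer: the function names are hypothetical, and so is the choice to collapse whitespace before matching so that line wrapping is not reported as a hallucination.

```typescript
// Sketch of a deterministic (offline) fidelity check. The normalisation rule
// and all names are assumptions, not OpenGATE's actual scorer.

/** Collapse runs of whitespace so line wrapping does not cause false alarms. */
function normaliseWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

/** True when the quoted passage occurs in the source document. */
function passageIsPresent(passage: string, source: string): boolean {
  const needle = normaliseWhitespace(passage);
  return needle.length > 0 && normaliseWhitespace(source).includes(needle);
}

/** Share of quoted passages that cannot be found in their source. */
function passageHallucinationRate(items: { passage: string; source: string }[]): number {
  if (items.length === 0) return 0;
  const missing = items.filter((item) => !passageIsPresent(item.passage, item.source)).length;
  return missing / items.length;
}
```

Because the check is pure string logic, it needs no API key and returns the same result on every run, which is what makes it cheap enough to run on each commit.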

Scorecard: Every run produces a scorecard stamped with the exact code version; a snapshot is written to disk so any result is reproducible and auditable.

Regression gate: Each run is diffed against a saved baseline. Any metric that gets worse is flagged, and in continuous integration the build fails. A prompt or model change cannot ship without proving it didn't make the system less reliable.
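
The baseline comparison can be sketched as a function that takes two scorecards and returns the metrics that got worse. Everything named here is hypothetical: the metric keys, the `LOWER_IS_BETTER` set, and the `tolerance` parameter are assumptions for illustration. The one point the sketch is meant to show is that direction matters, since a falling hallucination rate is an improvement while a falling F1 is a regression.

```typescript
// Sketch of a regression gate. Metric names, the direction set, and the
// tolerance parameter are assumptions for illustration.

type Scorecard = Record<string, number>;

// Metrics where a smaller value is the better one; all others are
// treated as higher-is-better.
const LOWER_IS_BETTER = new Set<string>(["hallucinationRate", "parseFailureRate"]);

interface Regression {
  metric: string;
  baseline: number;
  current: number | undefined; // undefined when the metric vanished from the run
}

function findRegressions(baseline: Scorecard, current: Scorecard, tolerance = 0): Regression[] {
  const regressions: Regression[] = [];
  for (const metric of Object.keys(baseline)) {
    const before: number | undefined = baseline[metric];
    const after: number | undefined = current[metric];
    if (before === undefined) continue;
    if (after === undefined) {
      // A metric that disappears from the scorecard is treated as a regression.
      regressions.push({ metric, baseline: before, current: undefined });
      continue;
    }
    // Positive change means "better" regardless of the metric's direction.
    const change = LOWER_IS_BETTER.has(metric) ? before - after : after - before;
    if (change < -tolerance) regressions.push({ metric, baseline: before, current: after });
  }
  return regressions;
}
```

In a CI job, a non-empty result would translate into a non-zero exit code so the build fails. A small tolerance is one way to keep run-to-run noise in online metrics from failing builds, at the cost of letting very small regressions through.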

# one command, no API key needed for the deterministic suite
$ npm run eval

OpenGATE — gold cases loaded
  ✓ citation-detection     PASS
  ⊘ claim-extraction       online scorer (pass --online)
  ⊘ verdict-accuracy       online scorer (pass --online)

First implementation: RefCheckr

OpenGATE powers the evaluation infrastructure behind RefCheckr, an evidence verification tool for medical writers. The framework isn't hypothetical — run against RefCheckr's gold set, it surfaced real reliability issues, measured the fixes, and drove a production model change on evidence rather than reputation.

Parse failures

~50% → 0

A silent failure mode found by the harness, eliminated with enforced structured output.

Passage hallucination

5.8% → 2.4%

Cut by more than half — the eval drove RefCheckr's production model switch.

Claim extraction

~0.95 F1

Near-full recall on the gold set, with a low rate of non-verbatim claims.

Full methodology, model comparisons across accuracy, hallucination, latency, and cost, and reproducibility notes are on the RefCheckr evaluation page.

Beyond RefCheckr

Although originally developed for RefCheckr, OpenGATE can evaluate any evidence-grounded AI workflow. The structure — gold sets, per-metric scorers, fidelity and hallucination checks, reproducible scorecards, and a regression gate — generalises to any system that has to prove its answers: retrieval-augmented generation, document QA, medical AI, legal AI, regulatory AI, enterprise search, and scientific assistants. The methodology travels; only the gold set changes.

Within the PharmaTools AI ecosystem, OpenGATE sits alongside PubCrawl as shared infrastructure: PubCrawl handles biomedical retrieval, OpenGATE handles evaluation, and applications like RefCheckr and Patiently AI build on both.

Open source

OpenGATE is open source. The aim is transparency, reproducibility, and community contribution around AI evaluation: anyone can read the scorers, inspect the gold-set format, rerun a scorecard, and audit how a number was produced. The framework ships with a reference adapter for RefCheckr and a hand-labelled biomedical gold set that doubles as a worked example of the case format.

Evaluation you can audit

Read the scorers, the gold-set format, and the regression gate — or see the framework applied in production.


OpenGATE is on GitHub. For questions about the framework or evaluation methodology, contact support@pharmatools.ai. See the framework in production on the RefCheckr evaluation page.

Evaluation for AI that has to prove its answers