Evaluate Your RAG System – OpenGATE Guide

The whole model, in four steps

You don't hand OpenGATE your model. You describe what a good answer looks like (gold cases), point it at your system's interface, and it does the rest — on every commit.

1 · Pick the capability that matches what your system does.

2 · Connect your system — a config file or a small adapter.

3 · Label gold cases in your own domain.

4 · Gate CI so a regression can't ship.

1 Pick your capability

Which one matches your system?

Your system…	Capability	What gets scored
Answers questions from retrieved context — RAG, document QA, legal AI, scientific assistants turnkey	`grounding`	Did it answer correctly from context, without inventing facts, and abstain when the context lacks the answer?
Extracts claims and verifies them against cited references on a graded scale	`qa`	Claim extraction quality, verdict accuracy, passage hallucination
Rewrites text for a different audience — simplification, translation, summarisation	`simplify`	Do critical facts survive the rewrite, with nothing fabricated?
Removes identifiers / PII from text	`redaction`	Are all gold identifiers actually removed?
Retrieves records from an authority — search / database wrappers	`retrieval`	Does the record match the source — no dropped or garbled fields?

Most RAG and QA teams want grounding. It's the turnkey path — no verdict taxonomy to adopt, no citation-mapping contract. The rest of this guide follows it. (The qa capability is richer but is shaped around RefCheckr's six-point support scale; if that fits you, see the adapter contract on GitHub.)

2 Connect your system

No code, or fifteen lines

If your system has an HTTP endpoint, you write zero code — just a config file pointing at it. Placeholders like ${MY_RAG_URL} read from the environment, so tokens never sit in the file.

Option A — no-code HTTP config (opengate.http.json)

{
  "name": "my-rag",
  "baseUrl": "${MY_RAG_URL}",
  "headers": { "Authorization": "Bearer ${MY_RAG_TOKEN}" },
  "endpoints": { "answer": "/api/answer" }
}

Your /api/answer receives { question, context } and returns { "text": "…" }. That's it.

Option B — a small adapter (adapters/my-rag.mjs)

// one file: your capability's method + two config hooks
export const meta = { name: 'my-rag' };
export const onlineAvailable = () => Boolean(process.env.MY_RAG_URL);
export const onlineConfigHint = () => 'Set MY_RAG_URL to run the grounding scorer.';

export async function answer({ question, context }) {
  const res = await fetch(`${process.env.MY_RAG_URL}/answer`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ question, context }),
  });
  const data = await res.json();
  return { text: data.answer };   // map your response to { text }
}

3 Write gold cases

Where your domain knowledge goes

A grounding case is a question, the context your retriever would return, and the facts a correct answer must contain — asserted as anchors with alias variants, not a verbatim expected answer. Paraphrase is fine; dropping a fact is not.

{
  "id": "refund-policy",
  "kind": "grounding",
  "question": "How many days for a refund, and is there a fee?",
  "context": "Customers may request a full refund within 30 days. There is no restocking fee for standard plans.",
  "answerAnchors": [
    { "value": "30 days", "aliases": ["30-day", "thirty days"] },
    { "value": "no restocking fee", "aliases": ["no fee"] }
  ]
}

Then add the case that catches hallucination — one where the context doesn't hold the answer. A faithful system abstains; a confident fabricator gets caught:

{
  "id": "price-not-in-context",
  "kind": "grounding",
  "question": "What is the annual price of the Enterprise plan?",
  "context": "Our Pro plan includes unlimited projects and a 30-day trial. Contact sales for volume discounts.",
  "answerable": false
}

Start with a handful covering your real failure modes; coverage grows over time. The package ships runnable examples — a SaaS refund policy, a legal termination clause, an unanswerable case — so you can copy one and edit.

4 Gate CI

Run it, save a baseline, fail the build on regressions

Run against your system. Every failure is named — "missing answer fact …", "ungrounded number …", "did not abstain" — so you know exactly what broke.

$ node src/runner.mjs --online \
    --adapter ./src/adapters/http.mjs \
    --datasets ./datasets --results ./results

OpenGATE — grounding
  ✓ grounding   PASS
      answer_recall        100.0%
      ungrounded_numbers   0
      abstention_rate      100.0%

Save a baseline once you're happy (baselines are per-adapter, so different systems don't collide), commit it, and drop in the GitHub Action:

# .github/workflows/opengate.yml
- uses: nickjlamb/opengate@v0
  with:
    datasets: ./datasets
    results: ./results        # your committed baseline lives here
    adapter: ./src/adapters/http.mjs
    online: 'true'
  env:
    MY_RAG_URL: ${{ vars.MY_RAG_URL }}
    MY_RAG_TOKEN: ${{ secrets.MY_RAG_TOKEN }}

Now any change that drops answer recall, starts inventing figures, or stops abstaining fails the build before it ships.

What OpenGATE deliberately won't do

No LLM-as-judge. Scores are deterministic checks against hand-labelled gold — reproducible, free to run in CI, and your judgment lives in the gold set, not a grader model's.
No general "quality" score. It measures whether an answer is grounded in its evidence, not whether it's fluent or helpful. Pair it with a general framework like DeepEval for breadth; use OpenGATE to gate the grounding.
It never sees your model. It calls your system's interface, exactly as a user would — no weights, no internals.

Point it at your system

The full written guide, every capability's contract, and the gold-case reference are on GitHub. Installable from npm.

Full guide on GitHub @pharmatools/opengate

Prefer the terminal? npx @pharmatools/opengate runs the offline suite in seconds. Questions, or a system that doesn't fit a capability? Open an issue. See also the OpenGATE overview.