OpenGATE — Getting Started: evaluate your own AI system | PharmaTools.AI
OpenGATE · Getting Started

Evaluate your own system in an afternoon

Evidence over plausibility.

OpenGATE calls your system, compares its output to hand-labelled gold, and fails the build when a metric drops — so a prompt or model change can't quietly break your grounding. It never sees your model or weights. Here's how to point it at your RAG, document-QA, legal, or scientific assistant.

4 steps· no LLM judge· no-code option· CI-ready

The whole model, in four steps

You don't hand OpenGATE your model. You describe what a good answer looks like (gold cases), point it at your system's interface, and it does the rest — on every commit.

1 · Pick the capability that matches what your system does.
2 · Connect your system — a config file or a small adapter.
3 · Label gold cases in your own domain.
4 · Gate CI so a regression can't ship.
1 Pick your capability

Which one matches your system?

Your system…CapabilityWhat gets scored
Answers questions from retrieved context — RAG, document QA, legal AI, scientific assistants turnkey grounding Did it answer correctly from context, without inventing facts, and abstain when the context lacks the answer?
Extracts claims and verifies them against cited references on a graded scaleqaClaim extraction quality, verdict accuracy, passage hallucination
Rewrites text for a different audience — simplification, translation, summarisationsimplifyDo critical facts survive the rewrite, with nothing fabricated?
Removes identifiers / PII from textredactionAre all gold identifiers actually removed?
Retrieves records from an authority — search / database wrappersretrievalDoes the record match the source — no dropped or garbled fields?

Most RAG and QA teams want grounding. It's the turnkey path — no verdict taxonomy to adopt, no citation-mapping contract. The rest of this guide follows it. (The qa capability is richer but is shaped around RefCheckr's six-point support scale; if that fits you, see the adapter contract on GitHub.)

2 Connect your system

No code, or fifteen lines

If your system has an HTTP endpoint, you write zero code — just a config file pointing at it. Placeholders like ${MY_RAG_URL} read from the environment, so tokens never sit in the file.

Option A — no-code HTTP config (opengate.http.json)
{ "name": "my-rag", "baseUrl": "${MY_RAG_URL}", "headers": { "Authorization": "Bearer ${MY_RAG_TOKEN}" }, "endpoints": { "answer": "/api/answer" } }

Your /api/answer receives { question, context } and returns { "text": "…" }. That's it.

Option B — a small adapter (adapters/my-rag.mjs)
// one file: your capability's method + two config hooks export const meta = { name: 'my-rag' }; export const onlineAvailable = () => Boolean(process.env.MY_RAG_URL); export const onlineConfigHint = () => 'Set MY_RAG_URL to run the grounding scorer.'; export async function answer({ question, context }) { const res = await fetch(`${process.env.MY_RAG_URL}/answer`, { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ question, context }), }); const data = await res.json(); return { text: data.answer }; // map your response to { text } }
3 Write gold cases

Where your domain knowledge goes

A grounding case is a question, the context your retriever would return, and the facts a correct answer must contain — asserted as anchors with alias variants, not a verbatim expected answer. Paraphrase is fine; dropping a fact is not.

{ "id": "refund-policy", "kind": "grounding", "question": "How many days for a refund, and is there a fee?", "context": "Customers may request a full refund within 30 days. There is no restocking fee for standard plans.", "answerAnchors": [ { "value": "30 days", "aliases": ["30-day", "thirty days"] }, { "value": "no restocking fee", "aliases": ["no fee"] } ] }

Then add the case that catches hallucination — one where the context doesn't hold the answer. A faithful system abstains; a confident fabricator gets caught:

{ "id": "price-not-in-context", "kind": "grounding", "question": "What is the annual price of the Enterprise plan?", "context": "Our Pro plan includes unlimited projects and a 30-day trial. Contact sales for volume discounts.", "answerable": false }

Start with a handful covering your real failure modes; coverage grows over time. The package ships runnable examples — a SaaS refund policy, a legal termination clause, an unanswerable case — so you can copy one and edit.

4 Gate CI

Run it, save a baseline, fail the build on regressions

Run against your system. Every failure is named — "missing answer fact …", "ungrounded number …", "did not abstain" — so you know exactly what broke.

$ node src/runner.mjs --online \ --adapter ./src/adapters/http.mjs \ --datasets ./datasets --results ./results OpenGATE — grounding grounding PASS answer_recall 100.0% ungrounded_numbers 0 abstention_rate 100.0%

Save a baseline once you're happy (baselines are per-adapter, so different systems don't collide), commit it, and drop in the GitHub Action:

# .github/workflows/opengate.yml - uses: nickjlamb/opengate@v0 with: datasets: ./datasets results: ./results # your committed baseline lives here adapter: ./src/adapters/http.mjs online: 'true' env: MY_RAG_URL: ${{ vars.MY_RAG_URL }} MY_RAG_TOKEN: ${{ secrets.MY_RAG_TOKEN }}

Now any change that drops answer recall, starts inventing figures, or stops abstaining fails the build before it ships.

What OpenGATE deliberately won't do

  • No LLM-as-judge. Scores are deterministic checks against hand-labelled gold — reproducible, free to run in CI, and your judgment lives in the gold set, not a grader model's.
  • No general "quality" score. It measures whether an answer is grounded in its evidence, not whether it's fluent or helpful. Pair it with a general framework like DeepEval for breadth; use OpenGATE to gate the grounding.
  • It never sees your model. It calls your system's interface, exactly as a user would — no weights, no internals.

Point it at your system

The full written guide, every capability's contract, and the gold-case reference are on GitHub. Installable from npm.

Prefer the terminal? npx @pharmatools/opengate runs the offline suite in seconds. Questions, or a system that doesn't fit a capability? Open an issue. See also the OpenGATE overview.