CI Quality Gates

Quality gates enforce evaluation score thresholds on every pull request. When scores drop below your configured threshold, the gate fails the check and posts a detailed comment to the PR. This prevents regressions from merging undetected.

How the Gate Works

1. Create experiment
└─ Runs your evaluator(s) against your dataset

2. Poll for completion
└─ Exponential backoff until the experiment finishes

3. Compare scores
├─ Check average score >= threshold
└─ Optionally diff against a baseline experiment

4. Post PR comment
└─ Per-evaluator scores, pass/fail badge, comparison URL

5. Enforce
└─ Exit 1 if threshold not met — fails the workflow step
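The flow above can be pictured in a few lines of Python. This is a sketch, not the action's implementation: the exponential-backoff schedule, the averaging of per-evaluator scores, and the fail-on-baseline-drop policy are assumptions for illustration.

```python
from itertools import islice  # used below to preview the schedule

def backoff_delays(base=2.0, cap=60.0):
    """Step 2: yield exponentially growing poll intervals, capped at `cap` seconds."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)

def gate(scores, threshold, baseline_avg=None):
    """Steps 3 and 5: average per-evaluator scores and return the exit code
    the workflow step would see (0 = pass, 1 = fail). Failing when the
    average drops below the baseline average is an assumed policy."""
    avg = sum(scores.values()) / len(scores)
    if avg < threshold:
        return 1
    if baseline_avg is not None and avg < baseline_avg:
        return 1
    return 0

# Two evaluators averaging 0.82 against a 0.8 threshold: the gate passes.
print(gate({"eval_helpfulness_v2": 0.90, "eval_accuracy_v1": 0.74}, 0.8))  # → 0
# First four poll intervals with base 2s, cap 8s:
print(list(islice(backoff_delays(2.0, 8.0), 4)))  # → [2.0, 4.0, 8.0, 8.0]
```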

GitHub Actions Setup

Add the Foxhound quality gate to your workflow using the composite action at .github/actions/quality-gate.

Minimal Configuration

name: AI Quality Gate

on:
  pull_request:
    branches: [main]

permissions:
  pull-requests: write

jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Foxhound Quality Gate
        uses: ./.github/actions/quality-gate
        with:
          api-key: ${{ secrets.FOXHOUND_API_KEY }}
          api-endpoint: https://api.foxhound.dev
          dataset-id: ds_abc123
          experiment-config: |
            {
              "model": "gpt-4o",
              "temperature": 0.0
            }
          threshold: "0.8"

With Baseline Comparison

Adding baseline-experiment-id enables regression detection. The gate compares your PR's scores against the baseline and reports the delta in the PR comment.

- name: Run Foxhound Quality Gate
  uses: ./.github/actions/quality-gate
  with:
    api-key: ${{ secrets.FOXHOUND_API_KEY }}
    api-endpoint: https://api.foxhound.dev
    dataset-id: ds_abc123
    evaluator-ids: "eval_helpfulness_v2,eval_accuracy_v1"
    experiment-name: "pr-${{ github.event.pull_request.number }}"
    experiment-config: |
      {
        "model": "gpt-4o",
        "temperature": 0.0
      }
    threshold: "0.8"
    baseline-experiment-id: "exp_baseline_main"
    timeout: 600
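As a sketch of what the regression report in the PR comment contains, the per-evaluator delta computation might look like the following. The score structure and the treatment of evaluators missing from the baseline are assumptions, not the action's internals.

```python
def score_deltas(current, baseline):
    """Per-evaluator deltas (current minus baseline), rounded for display.
    Evaluators absent from the baseline are assumed to score 0.0."""
    return {name: round(score - baseline.get(name, 0.0), 3)
            for name, score in current.items()}

# Helpfulness regressed slightly; accuracy improved.
print(score_deltas(
    {"eval_helpfulness_v2": 0.88, "eval_accuracy_v1": 0.80},
    {"eval_helpfulness_v2": 0.91, "eval_accuracy_v1": 0.78},
))  # → {'eval_helpfulness_v2': -0.03, 'eval_accuracy_v1': 0.02}
```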

Input Reference

Input                  | Required | Default | Description
-----------------------|----------|---------|-----------------------------------------------------------------
api-key                | Yes      |         | Foxhound API key (fox_...), stored as a repository secret
api-endpoint           | Yes      |         | Foxhound API base URL (https://api.foxhound.dev)
dataset-id             | Yes      |         | Dataset ID to run evaluations against
evaluator-ids          | No       |         | Comma-separated evaluator IDs. Omit to use all dataset evaluators
experiment-name        | No       |         | Human-readable name for this experiment run
experiment-config      | Yes      |         | JSON config for the experiment (model, temperature, etc.)
threshold              | No       | 0.0     | Minimum average score to pass (0.0–1.0)
baseline-experiment-id | No       |         | Compare against this baseline. Adds regression info to PR comment
timeout                | No       | 600     | Max seconds to wait before failing

Outputs

Output         | Description
---------------|----------------------------------------------------------
experiment-id  | The ID of the experiment created by this run
comparison-url | URL to the full comparison view in the Foxhound dashboard

Use comparison-url in subsequent steps to link directly to the evaluation results.
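For example, give the gate step an id and echo the link into the job summary in a follow-up step. This is a sketch; the step id and summary wording are arbitrary:

```yaml
- name: Run Foxhound Quality Gate
  id: gate
  uses: ./.github/actions/quality-gate
  with:
    # ...inputs as shown above...
    threshold: "0.8"

- name: Link to evaluation results
  if: always()  # run even when the gate fails the step
  run: |
    echo "Foxhound results: ${{ steps.gate.outputs.comparison-url }}" >> "$GITHUB_STEP_SUMMARY"
```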

Setting Up a Baseline Experiment

The baseline is the experiment you compare PRs against — typically the last passing experiment on main.

Option 1: Manual baseline via the dashboard

Run an experiment from the Foxhound dashboard against your golden dataset. Copy the resulting experiment-id and add it as the baseline-experiment-id input.

Option 2: Automated baseline on merge to main

name: Update Quality Gate Baseline

on:
  push:
    branches: [main]
    paths:
      - 'src/**'

jobs:
  update-baseline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Create Baseline Experiment
        uses: ./.github/actions/quality-gate
        id: baseline
        with:
          api-key: ${{ secrets.FOXHOUND_API_KEY }}
          api-endpoint: https://api.foxhound.dev
          dataset-id: ds_abc123
          experiment-name: "baseline-${{ github.sha }}"
          experiment-config: |
            { "model": "gpt-4o", "temperature": 0.0 }
          threshold: "0.0"

      - name: Store baseline experiment ID
        run: |
          echo "BASELINE_EXP_ID=${{ steps.baseline.outputs.experiment-id }}" >> $GITHUB_ENV
          # Store in your secrets store / variable for PRs to reference
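One concrete option, assuming a token with permission to write repository variables, is to persist the ID with the gh CLI. The variable name and secret name below are illustrative, not required by the action:

```yaml
- name: Persist baseline ID for PR workflows
  env:
    GH_TOKEN: ${{ secrets.GH_REPO_ADMIN_TOKEN }}  # assumed token with variables write access
  run: |
    gh variable set FOXHOUND_BASELINE_EXP_ID \
      --body "${{ steps.baseline.outputs.experiment-id }}"
```

PR workflows can then pass baseline-experiment-id: ${{ vars.FOXHOUND_BASELINE_EXP_ID }} instead of a hard-coded experiment ID.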

Threshold Selection

Start conservative and tighten over time:

Stage            | Recommended Threshold
-----------------|------------------------------------------
First setup      | 0.6 — catch major regressions only
Stable product   | 0.75–0.8 — good default
High-quality bar | 0.85–0.9 — for production-critical agents

Run the gate in warn-only mode (threshold 0.0) for a few weeks to build intuition before enforcing failures.

Permissions

The action posts a PR comment and requires:

permissions:
  pull-requests: write

Set this at the job or workflow level — not just on the step.

Troubleshooting

"Experiment timed out" — Increase timeout. Large datasets or slow evaluators may need 900–1200 seconds.

"Score below threshold" — The PR comment includes per-evaluator breakdown. Use the comparison-url to open the full diff in the Foxhound dashboard.

"Permission denied posting comment" — Ensure permissions: pull-requests: write is set at the job or workflow level.

Gate is too noisy — Reduce your dataset to the most representative traces (20–50) or raise the threshold slowly rather than all at once.