Evaluation Cookbook

Foxhound provides a layered evaluation system that lets you move from quick manual feedback all the way to automated regression prevention in CI. This cookbook covers each layer with practical how-to guides.

Evaluation Philosophy

The best evaluation strategy combines multiple signals:

Layer             Speed    Scale        Use Case
Manual scoring    Seconds  Low          Quick spot-checks, calibration
LLM-as-a-Judge    Minutes  Medium–High  Automated quality assessment
Dataset curation  Ongoing  High         Building reliable test suites
CI quality gates  Per PR   High         Regression prevention

Start with manual scoring to build intuition, use LLM-as-a-Judge evaluators to scale, curate a dataset from your best examples, then lock quality in with CI gates.

What Foxhound Evaluates

Foxhound evaluates traces: complete records of an agent run from start to finish, including every LLM call, tool invocation, and memory read. Each trace can carry scores on one or more named dimensions (e.g. helpfulness, accuracy, safety).

Scores always fall in the range 0.0–1.0 (higher is better) and can include a text rationale explaining the verdict.
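To make the score model concrete, here is a minimal Python sketch of a score record that enforces the 0.0–1.0 range and averages values per dimension. The names Score and mean_by_dimension are hypothetical and chosen for illustration only; this is not the Foxhound SDK's actual type or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Score:
    """Hypothetical in-memory model of one score (not a Foxhound SDK type)."""

    trace_id: str
    dimension: str                   # e.g. "helpfulness", "accuracy", "safety"
    value: float                     # 0.0-1.0, higher is better
    rationale: Optional[str] = None  # optional text explaining the verdict

    def __post_init__(self) -> None:
        # Reject anything outside the documented range (this also rejects NaN).
        if not 0.0 <= self.value <= 1.0:
            raise ValueError(f"score {self.value!r} is outside 0.0-1.0")


def mean_by_dimension(scores: list[Score]) -> dict[str, float]:
    """Average the score values separately for each named dimension."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for score in scores:
        grouped[score.dimension].append(score.value)
    return {dim: sum(vals) / len(vals) for dim, vals in grouped.items()}
```

Validating at construction time keeps out-of-range values from reaching any later aggregation, whichever layer produced them.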

Guides in This Section

  • Manual Scoring — Score traces from your IDE using the MCP tools foxhound_score_trace and foxhound_get_trace_scores.
  • LLM-as-a-Judge — Set up and run automated evaluators that score traces using an LLM judge.
  • Dataset Curation — Build evaluation datasets from production traces using score thresholds and bulk curation.
  • CI Quality Gates — Automate quality enforcement on pull requests with the Foxhound GitHub Action.
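
As a rough picture of how score thresholds drive the last two guides, the sketch below selects high-scoring traces for a dataset and checks a mean score against a minimum. The function names, the 0.8 defaults, and the plain-dictionary input are all assumptions made for illustration; the real curation tools and the GitHub Action may work differently.

```python
from statistics import fmean


def curate(trace_scores: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Return IDs of traces scoring at or above the threshold (hypothetical helper)."""
    # Sorting gives a stable order regardless of dictionary insertion order.
    return sorted(tid for tid, score in trace_scores.items() if score >= threshold)


def gate_passes(scores: list[float], minimum_mean: float = 0.8) -> bool:
    """Return True when the mean score meets the minimum (hypothetical helper)."""
    # An empty run has no evidence of quality, so it fails the gate.
    return bool(scores) and fmean(scores) >= minimum_mean


if __name__ == "__main__":
    production = {"trace-a": 0.9, "trace-b": 0.5, "trace-c": 0.8}
    print(curate(production))
    print(gate_passes([0.9, 0.9]))
```

Treating an empty score list as a failure is a deliberate choice in this sketch: a gate that passes when nothing was evaluated would hide a broken pipeline.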

Prerequisites

  • Foxhound SDK instrumented in your agent (see Installation)
  • At least one trace visible in the Foxhound dashboard
  • Foxhound MCP server connected to your IDE (see MCP Server Setup)