Evaluation Cookbook
Foxhound provides a layered evaluation system that lets you move from quick manual feedback all the way to automated regression prevention in CI. This cookbook covers each layer with practical how-to guides.
Evaluation Philosophy
The best evaluation strategy combines multiple signals:
| Layer | Speed | Scale | Use Case |
|---|---|---|---|
| Manual scoring | Seconds | Low | Quick spot-checks, calibration |
| LLM-as-a-Judge | Minutes | Medium–High | Automated quality assessment |
| Dataset curation | Ongoing | High | Building reliable test suites |
| CI quality gates | Per PR | High | Regression prevention |
Start with manual scoring to build intuition, use LLM-as-a-Judge evaluators to scale, curate a dataset from your best examples, then lock quality in with CI gates.
What Foxhound Evaluates
Foxhound evaluates traces — complete records of an agent run from start to finish, including every LLM call, tool invocation, and memory read. Scores are attached to traces at one or more named dimensions (e.g. helpfulness, accuracy, safety).
Scores are always in the range 0.0–1.0 (higher is better) and can have a text rationale explaining the verdict.
Guides in This Section
- Manual Scoring — Score traces from your IDE using the MCP tools
foxhound_score_traceandfoxhound_get_trace_scores. - LLM-as-a-Judge — Set up and run automated evaluators that score traces using an LLM judge.
- Dataset Curation — Build evaluation datasets from production traces using score thresholds and bulk curation.
- CI Quality Gates — Automate quality enforcement on pull requests with the Foxhound GitHub Action.
Prerequisites
- Foxhound SDK instrumented in your agent (see Installation)
- At least one trace visible in the Foxhound dashboard
- Foxhound MCP server connected to your IDE (see MCP Server Setup)