Evals

Configure the model routing rules and the test cases, then run them with the keyword grader or an LLM judge. The router is deterministic and free; running a case calls the routed model and grades the answer. Rules and cases stay in your browser; completed runs persist to the run history.

A workbench for the routing rules and a graded test suite, with a persisted run history.

  1. Type a query into the route preview to see which model the rules would pick and why.
  2. Edit the routing rules (keywords, model, reason) and the test cases (query plus expected keywords). Both stay in your browser.
  3. Click Run with the keyword grader: failed cases name the missed keywords, and Show response reveals the graded answer.
  4. Switch the grader to LLM judge and run again: each case gets a score and a rationale, shown on the case when it fails.
  5. Pin a prompt version to evaluate an older or newer system prompt against the same cases.
  6. Every run lands in the history with a pass-rate trend bar, persisted server-side.
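The keyword grader in step 3 can be sketched roughly as follows. This is a minimal illustration, not the workbench's actual code: the function names `grade_keywords` and `pass_rate`, the result shape, and the case-insensitive substring matching are all assumptions.

```python
from typing import Dict, List


def grade_keywords(response: str, expected_keywords: List[str]) -> Dict[str, object]:
    """Pass a case only if every expected keyword appears in the response.

    Assumption: matching is a case-insensitive substring check.
    """
    text = response.lower()
    # Collect the keywords the response failed to mention, in the order given.
    missed = [kw for kw in expected_keywords if kw.lower() not in text]
    return {"passed": not missed, "missed": missed}


def pass_rate(results: List[Dict[str, object]]) -> float:
    """Fraction of graded cases that passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)


# Hypothetical graded answers, for illustration only.
ok = grade_keywords("Paris is the capital of France.", ["Paris", "France"])
bad = grade_keywords("It is somewhere in Europe.", ["Paris", "Europe"])
```

Here `bad["missed"]` names the keyword the response left out, which corresponds to what a failed case displays, and `pass_rate([ok, bad])` gives the kind of figure a trend bar in the run history could plot.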

Model routing rules

Rules are checked in order: the first rule with at least one keyword that appears in the query wins, and if no rule matches, the fallback applies. This is the deterministic router the assistant uses.
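A first-match keyword router of this kind might look like the sketch below. The `Rule` class, the `route` function, the example rules, and their model names are hypothetical. Case-insensitive substring matching is an assumption, as is treating "factual-lookup" as the fallback's reason.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    keywords: Tuple[str, ...]
    model: str
    reason: str


# Fallback as shown on the page; reading "factual-lookup" as its reason is an assumption.
FALLBACK = Rule(keywords=(), model="GPT-4o mini", reason="factual-lookup")

# Hypothetical rules for illustration; the real rules are whatever you configure.
RULES = [
    Rule(("prove", "derive"), "example-reasoning-model", "multi-step-reasoning"),
    Rule(("code", "function"), "example-code-model", "code-generation"),
]


def route(query: str, rules: Sequence[Rule], fallback: Rule = FALLBACK) -> Rule:
    """Return the first rule with any keyword in the query, else the fallback."""
    text = query.lower()
    for rule in rules:
        # Skip empty keywords, which would otherwise match every query.
        if any(kw and kw.lower() in text for kw in rule.keywords):
            return rule
    return fallback
```

Because rule order decides ties, a query such as "derive a function" matches both example rules but is routed by the first one. No model is called at this stage, which is why the route preview is deterministic and free.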

Fallback: routes to GPT-4o mini (factual-lookup)

Test cases

Results

Edit the rules and cases above, then click Run to populate the stats.

Run history

No runs recorded yet. Completed runs are stored server-side and the pass-rate trend appears here.