Evals
Configure the model routing rules and the test cases, then run the cases with the keyword grader or an LLM judge. Routing is deterministic and free; running a case calls the routed model and grades its answer. Rules and cases stay in your browser; completed runs persist to the run history.
A workbench for the routing rules and a graded test suite, with a persisted run history.
- Type a query into the route preview to see which model the rules would pick and why.
- Edit the routing rules (keywords, model, reason) and the test cases (query plus expected keywords). Both stay in your browser.
- Click Run with the keyword grader: failed cases name the missed keywords, and Show response reveals the graded answer.
- Switch the grader to LLM judge and run again: each case gets a score and a rationale, shown on the case when it fails.
- Pin a prompt version to evaluate an older or newer system prompt against the same cases.
- Every run lands in the history with a pass-rate trend bar, persisted server-side.
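
To make the moving parts above concrete, here is a minimal sketch of keyword routing, keyword grading, and the pass rate. It is an illustration under stated assumptions, not the workbench's actual implementation: the names `Rule`, `Case`, `route`, `grade_keywords`, `pass_rate`, and `DEFAULT_MODEL` are hypothetical, as are the model names, and the matching behavior (case-insensitive substring match, first matching rule wins, fall back to a default model) is assumed rather than documented.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical fallback used when no rule matches (an assumption, not a documented default).
DEFAULT_MODEL = "small-model"


@dataclass(frozen=True)
class Rule:
    """One routing rule: keywords, the model to pick, and the reason shown in the preview."""
    keywords: Tuple[str, ...]
    model: str
    reason: str


@dataclass(frozen=True)
class Case:
    """One test case: a query plus the keywords expected in the answer."""
    query: str
    expected_keywords: Tuple[str, ...]


def route(query: str, rules: Sequence[Rule]) -> Tuple[str, str]:
    """Return (model, reason) for the first rule with a keyword in the query.

    Pure string matching: deterministic, and no model is called.
    """
    lowered = query.lower()
    for rule in rules:
        if any(keyword.lower() in lowered for keyword in rule.keywords):
            return rule.model, rule.reason
    return DEFAULT_MODEL, "no rule matched; using default"


def grade_keywords(response: str, case: Case) -> Tuple[bool, List[str]]:
    """Return (passed, missed_keywords) for a model response."""
    lowered = response.lower()
    missed = [k for k in case.expected_keywords if k.lower() not in lowered]
    return (not missed, missed)


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of passing cases in a run; 0.0 for an empty run."""
    return sum(results) / len(results) if results else 0.0


rules = [
    Rule(("refund", "invoice"), "billing-model", "billing topic"),
    Rule(("traceback", "error"), "code-model", "debugging topic"),
]
case = Case("How do I reset my password?", ("reset", "email"))

print(route("Why did my invoice double?", rules))  # ('billing-model', 'billing topic')
print(grade_keywords("Click Reset and check your inbox.", case))  # (False, ['email'])
print(pass_rate([True, False, True, True]))  # 0.75
```

In this sketch the failed case reports `email` as the missed keyword, mirroring how a failed case names what the answer left out. An LLM judge would replace `grade_keywords` with a model call that returns a score and a rationale, which is why that grader is not free to run.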