LLM-as-judge evaluation
Coulisse can score every agent reply with a separate LLM — a judge — and persist the result so you can track quality over time. You describe what to evaluate in the YAML rubric; Coulisse handles scoring shape, format, sampling, and storage.
This is useful for watching agent drift, comparing model/preamble changes, and catching regressions without standing up a separate evaluation pipeline.
How it works
- A client sends a chat request. The agent replies as usual — the judge never blocks the response.
- After the reply is persisted, Coulisse runs each judge the agent opted in to, in a background task.
- Each judge samples according to its `sampling_rate` (skipping entirely if the draw misses), then asks its backing model to score the assistant's reply against every rubric at once.
- The response is parsed into one `Score` row per rubric — persisted under the same user id as the conversation.
- Failures (bad JSON, provider error, timeout) are logged at `warn` and swallowed — the user already got their answer.
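In code, the flow looks roughly like the sketch below. This is an illustration, not Coulisse's source: the judge dict shape and the `call_model`/`insert_score` callables are assumed names.

```python
import json
import logging
import random

logger = logging.getLogger("coulisse.judge")

def run_judges(judges, reply_text, message_id, user_id, call_model, insert_score):
    """Background scoring loop, sketched. All names here are illustrative."""
    for judge in judges:
        # Independent sampling draw; a miss skips this judge for this turn.
        if random.random() >= judge["sampling_rate"]:
            continue
        try:
            # One LLM call scores the reply against every rubric of this judge.
            raw = call_model(judge["model"], judge["rubrics"], reply_text)
            for criterion, result in json.loads(raw).items():
                insert_score(user_id, message_id, judge["name"],
                             criterion, result["score"], result["reasoning"])
        except Exception:
            # Bad JSON, provider error, timeout: log at warn and swallow.
            # The user already got their answer.
            logger.warning("judge %s failed for message %s",
                           judge["name"], message_id)
```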
Scores are stored in the same SQLite database as messages and memories, in a `scores` table keyed by `message_id`. Averages are computed at read time, not aggregated on write.
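A plausible shape for that table, as a sketch (the real column set isn't documented here and may differ):

```python
import sqlite3

# Illustrative schema only; column names and types are assumptions.
conn = sqlite3.connect("coulisse.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS scores (
        id         INTEGER PRIMARY KEY,
        user_id    TEXT NOT NULL,
        message_id INTEGER NOT NULL,   -- keys the score to the scored reply
        judge      TEXT NOT NULL,
        criterion  TEXT NOT NULL,
        value      INTEGER NOT NULL,   -- 0..10
        reasoning  TEXT NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
```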
YAML
```yaml
agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are a helpful assistant.
    judges: [quality]  # opt in by name

  - name: translator
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: Translate into French.
    judges: [fluency]

judges:
  # Cheap, broad check — 100% of turns, small model.
  - name: quality
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 1.0
    rubrics:
      accuracy: Factual accuracy. Flag hallucinations.
      helpfulness: Whether the assistant answered the user's question.
      tone: Politeness and tone.

  # Targeted check for the translator — only 20% of turns.
  - name: fluency
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 0.2
    rubrics:
      grammar: Grammatical correctness of the French output.
      naturalness: How native the phrasing sounds.
```
The wiring is visible from the agent: when you read an agent block you see which judges score it, rather than having to hunt through the judge list to figure out coverage.
Rubrics
A rubric is a map from criterion name to a short description of what to assess.
```yaml
rubrics:
  accuracy: Factual accuracy. Flag hallucinations.
  helpfulness: Whether the assistant answered the user's question.
```
Keep descriptions terse and assessable. Don't write scale, format, or JSON instructions into them — Coulisse adds those internally. The description should tell the judge what matters, not how to answer.
Each criterion produces one `Score` row per scored turn, with its own numeric value and short reasoning. All criteria for one judge are evaluated in a single LLM call, so adding criteria to a judge doesn't multiply cost.
Scoring shape
Every score is an integer in `0..=10` with a one-sentence reasoning. Coulisse forces this shape through the preamble and parses the judge's JSON reply — you don't configure it.
If you need a different scale (e.g. boolean pass/fail, categorical), that will arrive as a future `scale:` field; the default stays numeric 0-10.
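Concretely, the judge's JSON reply might look like the sketch below. The field names (`score`, `reasoning`) are an assumption about the shape, not a documented wire format.

```python
import json

# Assumed reply shape: one object per criterion, each with a 0-10 score
# and a one-sentence reasoning.
raw = """{
  "accuracy":    {"score": 9, "reasoning": "No unsupported claims."},
  "helpfulness": {"score": 7, "reasoning": "Answered, but skipped an edge case."}
}"""

for criterion, result in json.loads(raw).items():
    assert 0 <= result["score"] <= 10   # integer in 0..=10
    print(criterion, result["score"], result["reasoning"])
```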
Sampling
`sampling_rate` controls what fraction of turns are scored.

| Value | Meaning |
|---|---|
| `1.0` (default) | Score every turn. |
| `0.1` | Roughly 10% of turns. |
| `0.0` | Never score (useful to park a judge without deleting it). |
The draw is independent per turn, per judge. Over many turns the scored fraction converges on the configured rate. Lower rates save tokens for expensive judges; broad cheap judges can run at 1.0.
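The draw is a plain independent Bernoulli trial per (turn, judge); a quick simulation shows the convergence:

```python
import random

sampling_rate = 0.2
turns = 10_000

# One independent coin flip per turn, exactly as per (turn, judge) pair.
scored = sum(random.random() < sampling_rate for _ in range(turns))
print(scored / turns)  # ≈ 0.2 over many turns
```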
Choosing a judge model
Pick a model that's different from the agent being scored whenever you can. A judge scoring its own output is biased — a cheap cross-provider judge (e.g. gpt-4o-mini judging a Claude agent, or vice versa) is usually closer to neutral.
Strong, slow models make sense for low-volume deep checks (sampling_rate: 0.1). Cheap, fast models make sense for high-volume broad checks (sampling_rate: 1.0).
Multiple judges per agent
Stack judges to get different dimensions at different cost points:
```yaml
agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    judges: [broad_check, deep_audit]

judges:
  - name: broad_check
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 1.0
    rubrics:
      helpfulness: Whether the user's question was answered.
      tone: Politeness and tone.

  - name: deep_audit
    provider: anthropic
    model: claude-opus-4
    sampling_rate: 0.05  # 5% of turns, expensive
    rubrics:
      accuracy: Factual accuracy, including references and claims.
      safety: Harmful, biased, or unsafe content.
```
Each judge is independent — its own model, rate, and rubric set. In this example `broad_check` scores every turn (rate 1.0), and roughly one turn in twenty also gets `deep_audit`, depending on the draw.
Viewing scores
The studio UI at `/admin/` shows a Scores panel per user. It surfaces two things:
- Averages — mean score per `(judge, criterion)` across every turn the user has had, with sample count.
- Recent — the most recent individual scores with reasoning.
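Because nothing is aggregated on write, the Averages panel is one read-time query away. Assuming the `scores` schema sketched earlier:

```python
import sqlite3

conn = sqlite3.connect("coulisse.db")
# Read-time aggregation: mean and sample count per (judge, criterion).
# "some-user" is a placeholder id.
rows = conn.execute("""
    SELECT judge, criterion, AVG(value), COUNT(*)
    FROM scores
    WHERE user_id = ?
    GROUP BY judge, criterion
""", ("some-user",)).fetchall()
```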
Validation at startup
Coulisse fails fast on:
- A judge referencing a provider that's not declared under `providers:`.
- A judge with no rubrics.
- A `sampling_rate` outside `[0.0, 1.0]`.
- An agent referencing a judge name that doesn't exist.
Any violation aborts startup with a message naming the offending judge or agent.
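A sketch of those checks (field names assumed; Coulisse's actual validation code will differ):

```python
def validate(config):
    """Fail-fast startup checks, sketched against an assumed config object."""
    judge_names = {j.name for j in config.judges}
    for judge in config.judges:
        if judge.provider not in config.providers:
            raise SystemExit(f"judge {judge.name!r}: unknown provider {judge.provider!r}")
        if not judge.rubrics:
            raise SystemExit(f"judge {judge.name!r}: no rubrics")
        if not 0.0 <= judge.sampling_rate <= 1.0:
            raise SystemExit(f"judge {judge.name!r}: sampling_rate outside [0.0, 1.0]")
    for agent in config.agents:
        for name in agent.judges:
            if name not in judge_names:
                raise SystemExit(f"agent {agent.name!r}: unknown judge {name!r}")
```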
Cost control
Two knobs matter:
- `sampling_rate` — the easy one. Halve it, halve the judge bill.
- Judge model — the big one. A `gpt-4o-mini` judge at 100% sampling often costs less than a `gpt-4o` judge at 10%. Pick the cheapest model that gives you a stable signal.
A useful pattern is to run a cheap judge at 100% and a strong judge at a small fraction — the cheap one catches the broad signal, the strong one spot-checks the hardest cases.
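The back-of-envelope arithmetic behind that pattern: judge spend scales as sampling rate times per-token price. The 16x price ratio below is an assumption for illustration; plug in current list prices.

```python
# Relative judge cost = sampling_rate * per-token price.
# The 16x ratio is an assumed gpt-4o / gpt-4o-mini price gap, for illustration.
mini_price, big_price = 1.0, 16.0   # relative per-token prices

cheap_broad = 1.0  * mini_price     # cheap judge on 100% of turns
strong_spot = 0.10 * big_price      # strong judge on 10% of turns
print(cheap_broad, strong_spot)     # 1.0 vs 1.6: the mini judge is cheaper
```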