Experiments (A/B testing)
Run multiple agent configurations under a single addressable name and let Coulisse pick which one serves each request. Useful for comparing models, preambles, or tool sets without changing client code.
How it works
- Define each candidate as a normal agent under agents:.
- Declare an experiment whose name is what clients send as model.
- List the candidate agents as variants and choose a strategy.
When a request arrives, the router resolves the experiment name to one variant (and optionally fires off shadow runs in the background). The variant choice is sticky-by-user by default, so the same user always lands on the same variant for a given experiment — conversation memory and persona stay consistent across turns.
Strategies
Three strategies are wired today: split, shadow, and bandit.
split
Weighted random sampling. With sticky_by_user: true (the default), the variant is chosen deterministically: a hash of (user_id, experiment_name) is mapped onto the cumulative weights, with no database writes. Because the assignment depends on the variant list, adding or removing a variant reshuffles users.
agents:
  - name: assistant-sonnet
    provider: anthropic
    model: claude-sonnet-4-5-20250929
  - name: assistant-gpt
    provider: openai
    model: gpt-4o

experiments:
  - name: assistant  # what clients send as model
    strategy: split
    variants:
      - agent: assistant-sonnet
        weight: 0.5
      - agent: assistant-gpt
        weight: 0.5
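The sticky assignment described for split can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Coulisse's actual code: the function name pick_split_variant, the use of SHA-256, and the NUL separator between the two keys are all hypothetical choices.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate


def pick_split_variant(user_id, experiment, variants):
    """Map (user_id, experiment) to a variant, proportionally to the weights.

    `variants` is a list of (agent_name, weight) pairs with positive weights.
    """
    names = [name for name, _ in variants]
    cumulative = list(accumulate(weight for _, weight in variants))
    # Hash the pair to a stable point in [0, 1]. SHA-256 is an assumption;
    # any stable hash gives the same stickiness property.
    digest = hashlib.sha256(f"{user_id}\x00{experiment}".encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64
    # Scale the point onto the cumulative weights and find its bucket.
    index = bisect_right(cumulative, point * cumulative[-1])
    # Clamp in case floating-point rounding lands exactly on the top edge.
    return names[min(index, len(names) - 1)]
```

The same user and experiment name always produce the same point, so no state has to be stored. Changing the variant list moves the bucket boundaries, which is why users get reshuffled.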
shadow
Designate one variant as primary; it serves the user normally. The other variants run in the background against the same prepared context, are scored by their judges, and never write to the user's message history. The user never waits on shadow variants.
sampling_rate (default 1.0) controls how often shadow runs fire — set it lower to cap cost.
experiments:
  - name: assistant
    strategy: shadow
    primary: assistant-sonnet
    sampling_rate: 0.25  # 25% of turns also run the shadows
    variants:
      - agent: assistant-sonnet
      - agent: assistant-gpt
Use shadow to collect comparison data before flipping a split rollout — the primary serves all real traffic while you build up scoring evidence on the challenger.
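As a rough sketch of the per-turn decision for shadow, the helper below returns the agent that serves the user plus the list of variants to run in the background. The name plan_shadow_turn and the injectable rng parameter are hypothetical, and the background execution and scoring are left out.

```python
import random


def plan_shadow_turn(primary, variants, sampling_rate, rng=random):
    """Return (serving_agent, shadow_agents) for one turn."""
    if primary not in variants:
        raise ValueError("primary must be one of the variants")
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be within [0.0, 1.0]")
    shadows = [agent for agent in variants if agent != primary]
    # rng.random() is in [0.0, 1.0), so a rate of 1.0 always fires
    # and a rate of 0.0 never does.
    if rng.random() < sampling_rate:
        return primary, shadows
    return primary, []
```

The primary is returned unconditionally, which mirrors the guarantee that the user never waits on a shadow run.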
bandit
Epsilon-greedy multi-armed bandit. Reads recent mean scores per variant from the existing scores table, picks the leader most of the time (1 - epsilon), and explores a random arm otherwise. Arms with fewer than min_samples recent scores are forced — the bandit only exploits once every arm has enough evidence.
agents:
  - name: assistant-sonnet
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    judges: [quality]
  - name: assistant-gpt
    provider: openai
    model: gpt-4o
    judges: [quality]

judges:
  - name: quality
    provider: openai
    model: gpt-4o-mini
    rubrics:
      helpfulness: Whether the assistant answered the user's question.

experiments:
  - name: assistant
    strategy: bandit
    metric: quality.helpfulness  # judge.criterion
    epsilon: 0.1
    min_samples: 30
    bandit_window_seconds: 604800  # 7 days
    variants:
      - agent: assistant-sonnet
      - agent: assistant-gpt
The configured judge (quality) and the criterion (helpfulness) must be declared on every variant agent — otherwise the bandit starves on that arm. Validation enforces this at startup.
A note on stickiness: with sticky_by_user: true (the default), the bandit decision is computed at request time via a deterministic hash of (user_id, experiment_name), so a given user typically lands on the same arm. Mean scores update as new data arrives, so a user can shift if a different arm overtakes the leader — that is the trade-off for keeping the assignment stateless.
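The selection rule can be sketched as follows. This is a minimal illustration, not the actual implementation: pick_bandit_arm, the shape of stats, the SHA-256 seed, and the way one arm is picked when several are short on samples are all assumptions.

```python
import hashlib
import random


def pick_bandit_arm(user_id, experiment, stats, epsilon, min_samples):
    """Pick an arm; `stats` maps arm name -> (recent sample count, recent mean)."""
    arms = sorted(stats)
    # Seed a private RNG from (user_id, experiment) so the same user gets
    # the same random draws on every request (assumed hashing scheme).
    digest = hashlib.sha256(f"{user_id}\x00{experiment}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Forced exploration: arms without enough recent scores go first.
    starved = [arm for arm in arms if stats[arm][0] < min_samples]
    if starved:
        return rng.choice(starved)
    # Explore a random arm with probability epsilon...
    if rng.random() < epsilon:
        return rng.choice(arms)
    # ...otherwise exploit the arm with the best recent mean score.
    return max(arms, key=lambda arm: stats[arm][1])
```

Because the random draws are fixed per user, a user's arm only changes when the inputs change, for example when a different arm takes the lead or an arm crosses the min_samples threshold.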
Namespace and migration
Experiment names share a namespace with agent names. To A/B-test an existing agent without breaking clients:
- Rename the agent (assistant → assistant-v1).
- Add a sibling agent (assistant-v2).
- Add an experiment named assistant with both as variants.
Clients keep sending model: assistant and it resolves transparently.
Variants stay individually addressable as agents under their own names (assistant-v1, assistant-v2) — useful for isolating one variant in tests or debugging.
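The shared namespace can be pictured as a single lookup that tries experiments first and agents second. The sketch below is illustrative only: AGENTS, EXPERIMENTS, resolve_model, and the choose callback are hypothetical names, and the callback stands in for whichever strategy the experiment uses.

```python
# Hypothetical in-memory view of the migrated config.
AGENTS = {"assistant-v1", "assistant-v2"}
EXPERIMENTS = {"assistant": ["assistant-v1", "assistant-v2"]}


def resolve_model(model, choose=lambda variants: variants[0]):
    """Resolve the `model` a client sent to a concrete agent name."""
    if model in EXPERIMENTS:
        # Experiment name: delegate to the strategy to pick one variant.
        return choose(EXPERIMENTS[model])
    if model in AGENTS:
        # Plain agent name: variants remain directly addressable.
        return model
    raise KeyError(f"unknown model: {model}")
```

Since startup validation rejects a name that is both an agent and an experiment, the order of the two checks does not change the result.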
Subagents
A subagent reference can name an agent or an experiment. If orchestrator lists subagents: [assistant] and assistant is an experiment, every subagent call resolves to a variant for the calling user, the same way a top-level request would. Sticky-by-user keeps the variant consistent across the whole conversation.
Give the experiment a purpose if it is exposed as a subagent. The purpose becomes the tool description that the calling agent's LLM sees:
experiments:
  - name: assistant
    purpose: A general-purpose chat assistant.
    strategy: split
    variants:
      - agent: assistant-sonnet
      - agent: assistant-gpt
Bandit subagents read mean scores at call time, so the same exploit/explore behaviour applies inside subagent dispatch.
Telemetry
Each turn's TurnStart event includes agent (the resolved variant), and when an experiment was hit, experiment (the experiment name) and variant (same as agent). Judge scores are tagged with the variant's agent name in the database, so per-variant aggregation flows through the same table without a join — used by the bandit's mean-score query and the studio's per-variant view.
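To illustrate why tagging scores with the variant's agent name avoids a join, here is a sketch of a per-variant aggregation over a single table. The table layout (columns agent, judge, criterion, score, created_at), the use of SQLite, and the function recent_means are assumptions made for the example; the real schema may differ.

```python
import sqlite3


def recent_means(conn, judge, criterion, now, window_seconds):
    """Return {agent: (sample_count, mean_score)} over the recent window."""
    rows = conn.execute(
        "SELECT agent, COUNT(*), AVG(score) FROM scores "
        "WHERE judge = ? AND criterion = ? AND created_at >= ? "
        "GROUP BY agent",
        (judge, criterion, now - window_seconds),
    )
    return {agent: (count, mean) for agent, count, mean in rows}


# Demo data in an in-memory database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scores "
    "(agent TEXT, judge TEXT, criterion TEXT, score REAL, created_at INTEGER)"
)
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?, ?, ?)",
    [
        ("assistant-sonnet", "quality", "helpfulness", 0.8, 1000),
        ("assistant-sonnet", "quality", "helpfulness", 0.6, 2000),
        ("assistant-gpt", "quality", "helpfulness", 0.5, 2000),
        ("assistant-gpt", "quality", "helpfulness", 0.9, 10),  # outside window
    ],
)
means = recent_means(conn, "quality", "helpfulness", now=2500, window_seconds=2000)
```

A result of this shape carries both the sample count and the mean for each variant, which are the two figures the bandit and the studio's per-variant view rely on.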
Studio
The studio shows configured experiments at /admin/experiments: strategy, sticky-by-user flag, and per-variant weight + share. For bandit experiments, the page additionally shows the configured metric, epsilon, and min-samples threshold, plus per-variant sample counts and mean scores (loaded inline via htmx from the judges admin endpoints). Shadow experiments call out the primary variant.
Validation
Coulisse rejects the following at startup:
- Experiment name colliding with an agent name (rename one).
- Experiment name colliding with another experiment.
- Experiment with zero variants.
- Variant referencing an undefined agent.
- Variant weight <= 0.
- Duplicate variant agent within one experiment.
- Strategy-specific fields used with the wrong strategy (e.g. primary on a split experiment).
- shadow without a primary, or with a primary that's not one of the variants.
- shadow sampling_rate outside [0.0, 1.0].
- bandit without a metric.
- bandit metric that doesn't match an existing judge.criterion, or a variant that doesn't opt into the metric's judge.
- bandit epsilon outside [0.0, 1.0].
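Most of these checks can be expressed as a single pass over the parsed config. The sketch below is a hypothetical validator, not Coulisse's own: validate_experiments and the dictionary shapes are assumptions, and the cross-check between a bandit metric and the judges declared on each variant is left out for brevity.

```python
def validate_experiments(agent_names, experiments):
    """Return a list of problems; an empty list means the config passes."""
    errors = []
    seen = set()
    for exp in experiments:
        name = exp["name"]
        strategy = exp.get("strategy")
        variants = exp.get("variants", [])
        agents_used = [v["agent"] for v in variants]

        if name in agent_names:
            errors.append(f"{name}: name collides with an agent")
        if name in seen:
            errors.append(f"{name}: name collides with another experiment")
        seen.add(name)
        if not variants:
            errors.append(f"{name}: no variants")
        if len(agents_used) != len(set(agents_used)):
            errors.append(f"{name}: duplicate variant agent")
        for v in variants:
            if v["agent"] not in agent_names:
                errors.append(f"{name}: undefined agent {v['agent']}")
            # Only check weights that are written out explicitly.
            if "weight" in v and v["weight"] <= 0:
                errors.append(f"{name}: weight for {v['agent']} must be > 0")

        if strategy == "shadow":
            if exp.get("primary") not in agents_used:
                errors.append(f"{name}: shadow needs a primary among its variants")
            if not 0.0 <= exp.get("sampling_rate", 1.0) <= 1.0:
                errors.append(f"{name}: sampling_rate outside [0.0, 1.0]")
        elif "primary" in exp or "sampling_rate" in exp:
            errors.append(f"{name}: shadow-only field used with {strategy}")

        if strategy == "bandit":
            if "metric" not in exp:
                errors.append(f"{name}: bandit needs a metric")
            if "epsilon" in exp and not 0.0 <= exp["epsilon"] <= 1.0:
                errors.append(f"{name}: epsilon outside [0.0, 1.0]")
    return errors
```

Collecting every problem before failing lets a single startup attempt report all configuration mistakes at once. Whether Coulisse stops at the first error or reports them all is not stated here.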