Coulisse
One YAML file. An OpenAI-compatible server with memory, tools, and multi-backend routing.
Coulisse is a single Rust binary that reads a coulisse.yaml file and spins up an OpenAI-compatible HTTP server. You point your existing tools, SDKs, and projects at it like any other OpenAI endpoint — and everything configurable lives in that one YAML file.
Why Coulisse?
Every multi-agent project ends up re-implementing the same plumbing:
- Per-user conversation memory
- Routing between model providers
- Rate limits and retries
- Tool integration
- Multiple agents with different system prompts
Coulisse collapses this plumbing into one configurable server. You describe the setup in YAML and pilot the whole thing from there, instead of writing glue code for each prototype.
How it works
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Your SDK / app │───────▶│ Coulisse │───────▶│ Anthropic │
│ (OpenAI client) │ │ │ │ OpenAI │
└──────────────────┘ │ coulisse.yaml │ │ Gemini … │
│ │ └──────────────────┘
│ + memory │
│ + MCP tools │ ┌──────────────────┐
│ + per-user │───────▶│ MCP servers │
└──────────────────┘ └──────────────────┘
- Your application talks to Coulisse using any OpenAI-compatible SDK.
- Coulisse picks the agent you asked for (by model name), assembles the user's memory, and calls the right backend.
- The response flows back — and the exchange is saved to that user's memory for next time.
What's in the box
| Feature | Status |
|---|---|
| Multi-agent routing | ✅ Working |
| Per-user memory | ✅ Persistent (SQLite) with semantic recall |
| Real embedders | ✅ OpenAI + Voyage (hash fallback for offline dev) |
| Auto-extraction | ✅ Optional — pulls durable facts from each exchange |
| MCP tool integration | ✅ Working (stdio + HTTP) |
| Multi-backend support | ✅ Anthropic, OpenAI, Gemini, Cohere, Deepseek, Groq |
| OpenAI-compatible API | ✅ /v1/chat/completions, /v1/models |
| Streaming responses | ✅ Server-Sent Events |
| Rate limiting | ✅ Per-user token quotas (hour / day / month, in-memory) |
| Studio UI | ✅ Read-only at /admin/ |
| Workflow orchestration | ⏳ Planned |
| Durable rate-limit state | ⏳ Planned |
Continue to Installation to get started.
Stability
Coulisse is pre-1.0. It follows Semantic Versioning, but
during the 0.x phase, minor version bumps (0.1 → 0.2) may include breaking
changes to the YAML schema, HTTP surface, or CLI. Patch bumps (0.1.0 → 0.1.1)
will not. See the Releasing chapter and
CHANGELOG.md
for the version history.
Installation
Coulisse is a single Rust binary. Install it from a prebuilt release or build from source.
Requirements
- A valid API key for at least one supported provider
Install from a release
The latest GitHub Release ships installers for macOS (x86 + ARM), Linux GNU (x86 + ARM), and Windows MSVC.
macOS / Linux:
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/Almaju/coulisse/releases/latest/download/coulisse-installer.sh | sh
Windows (PowerShell):
powershell -ExecutionPolicy Bypass -c "irm https://github.com/Almaju/coulisse/releases/latest/download/coulisse-installer.ps1 | iex"
The installer drops the coulisse binary on your PATH.
Build from source
Requires Rust (edition 2024) — install from rustup.rs.
git clone https://github.com/Almaju/coulisse.git
cd coulisse
cargo build --release
The binary lands at target/release/coulisse. Drop it on your PATH
(or alias it) so the rest of this guide can call it as coulisse.
Initialize a config
coulisse init
This writes a minimal coulisse.yaml in the current directory: one
OpenAI agent, sqlite memory, the offline hash embedder. Run
coulisse init --from-example instead for the full annotated tour
covering every section.
Edit the file to set your provider API key.
Start the server
coulisse start
start runs the server detached: it returns immediately and the
process keeps running in the background. Stop it later with
coulisse stop.
To run attached (logs streaming to your terminal), use
coulisse start --foreground — or just coulisse with no subcommand.
Either form binds port 8421.
You should see a startup banner like:
coulisse 0.1.0
Proxy → http://localhost:8421/v1
Admin → http://localhost:8421/admin
Memory sqlite at ./.coulisse/memory.db; embedder=hash (dims=256, OFFLINE — no semantic understanding)
Auth proxy: open · admin: open
Agents (1)
assistant openai / gpt-4o-mini
The exact lines depend on your config — what matters is that memory, auth, and every configured agent are each acknowledged on startup.
Next: write your first config, or read the CLI reference for every subcommand.
Your first config
A minimal coulisse.yaml has two things: a provider (where to send model calls) and an agent (how to call it).
providers:
  anthropic:
    api_key: sk-ant-your-key-here

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are a helpful assistant.
Save this as coulisse.yaml in your working directory, then run coulisse.
What each piece does
providers
A map of provider kind → credentials. The key must be one of the supported kinds (see Providers). You only need to list the providers you actually use.
agents
A list of agents. Each agent is a named recipe:
- name — the identifier. Clients ask for the agent by this name via the model field in their request.
- provider — which configured provider to route to.
- model — the upstream model identifier to call (e.g. claude-sonnet-4-5-20250929, gpt-4o).
- preamble — optional system prompt prepended to every conversation.
You can define as many agents as you want — see Multi-agent routing for what that unlocks.
Adding more
Want a code reviewer, a pirate, and a tool-using agent? Just add more entries:
providers:
  anthropic:
    api_key: sk-ant-...
  openai:
    api_key: sk-...

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are a helpful assistant.

  - name: code-reviewer
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: |
      You are a thorough code reviewer. Focus on correctness,
      clarity, and security.

  - name: gpt-assistant
    provider: openai
    model: gpt-4o
    preamble: You are a helpful assistant.
Restart the server — all three agents are now selectable by model name.
Next: make a request.
Making a request
Coulisse exposes an OpenAI-compatible API, so any OpenAI SDK works. Point the client at http://localhost:8421/v1 and set the model field to an agent name from your config.
curl
curl http://localhost:8421/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "assistant",
    "safety_identifier": "user-123",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
Python (openai SDK)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8421/v1",
    api_key="not-needed",  # Coulisse doesn't check this
)

response = client.chat.completions.create(
    model="assistant",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"safety_identifier": "user-123"},
)

print(response.choices[0].message.content)
TypeScript / JavaScript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8421/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "assistant",
  messages: [{ role: "user", content: "Hello!" }],
  // @ts-expect-error — extra field passed through
  safety_identifier: "user-123",
});

console.log(response.choices[0].message.content);
The safety_identifier field
Coulisse identifies users through the safety_identifier field (or the deprecated user field, which works too). The identifier is what keeps each user's conversation history isolated.
You can turn this off — see User identification — but by default every request needs one.
Listing available agents
curl http://localhost:8421/v1/models
Returns every agent you've defined, in OpenAI's model-list format.
That's the whole loop. Next, dig into how to configure providers.
Providers
Providers are where your model calls actually go. Configure each provider once with its credentials; reference it by name from any number of agents.
Supported providers
| Kind | Config key |
|---|---|
| Anthropic | anthropic |
| Cohere | cohere |
| Deepseek | deepseek |
| Gemini | gemini |
| Groq | groq |
| OpenAI | openai |
Shape
providers:
  anthropic:
    api_key: sk-ant-...
  openai:
    api_key: sk-...
  gemini:
    api_key: ...
Each provider takes a single field: api_key. You only need to list the providers you plan to use — unused ones can be omitted entirely.
Validation
When Coulisse loads your config, it checks that every agent's provider field matches a key under providers. Misspell a provider and startup fails with a clear error:
agent 'assistant' references provider 'antropic' which is not configured
Switching providers
Because providers are referenced by name, switching an agent from one backend to another is a one-line change:
agents:
  - name: assistant
    provider: anthropic               # ← change this …
    model: claude-sonnet-4-5-20250929 # ← … and this
    preamble: You are helpful.
No client code changes, no redeployment of downstream apps. See Multi-backend support for more on mixing providers.
Agents
Agents are the named personas clients can talk to. Each agent pins down:
- Which provider to call
- Which upstream model to ask for
- What system prompt to prepend
- Which tools (if any) to expose
Shape
agents:
  - name: code-reviewer
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: |
      You are a thorough code reviewer. Focus on correctness,
      clarity, and security. Point out subtle bugs and suggest
      concrete improvements.
    mcp_tools:
      - server: hello
        only:
          - say_hello
Fields
name (required)
The agent identifier. Clients select this agent by passing name as the model field in their request. Names must be unique across the config.
provider (required)
Must match a key under the top-level providers map. Tells Coulisse which backend to route through.
model (required)
The upstream model identifier. This is provider-specific — e.g. claude-sonnet-4-5-20250929 for Anthropic, gpt-4o for OpenAI, gemini-2.0-flash for Gemini.
preamble (optional)
A system prompt prepended to every conversation this agent handles. Use it to define tone, expertise, constraints, output format — anything you'd normally put in a system message.
Defaults to empty. YAML block scalars (|) are handy for multi-line preambles.
judges (optional)
A list of judge names (from the top-level judges: block) that evaluate this agent's replies in the background. Empty or omitted = no evaluation. See LLM-as-judge evaluation for the full story.
agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    judges: [quality, deep_audit]
mcp_tools (optional)
A list of MCP servers and tools this agent is allowed to use. See MCP tools for the full story.
mcp_tools:
  - server: hello       # all tools from "hello"
  - server: calculator  # all tools from "calculator"
    only:               # …but only these specific ones
      - add
      - multiply
subagents (optional)
A list of other agent names exposed to this agent as callable tools. When the agent's model decides to invoke one, Coulisse starts a fresh conversation against that agent and returns its final message as the tool result.
subagents: [onboarder, resume_critic]
Each name must refer to another entry under agents. Self-reference and duplicates are rejected at startup. Nested invocations are capped at depth 4 to prevent runaway loops. See Multi-agent routing for the full walkthrough.
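The depth cap can be pictured as a counter threaded through every nested invocation. A minimal sketch in Python (hypothetical names and shapes; Coulisse's internals differ):

```python
MAX_SUBAGENT_DEPTH = 4  # documented cap on nested invocations

def invoke_subagent(agents: dict, name: str, prompt: str, depth: int = 0) -> str:
    # Each nested call increments depth; past the cap the invocation is
    # refused instead of looping forever.
    if depth >= MAX_SUBAGENT_DEPTH:
        return "error: subagent depth limit reached"
    handler = agents[name]
    return handler(prompt, depth + 1)
```

A subagent whose own model decides to call another subagent simply passes the incremented depth along, so a mutual-recursion loop bottoms out after four levels.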
purpose (optional)
A short tool description shown to other agents when this one is listed under their subagents. Keep it concrete — it's how a calling agent's model decides when to invoke this specialist. Omit it for agents that are only used directly by clients (never as subagents); the fallback description is "Invoke the '<name>' subagent.", but a hand-written purpose is what makes multi-agent orchestration reliable.
purpose: Critique and rewrite a resume for a target role.
Runtime overrides
Agents can also be created, edited, and disabled at runtime through the admin UI or HTTP without touching coulisse.yaml. These runtime entries live in the SQLite database alongside conversation memory and judge scores; the YAML file is never modified by the server.
The resolution rule is simple: when a name is requested, the database is checked first. If a row exists there, it wins. Otherwise the YAML entry (if any) is used. A row can also be a tombstone — a marker that disables a YAML-declared name without removing it from the file.
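The rule is small enough to sketch. This is an illustration in Python with hypothetical data shapes, not Coulisse's actual code:

```python
def resolve_agent(name: str, db_rows: dict, yaml_agents: dict):
    # Database row wins; a tombstone row disables the name entirely;
    # otherwise fall back to the YAML entry, if any.
    row = db_rows.get(name)
    if row is not None:
        if row.get("tombstone"):
            return None       # hidden, even if YAML still declares it
        return row            # database override shadows YAML
    return yaml_agents.get(name)
```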
Each runtime row carries a label visible in the admin UI:
- yaml — the agent comes from coulisse.yaml; no database row exists.
- dynamic — created via the admin UI or HTTP; no YAML entry of this name.
- override — both YAML and the database define this name; the database version is what runs.
- tombstoned — a database row disables this name; the agent is hidden from clients even if YAML still declares it.
A "Reset to YAML" action on an override deletes the database row, letting the YAML version reassert. The same action on a tombstoned row re-enables the agent. Database edits never modify the YAML file: if you want a change to survive a database wipe, edit the YAML.
Several agents, one config
Define as many agents as you want. A common pattern is having variants of the same model with different preambles:
agents:
  - name: friendly
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are warm and encouraging.

  - name: terse
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: Reply in one sentence. No preamble, no filler.

  - name: pirate
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: Respond exclusively as a pirate, arrr.
Clients switch between them by changing the model field — no server redeploy, no code change.
MCP tools
Coulisse can borrow tools from Model Context Protocol servers and hand them to your agents. Two transports are supported:
- stdio — Coulisse spawns a local command and talks to it over stdin/stdout.
- http — Coulisse connects to a running Streamable-HTTP MCP endpoint.
Declaring MCP servers
Add an mcp section with a named entry per server:
mcp:
  hello:
    transport: stdio
    command: uvx
    args:
      - --from
      - git+https://github.com/macsymwang/hello-mcp-server.git
      - hello-mcp-server
  calculator:
    transport: http
    url: http://localhost:8080
stdio fields
- transport: stdio
- command (required) — the executable to spawn (uvx, python, node, …)
- args (optional) — arguments to pass
- env (optional) — environment variables for the child process
mcp:
  my-tool:
    transport: stdio
    command: python
    args: [-m, my_mcp_server]
    env:
      DEBUG: "1"
      API_KEY: abc123
http fields
- transport: http
- url (required) — the endpoint URL
Granting tool access to agents
An agent only sees tools you explicitly give it. Reference the server name under mcp_tools:
agents:
  - name: helper
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    mcp_tools:
      - server: hello  # all tools from "hello"
Restrict to a subset with only:
mcp_tools:
  - server: hello
    only:
      - say_hello  # only this tool, nothing else
Discovering tool names
On startup Coulisse connects to each MCP server and logs the tools it discovered. Tool names in your only list must match what the server advertises — check the startup output or the server's own docs.
How tool calls work
When a request arrives for an agent with tools:
- Coulisse collects the agent's allowed tools from the MCP servers.
- It forwards them to the model as tool definitions.
- If the model calls a tool, Coulisse dispatches to the MCP server and feeds the result back.
- This loops until the model produces a final answer (up to 8 turns).
Your client doesn't see any of this — the tool loop is invisible, and only the final assistant message is returned.
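The loop above can be sketched as follows. The call_model and dispatch callables are hypothetical stand-ins for the model backend and the MCP dispatch; Coulisse's internals differ:

```python
MAX_TOOL_TURNS = 8  # documented cap on the tool loop

def run_tool_loop(call_model, dispatch, messages: list) -> str:
    # Keep calling the model until it answers without requesting a tool,
    # or the turn cap is hit.
    for _ in range(MAX_TOOL_TURNS):
        reply = call_model(messages)
        if not reply.get("tool_calls"):
            return reply["content"]  # final assistant message
        for call in reply["tool_calls"]:
            result = dispatch(call["name"], call["args"])
            messages.append({"role": "tool", "name": call["name"], "content": result})
    raise RuntimeError("tool turn limit reached")
```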
See MCP tool integration for a full walkthrough.
Memory
The memory: block in coulisse.yaml controls where data is stored, which embedder turns text into vectors, and whether auto-extraction runs after each turn. Every field has a sensible default — omit the block entirely and Coulisse falls back to an on-disk SQLite file and the offline hash embedder.
Shape
memory:
  backend:
    kind: sqlite                  # 'sqlite' (default) or 'in_memory'
    path: ./coulisse-memory.db    # sqlite only
  embedder:
    provider: openai              # 'openai', 'voyage', or 'hash'
    model: text-embedding-3-small # required for openai/voyage
    # api_key: <override>         # optional — falls back to providers.openai.api_key
  extractor:                      # omit to disable auto-extraction
    provider: anthropic           # one of providers.* keys
    model: claude-haiku-4-5-20251001
    dedup_threshold: 0.9          # optional
    max_facts_per_turn: 5         # optional
  context_budget: 8000            # optional
  memory_budget_fraction: 0.1     # optional
  recall_k: 5                     # optional
memory.backend
| Field | Type | Required | Notes |
|---|---|---|---|
kind | enum | yes | sqlite or in_memory. |
path | string | no | Filesystem path for sqlite. Created if missing. Default ./coulisse-memory.db. |
in_memory is a SQLite database that lives only for the process lifetime — use it for tests or throw-away demos. sqlite is the production default; for Docker, point path at a volume-mounted location (e.g. /var/lib/coulisse/memory.db).
memory.embedder
| Field | Type | Required | Notes |
|---|---|---|---|
provider | enum | yes | openai, voyage, or hash. |
model | string | depends | Required for openai and voyage. Ignored for hash. |
api_key | string | no | Falls back to providers.<provider>.api_key when unset. |
dims | int | no | Hash only. Default 32. |
Supported models
- openai: text-embedding-3-small (1536 dims, default), text-embedding-3-large (3072 dims), text-embedding-ada-002 (1536 dims).
- voyage: voyage-3.5 (1024, default), voyage-3-large (1024), voyage-3.5-lite (1024), voyage-code-3 (1024), voyage-finance-2 (1024), voyage-law-2 (1024), voyage-code-2 (1536).
Unknown model names fail at startup with a clear error.
Which to pick
- Using Anthropic for completions? Anthropic has no embedding API — use Voyage (their official recommendation).
- Using OpenAI? Stay on OpenAI for consistency.
- Offline / air-gapped? Use hash — it has no semantic understanding but is fast and deterministic.
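To make that trade-off concrete, here is one way a hash embedder can work. This is an illustration only, not Coulisse's actual algorithm: identical text always yields the identical vector, but synonyms share nothing.

```python
import hashlib

def hash_embed(text: str, dims: int = 32) -> list[float]:
    # Each token is hashed into one of `dims` buckets, then the vector is
    # L2-normalized. Deterministic and offline, but "car" and "automobile"
    # land in unrelated buckets — no semantic understanding.
    vec = [0.0] * dims
    for token in text.lower().split():
        h = int.from_bytes(hashlib.sha256(token.encode()).digest()[:8], "big")
        vec[h % dims] += 1.0
    norm = sum(x * x for x in vec) ** 0.5 or 1.0
    return [x / norm for x in vec]
```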
memory.extractor
Omit this block to disable auto-extraction. When present:
| Field | Type | Required | Notes |
|---|---|---|---|
provider | string | yes | Must match a key under top-level providers:. |
model | string | yes | Upstream model identifier. Prefer the cheapest usable model. |
dedup_threshold | float | no | Cosine similarity above which an extracted fact is considered a duplicate. Default 0.9. |
max_facts_per_turn | int | no | Cap on facts written per exchange. Default 5. |
The extractor runs as a background task after each successful completion — it never blocks the HTTP response. Failures are logged at warn and swallowed.
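The dedup check is a cosine-similarity comparison of the new fact's embedding against facts already stored. A sketch of the idea, with hypothetical helper names:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def is_duplicate(new_vec, existing_vecs, dedup_threshold: float = 0.9) -> bool:
    # A fact is dropped when it is too similar to one already stored.
    return any(cosine(new_vec, v) >= dedup_threshold for v in existing_vecs)
```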
Budget knobs
| Field | Default | Meaning |
|---|---|---|
context_budget | 8,000 tokens | Total window for messages + memories. |
memory_budget_fraction | 0.1 (10%) | Share of the budget reserved for recalled memories. |
recall_k | 5 | Top-k memories fetched per request. |
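How the three knobs interact can be sketched as: reserve a fraction of the context budget for memories, fetch the top-k candidates by similarity, and keep only what fits. Hypothetical function and token counts, not Coulisse's actual code:

```python
def select_memories(candidates, context_budget=8000,
                    memory_budget_fraction=0.1, recall_k=5):
    # candidates: (text, token_count) pairs, best similarity first.
    budget = int(context_budget * memory_budget_fraction)  # 800 tokens by default
    picked, used = [], 0
    for text, tokens in candidates[:recall_k]:
        if used + tokens > budget:
            break
        picked.append(text)
        used += tokens
    return picked
```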
Startup log line
On boot, Coulisse prints the memory config it resolved:
memory: sqlite at ./coulisse-memory.db; embedder=openai / text-embedding-3-small
extractor: anthropic / claude-haiku-4-5-20251001 (dedup_threshold=0.9, max_facts_per_turn=5)
Or when the extractor is off:
extractor: disabled (memory only grows via explicit API calls)
Example configs
OpenAI end-to-end
providers:
  openai:
    api_key: sk-...

memory:
  embedder:
    provider: openai
    model: text-embedding-3-small
  extractor:
    provider: openai
    model: gpt-4o-mini
Anthropic completions + Voyage embeddings
providers:
  anthropic:
    api_key: sk-ant-...

memory:
  embedder:
    provider: voyage
    model: voyage-3.5
    api_key: pa-... # Voyage is not under providers: so set the key here
  extractor:
    provider: anthropic
    model: claude-haiku-4-5-20251001
Offline dev — no external calls
memory:
  backend:
    kind: in_memory # ephemeral; evaporates on restart
  embedder:
    provider: hash
  # no extractor, no embeddings API calls, no persistence
Telemetry
The telemetry: block controls observability — what Coulisse logs to stderr, what it persists to SQLite for the studio UI, and whether it ships traces to your own OpenTelemetry backend.
Every field has a sensible default. Omit the block and you get stderr logs at info plus the studio's per-turn event tree, with no external traces.
Shape
telemetry:
  fmt:
    enabled: true # stderr logs; default on
  sqlite:
    enabled: true # mirrors spans into the studio's tables; default on
  otlp:           # absent = disabled (default)
    endpoint: "http://localhost:4317"
    protocol: grpc # or http_binary
    service_name: coulisse
    headers:
      authorization: "Bearer ${OTEL_API_KEY}"
All three layers compose. Turn sqlite off if you don't need the studio. Add otlp to ship the same traces to Grafana, SigNoz, Jaeger, Honeycomb, or any OTLP-compatible backend.
telemetry.fmt
| Field | Type | Required | Notes |
|---|---|---|---|
enabled | bool | no | Default true. |
Writes structured logs to stderr. The level is controlled by the RUST_LOG environment variable; without it, the default is info,sqlx=warn (info from Coulisse, warnings only from the SQL driver). To see internal SQL traffic, run with RUST_LOG=debug. To silence everything, set RUST_LOG=error.
telemetry.sqlite
| Field | Type | Required | Notes |
|---|---|---|---|
enabled | bool | no | Default true. |
Mirrors turn and tool_call tracing spans into the events and tool_calls tables that the studio UI reads. Without this layer, the studio loses its per-turn event tree and tool-call panel.
The schema is part of the same SQLite file the rest of Coulisse persists into (controlled by memory.backend.path).
telemetry.otlp
Absent (the default) means Coulisse does not export traces externally. To plug Coulisse into your own observability stack, set the block:
| Field | Type | Required | Notes |
|---|---|---|---|
endpoint | string | yes | Collector URL. |
protocol | enum | no | grpc (default) or http_binary. |
service_name | string | no | OpenTelemetry resource attribute service.name. Default coulisse. |
headers | map | no | Static HTTP/gRPC headers attached to every export. |
Endpoint defaults
- gRPC (the default): port 4317, e.g. http://localhost:4317.
- HTTP/protobuf: port 4318, e.g. http://localhost:4318/v1/traces.
The collector you point at decides the rest — Coulisse ships traces with service.name = coulisse and span names turn, tool_call, and llm_call. Span fields carry user_id, turn_id, agent, tool_name, kind, and the rest documented in the features chapter.
Headers
Useful for managed backends:
telemetry:
  otlp:
    endpoint: "https://ingest.us.signoz.cloud:443"
    protocol: grpc
    headers:
      "signoz-access-token": "${SIGNOZ_TOKEN}"
YAML doesn't expand ${...} itself; substitute at deploy time (helm, envsubst, sops, etc.).
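As one illustration of deploy-time substitution, Python's os.path.expandvars performs the same ${VAR} expansion that envsubst does:

```python
import os

# Assume the secret is provided by the deploy environment;
# the value here is illustrative.
os.environ["SIGNOZ_TOKEN"] = "tok-123"

template = 'headers:\n  "signoz-access-token": "${SIGNOZ_TOKEN}"\n'
rendered = os.path.expandvars(template)
print(rendered)
```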
How the layers compose
The cli installs a single tracing_subscriber registry with the layers your config asked for, in order:
1. the RUST_LOG env filter
2. fmt → stderr (when fmt.enabled)
3. sqlite → the events + tool_calls tables (when sqlite.enabled)
4. otlp → external collector (when otlp is set)
Every span emitted by the running server fans out to all enabled layers. There is no priority or fallback — the SQLite layer keeps full payloads (full prompts, args, results), the OTLP layer ships the same span attributes to your collector. If your backend chokes on multi-megabyte attributes, drop those fields in your collector pipeline rather than at the source.
User identification
Coulisse keeps separate memory per user. To do that, it needs to know who is making each request.
How users are identified
Requests identify the user via one of these fields, in order:
1. safety_identifier (preferred — matches OpenAI's recent schema)
2. user (deprecated, but still accepted)
{
  "model": "assistant",
  "safety_identifier": "alice@example.com",
  "messages": [...]
}
The identifier can be anything — an email, an internal user ID, a UUID, an opaque token. Coulisse derives a stable internal UUID from it:
- If you pass a valid UUID, that's what's used.
- Otherwise, a deterministic v5 UUID is derived from the string, so the same identifier always maps to the same user.
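Python's uuid module can illustrate the scheme. The namespace below is a placeholder; Coulisse's actual v5 namespace is an internal detail:

```python
import uuid

NAMESPACE = uuid.NAMESPACE_URL  # placeholder, not Coulisse's namespace

def stable_user_id(identifier: str) -> uuid.UUID:
    try:
        return uuid.UUID(identifier)              # already a valid UUID: use it
    except ValueError:
        return uuid.uuid5(NAMESPACE, identifier)  # deterministic v5 otherwise
```

The same string always derives the same UUID, so memory lookups stay stable across restarts and across processes.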
Requiring identification
By default, Coulisse requires every request to carry an identifier. Unidentified requests are rejected with an error. This is the safe default: memory only works if you know who you're talking to.
default_user_id: a single-user fallback
For local development or single-user deployments, you can declare a default_user_id in coulisse.yaml. When a request arrives without safety_identifier or user, Coulisse acts as if that default had been passed.
default_user_id: main # everyone's anonymous requests bucket here

providers:
  anthropic:
    api_key: sk-ant-...

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
With a default_user_id set:
- Requests that omit both safety_identifier and user fall back to the default. They get memory like any other user — just scoped to that shared bucket.
- Requests that do include an identifier still get their own scope.
- All anonymous requests share one memory bucket and one rate-limit counter, because they all map to the same id.
When to set it
Good reasons:
- Local / single-user setups where you don't want to bother sending an identifier.
- Small deployments behind an auth layer that handles identity upstream but doesn't want to plumb it through.
Don't set default_user_id in multi-tenant deployments — every user would share one bucket, which defeats isolation. Leave it unset so missing identifiers are rejected.
Studio UI
Coulisse ships a studio UI for browsing the conversations and memories the server has seen, and for editing the live YAML config. It's served by the same binary, under /admin/.
Point a browser at http://localhost:8421/admin/ while the server is running.
What you can do
- List every user the server has seen, most recent activity first, with message and memory counts.
- Open a user to see their full conversation (user, assistant, and system messages) with per-message token counts and relative timestamps.
- See every tool invocation that happened during each assistant turn — rendered inline in the conversation as a collapsed block above the assistant bubble. Expand to see the args, the result (or error body), and a badge marking MCP vs subagent calls. This is the debug view for figuring out what the agent tried and what came back.
- Open the per-turn Telemetry block under any assistant message to see the full causal tree that produced it: every tool call (MCP or subagent) at every depth, with args, result, error, and duration. Unlike the inline top-level tool calls, the telemetry tree also surfaces tool calls made inside subagents — so when a subagent's MCP call fails, the real error is right there instead of being paraphrased into the assistant's text.
- See the long-term memories recalled for that user, tagged as fact or preference.
- See the LLM-as-judge scores for that user, including mean score per (judge, criterion) and the most recent individual scores with reasoning.
- Browse configured experiments at /admin/experiments — strategy, sticky-by-user flag, per-variant weights, and bandit-strategy mean scores live-loaded from judges.
- Run smoke tests at /admin/smoke — a synthetic-user persona drives a real conversation against any agent or experiment, scores fan out through the same judge pipeline, and the run viewer shows the full transcript with persona/assistant turns side by side. Useful for iterating on agent prompts without writing test scaffolding.
- Edit, add, or disable agents, judges, experiments, and smoke tests at /admin/agents, /admin/judges, /admin/experiments, and /admin/smoke. Each form is a YAML textarea over the same config shape used in coulisse.yaml. Edits and creations write to the database, never to coulisse.yaml; runtime resolution checks the database first, then falls back to YAML. List views label each row as yaml, dynamic (database-only), override (database shadows YAML), or tombstoned (disabled). Override rows expose a "Reset to YAML" action that drops the database row so the YAML version reasserts. See Agents → Runtime overrides for the full semantics — judges, experiments, and smoke tests follow the same model.
Editing config: admin UI = API
Every admin route is content-negotiated. The same URL serves an HTML page in a browser, an HTML fragment to htmx, and JSON to a script — whichever the client's Accept/HX-Request headers ask for. The UI is a thin representation of the API; nothing the UI can do is unavailable to a curl call.
# List agents as JSON (effective merged view: database overrides + YAML)
curl -H 'Accept: application/json' http://localhost:8421/admin/agents

# Update an agent (writes to the database, not to coulisse.yaml)
curl -X PUT http://localhost:8421/admin/agents/bob \
  -H 'Content-Type: application/yaml' \
  --data-binary $'name: bob\nprovider: openai\nmodel: gpt-4o\n'

# Reset an override or tombstone — drops the database row, YAML reasserts
curl -X POST http://localhost:8421/admin/agents/bob/reset

# Replace the whole config file in one shot (this writes to coulisse.yaml)
curl -X PUT http://localhost:8421/admin/config \
  -H 'Content-Type: application/yaml' \
  --data-binary @coulisse.yaml
Agent writes through /admin/agents go to the database, never to coulisse.yaml. Other sections (/admin/config, providers, judges, experiments, smoke tests, etc.) still write to YAML. The two write paths are independent: editing an agent in the database has no effect on the file you committed to git.
File watcher: hand-edits hot-reload
Coulisse watches coulisse.yaml while it runs. Edit it in your editor, save, and the live config updates without a restart. The validator runs before any reload — a broken edit is logged and the previous in-memory config keeps serving traffic until you fix the file.
What hot-reloads today: the agents list (runtime + admin display), the judges and experiments lists (admin display only — the routing tables that consume them are still rebuilt on restart). What still requires restart: providers, MCP servers, memory backend, telemetry pipeline, auth.
YAML formatting
Admin saves go through serde_yaml round-trip serialization, so comments, blank lines, and key ordering are not preserved. If you want commented config, hand-edit the file — the watcher picks the change up the same way an admin save would. Comment-preserving writes are tracked as a follow-up.
Authentication
The admin surface is gated by the auth.admin scope in coulisse.yaml. Two mutually exclusive modes: HTTP Basic auth (good for local dev) or OIDC single sign-on (appropriate for shared deployments). Exactly one belongs under auth.admin.
The /v1/chat/completions and /v1/models endpoints use the separate auth.proxy scope — they are never gated by admin auth. SDK clients stay cookie-free even when the studio runs behind OIDC.
Basic auth
auth:
  admin:
    basic:
      password: choose-something-strong
      username: admin # optional, defaults to "admin"
Every /admin/* request must carry Authorization: Basic <base64(user:pass)>. Browsers prompt via the native login dialog and cache credentials per origin.
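The header value is just the base64 encoding of user:pass, which you can construct in any language, for example:

```python
import base64

# Credentials from the config example above.
creds = base64.b64encode(b"admin:choose-something-strong").decode()
header = f"Authorization: Basic {creds}"
print(header)
```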
OIDC (single sign-on)
Works with any OIDC-compliant IdP: Authentik, Keycloak, Auth0, Google, Microsoft, Okta.
auth:
  admin:
    oidc:
      issuer_url: https://authentik.example.com/application/o/coulisse/
      client_id: coulisse-admin
      client_secret: <confidential-client-secret> # omit for public PKCE clients
      redirect_url: http://localhost:8421/admin/
      scopes: [email, profile] # optional; openid is always added
On first request, the user is redirected to the IdP to log in; afterwards an encrypted session cookie keeps them authenticated on /admin/* until it expires (8 hours of inactivity).
Access control (who may log in) is delegated to the IdP. Coulisse treats "successfully authenticated by your IdP" as "authorized admin" — configure the allow-list in the IdP's application policy, not here.
Authentik setup: create a new OAuth2/OpenID Provider and Application, set the redirect URI to the redirect_url above (Authentik allows every subpath of it by default), and point Coulisse at the issuer URL of the provider. Add the application to the groups that should have access via Authentik bindings.
Sessions are in-memory: they evaporate on restart — users re-authenticate silently if their IdP session is still valid, otherwise they see the login page again.
Leaving it open
Omit the auth.admin block to leave the admin surface unauthenticated. That's fine on a loopback-only dev box, but never expose an unauthenticated admin surface to the network. If you'd rather terminate auth at your infrastructure layer, put Coulisse behind a reverse proxy (oauth2-proxy, Cloudflare Access, Caddy's forward_auth), a VPN, or an SSH tunnel.
How it's built
The studio is composed in the cli binary. Each feature crate (memory, telemetry, judges, experiments) owns its own admin module — its routes, its askama templates, and its view models. The cli crate wires them together: a single base.html shell, the auth wrapping, and a tower middleware that wraps non-htmx responses in the layout so bookmarked deep URLs render with full navigation.
Cross-feature views (e.g. tool-call panels inside a conversation page) are filled in via htmx fragments — the conversation page, owned by memory, embeds hx-get requests against telemetry and judges. No feature crate depends on another for its admin surface; the browser orchestrates the composition. Tailwind (loaded via CDN) provides styling. Everything ships in the single Coulisse binary; there is no separate frontend build step.
Multi-agent routing
Coulisse lets you define multiple agents and route between them with nothing more than the model field of a request. No extra endpoints, no custom headers, no proxy tricks.
Why it matters
Most apps end up needing more than one model configuration:
- A fast, cheap agent for classification and quick replies.
- A heavier agent for hard reasoning.
- A specialized agent (code reviewer, translator, summarizer) with a tuned preamble.
- A tool-using agent that can reach into an MCP server.
Without something like Coulisse, that means either multiple deployments or a growing pile of if (mode === ...) switches inside your app.
The pattern
Declare each variant as a separate agent:
agents:
- name: triage
provider: anthropic
model: claude-haiku-4-5-20251001
preamble: Classify the user's intent. Reply with a single word.
- name: reasoner
provider: anthropic
model: claude-opus-4-7
preamble: You are a careful reasoner. Think step by step.
- name: translator
provider: openai
model: gpt-4o
preamble: Translate the user's message into French.
Your application picks which agent to call by setting the model field:
fast = client.chat.completions.create(model="triage", ...)
smart = client.chat.completions.create(model="reasoner", ...)
fr = client.chat.completions.create(model="translator", ...)
What each agent brings to the request
When a request arrives, Coulisse:
- Looks up the named agent.
- Prepends the agent's preamble as a system message.
- Resolves the agent's allowed MCP tools (if any).
- Forwards the call to the agent's configured provider and model.
- Records the exchange in the caller's per-user memory.
Changing agents is free — you don't need to redeploy anything on the client side.
Discovering agents at runtime
GET /v1/models returns every agent in the config in OpenAI's standard model-list format. Useful for UIs that want to populate a model picker from the server:
curl http://localhost:8421/v1/models
Subagents: agents as tools
Routing by model lets the client pick an agent per request. Sometimes you want one agent to call another from within a turn, so the conversation stays with the top-level agent while specialists handle focused sub-tasks. Coulisse exposes this via the subagents field.
agents:
- name: onboarder
provider: anthropic
model: claude-haiku-4-5-20251001
purpose: Collect the user's profile — first name, last name, phone, goals.
preamble: |
Ask the user for any missing profile field. Keep questions short.
- name: resume_critic
provider: anthropic
model: claude-sonnet-4-5-20250929
purpose: Critique and rewrite a resume for a target role.
preamble: |
Given a resume and a target role, return a revised resume and
a bullet list of the biggest gaps to address.
- name: career_coach
provider: anthropic
model: claude-sonnet-4-5-20250929
subagents: [onboarder, resume_critic]
preamble: |
Guide the user. Delegate to `onboarder` if the profile is
incomplete, and `resume_critic` when they want resume work.
When career_coach runs, the onboarder and resume_critic agents appear in its tool list alongside any MCP tools. If the model calls onboarder, Coulisse starts a fresh conversation against that agent with just the message it was given — the onboarder sees its own preamble and its own MCP tools, nothing inherited from the parent. The onboarder's final assistant message is returned to the coach as the tool result.
The purpose field
purpose is the tool description shown to the calling agent. It's how the coach's LLM decides whether this subagent is the right choice for the current turn. Keep it short and concrete — "Critique and rewrite a resume for a target role" is good; "Helpful assistant" is useless.
If purpose is absent, Coulisse falls back to "Invoke the '<name>' subagent." — functional, but a clear purpose is what makes orchestration reliable.
Bounded recursion
Calling a subagent is itself a tool call — the subagent can have its own subagents, which can have their own, and so on. To prevent a pathological A → B → A → … loop from burning tokens, Coulisse caps nested invocations at depth 4. Going over returns a clear error that the parent agent sees and can react to.
Fresh context
Every subagent invocation starts with a new conversation. The subagent does not see the parent's message history, the user's original request, or any other sibling subagent's output. It gets only the message the parent passed when calling it, plus its own preamble.
This isolation is deliberate. It keeps subagents focused, prevents context bloat, and makes each subagent's behavior reproducible in isolation. If you want data to flow between agents, store it in an MCP server and have both agents read it — Coulisse owns no cross-agent state.
Why subagents and MCPs live side by side
mcp_tools and subagents both appear in an agent's tool list, but they model different things:
- An MCP tool is a stateless function call against an external server: fixed schema, data in and data out.
- A subagent is another LLM conversation that happens to be kicked off by a tool call. It has its own preamble, its own tool loop, and can itself delegate further.
Reach for mcp_tools when the work is a concrete operation (save a record, search a database, send an email). Reach for subagents when the work needs its own LLM reasoning under a different preamble.
Per-user memory
Every request that carries a user identifier gets an isolated, persistent memory scope. Coulisse tracks two kinds of memory:
- Conversation history — the running transcript of messages the user has exchanged.
- Long-term memories — durable facts and preferences, embedded for semantic recall.
You don't need to manage this — it happens automatically on every request. When auto-extraction is on, Coulisse also decides what is worth remembering.
What happens on each request
1. Coulisse identifies the user from safety_identifier (or user).
2. It pulls the user's recent messages, fitting as many as possible into the context budget.
3. It runs a semantic recall against the user's long-term memories, picking the top matches.
4. It builds the final prompt: agent preamble → recalled memories → recent history → new message.
5. The model's reply is sent back and saved to the user's transcript.
6. If an extractor is configured, a background task asks a cheap model "any durable facts to remember from this exchange?" and stores novel ones.
Step 6 does not block the HTTP response — the user gets their answer first; memory grows in the background.
Isolation guarantees
User isolation is enforced by the API: Store::for_user(id) returns a handle scoped to a single user, and every SQL query bound through it filters on that user id. There is no code path that mixes data across users.
The context budget
| Knob | Default | Meaning |
|---|---|---|
| context_budget | 8,000 tokens | Total window size for messages + memories. |
| memory_budget_fraction | 0.1 (10%) | Share of the budget reserved for recalled long-term memories. |
| recall_k | 5 | How many long-term memories to recall per request. |
The remaining 90% goes to recent message history, newest first. If the history doesn't fit, older messages are dropped.
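A minimal sketch of how these knobs might look in coulisse.yaml. The exact key placement under memory: is an assumption here — the memory configuration reference has the authoritative schema:

```yaml
memory:
  context_budget: 8000          # total token window for messages + memories
  memory_budget_fraction: 0.1   # 10% of the budget reserved for recalled memories
  recall_k: 5                   # long-term memories recalled per request
```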
Embedders
Long-term memories are embedded as vectors. On each request, Coulisse embeds the incoming message and retrieves the top-k most similar memories by cosine similarity. That's how context from a conversation two weeks ago can surface when it becomes relevant again.
| Provider | Supported models | Notes |
|---|---|---|
| openai | text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002 | Default pairing for OpenAI-first setups. |
| voyage | voyage-3.5, voyage-3-large, voyage-3.5-lite, voyage-code-3, voyage-finance-2, voyage-law-2, voyage-code-2 | Anthropic officially recommends Voyage for embeddings. |
| hash | n/a | Deterministic bag-of-words, offline only. No semantic understanding — use only for tests and air-gapped development. |
Startup logs the chosen embedder. For hash the log line carries an explicit "OFFLINE — no semantic understanding" tag so nobody deploys it by accident.
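The recall step — embed the query, rank stored memories by cosine similarity, keep the top k — can be sketched in a few lines. This is an illustration of the technique, not Coulisse's internal code:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recall(query_vec: list[float], memories: list[dict], k: int = 5) -> list[dict]:
    """Return the k stored memories most similar to the query embedding."""
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    return ranked[:k]
```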
Auto-extraction ("remember what matters")
When you set memory.extractor in YAML, every completed exchange fires a background task that:
- Sends the last user-turn + assistant-turn to a cheap model with a focused prompt: "list any durable facts or preferences about the user; return [] if nothing worth keeping."
- Parses the JSON response.
- For each extracted fact, calls remember_if_novel — which embeds the fact and skips it if cosine similarity against an existing memory exceeds dedup_threshold (default 0.9).
Failures (bad JSON, timeout, provider error) are logged at warn and swallowed — the user already got their response. Extraction is best-effort.
To disable, omit the memory.extractor block entirely. Memories will still be recalled and can be populated through other code paths, but nothing writes to them automatically.
What gets stored where
| Data | Scope | Storage |
|---|---|---|
| Conversation messages | Per user | SQLite (messages table) |
| Long-term memories + vectors | Per user | SQLite (memories table, BLOB embeddings) |
| Tool invocations | Per user | SQLite (tool_calls table, linked to messages.id) |
| Judge scores | Per user | SQLite (scores table, linked to messages.id) |
| User identifier → internal ID | Shared | Derived deterministically — no storage needed |
Each memory row carries the id of the embedder that produced it. If you swap the embedder, old vectors become ineligible for recall (they'd be scored in the wrong space). They stay in the database but are silently ignored until you re-embed them.
Storage location
Defaults to ./coulisse-memory.db. Override with:
memory:
backend:
kind: sqlite
path: /var/lib/coulisse/memory.db
For tests or one-shot demos, use kind: in_memory — everything evaporates on shutdown.
Docker
The bundled Dockerfile declares a VOLUME /var/lib/coulisse so data survives container restarts. Mount a named volume or a host directory there:
docker run \
-v coulisse-data:/var/lib/coulisse \
-v $(pwd)/coulisse.yaml:/etc/coulisse/coulisse.yaml:ro \
-p 8421:8421 \
coulisse
The container runs as a non-root coulisse user and expects the database path inside the volume, e.g. /var/lib/coulisse/memory.db.
See memory configuration for the full YAML schema.
MCP tool integration
Coulisse is a client for Model Context Protocol servers. Any MCP-compliant tool — a calculator, a filesystem browser, a REST API wrapper, your in-house data fetcher — becomes usable by any agent with a one-line config change.
End-to-end example
Imagine a small MCP server that exposes a say_hello tool. Register it and hand it to an agent:
providers:
anthropic:
api_key: sk-ant-...
mcp:
hello:
transport: stdio
command: uvx
args:
- --from
- git+https://github.com/macsymwang/hello-mcp-server.git
- hello-mcp-server
agents:
- name: greeter
provider: anthropic
model: claude-sonnet-4-5-20250929
preamble: You greet people warmly.
mcp_tools:
- server: hello
Start the server. On boot you'll see Coulisse discover the server's tools and note them in the log.
Now the greeter agent can call say_hello whenever the model decides it's useful. Your client makes a normal chat completion request:
curl http://localhost:8421/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "greeter",
"safety_identifier": "user-1",
"messages": [
{"role": "user", "content": "Please greet Alice."}
]
}'
The model may call the tool one or more times; Coulisse runs the tool loop internally and returns only the final assistant message.
Under the hood, every invocation — tool name, arguments, result (or error) — is recorded against the assistant message that produced it, so you can replay the turn in the studio UI and see which tools fired and what came back. This is tool-call capture for debugging, not an extension of the OpenAI surface: the wire response your SDK receives is unchanged.
Transports
- stdio — good for local MCP servers you spawn yourself (Python scripts, Node programs, CLI tools). Coulisse manages the child process.
- http — good for long-running MCP services, especially ones shared across multiple Coulisse instances.
Both are configured the same way conceptually; see MCP tools for fields.
Scoping tools per agent
Different agents can see different subsets of tools, even from the same server:
agents:
- name: power-user
mcp_tools:
- server: filesystem # every tool the filesystem server offers
- name: read-only
mcp_tools:
- server: filesystem
only:
- read_file
- list_files # write / delete tools aren't exposed
This is Coulisse-side filtering — the model never sees the excluded tools, so it can't call them.
Tool loop limits
Coulisse caps a single request at 8 tool-call turns. If the model hasn't produced a final answer by then, the request ends. This keeps runaway loops from billing you forever.
Capture limitations
Tool-call capture only runs on the streaming path — every OpenAI SDK uses streaming for chat completions by default, so this covers normal usage. Non-streaming requests ("stream": false) still execute tools correctly; their invocations just aren't captured for the studio trail, because rig's non-streaming API doesn't expose intermediate events.
If a client disconnects mid-stream after a tool call has fired but before the result lands, the call is persisted with result: null so the studio UI still shows that the attempt happened.
Multi-backend support
Coulisse speaks to six providers out of the box:
- Anthropic
- OpenAI
- Gemini
- Cohere
- Deepseek
- Groq
You can mix them freely in a single config.
Why mix backends?
- Cost tiering. Run quick tasks on a cheap model (Groq, Haiku, gpt-4o-mini), hard tasks on a flagship.
- Capability routing. Some tasks benefit from a specific provider's strengths — long-context summarization on Gemini, coding on Sonnet, reasoning on Opus.
- Redundancy. If one provider has an outage, flip a single provider field to route through another.
- Evaluation. A/B the same preamble on two different models without changing any client code.
One config, many backends
providers:
anthropic:
api_key: sk-ant-...
openai:
api_key: sk-...
gemini:
api_key: ...
groq:
api_key: ...
agents:
- name: quick
provider: groq
model: llama-3.3-70b-versatile
preamble: Answer briefly.
- name: smart
provider: anthropic
model: claude-opus-4-7
preamble: Think carefully.
- name: long-context
provider: gemini
model: gemini-2.0-flash
preamble: You excel at synthesizing long documents.
Your client picks one by name — everything else stays the same.
The client side is unchanged
Because Coulisse exposes an OpenAI-compatible API no matter which provider is behind an agent, your client code never has to know. You don't install the Anthropic SDK, Gemini SDK, and OpenAI SDK side by side — you just use the OpenAI SDK and change the model field.
Streaming responses
Coulisse implements OpenAI's Server-Sent Events (SSE) format for chat completions. Set stream: true in the request and the server emits incremental chat.completion.chunk frames over the wire — drop-in compatible with the OpenAI Python and JavaScript SDKs and any client that already speaks the OpenAI streaming protocol.
Asking for a stream
Add stream: true to a normal /v1/chat/completions request:
{
"model": "assistant",
"safety_identifier": "user-123",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}
The response is text/event-stream instead of JSON. Each frame is one chat.completion.chunk.
Wire format
The first frame announces the assistant role:
data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"role":"assistant"}}]}
Then one frame per text delta:
data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"content":"Hello"}}]}
data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"content":" there"}}]}
A terminal frame sets finish_reason:
data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
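A client that doesn't use an SDK can reassemble the text from these frames directly. A minimal sketch of consuming the wire format shown above:

```python
import json

def accumulate(sse_lines: list[str]) -> str:
    """Reassemble the assistant's text from chat.completion.chunk frames."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":       # terminal sentinel
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:        # role-announce and terminal frames carry no content
            text.append(delta["content"])
    return "".join(text)
```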
Including token usage
Set stream_options.include_usage: true to receive a usage field on the terminal chunk:
{
"model": "assistant",
"messages": [{"role": "user", "content": "Hi"}],
"stream": true,
"stream_options": {"include_usage": true}
}
The terminal frame then carries usage:
data: {"...":"...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"completion_tokens":3,"prompt_tokens":7,"total_tokens":10}}
When include_usage is missing or false, the field is omitted — matching OpenAI's contract.
Memory and rate limiting
Streaming responses use the same per-user memory bucket and rate-limit accounting as non-streaming requests:
- The user's message and the assistant's reply are appended to memory after the stream ends.
- Token usage is recorded against the rate-limit window when the stream ends.
- If the client disconnects mid-stream, Coulisse persists the partial assistant reply (everything received before the disconnect). This matches what the user actually saw — the next turn won't claim the model said something the user never received.
Tool-using agents
Agents with MCP tools attached stream the same way. Tool-call internals run inside the rig multi-turn loop and are not surfaced to the client; you'll see a pause while a tool runs, then the model's text continues. The delta.content field is the only delta variant Coulisse currently emits.
Errors mid-stream
If the upstream provider fails after the stream has started, Coulisse emits one terminal frame containing an error field with the failure reason, then [DONE]. The HTTP status is already 200 by then — clients should check for the error field on the final chunk.
Rate limiting
Coulisse enforces per-user token limits across three rolling windows: hour, day, and month. Limits are set by the client, per request — not in the YAML — so callers can plug Coulisse into existing quota schemes without redeploying the server.
How it works
- Each request carries optional limit hints in its metadata field: tokens_per_hour, tokens_per_day, tokens_per_month.
- Before the model is called, Coulisse looks up the user's current usage in each window. If any counter is already at its cap, the request is rejected with 429 Too Many Requests.
- If the request passes, Coulisse runs it. On success, the total tokens consumed (request + response) are added to the user's counters.
- Counters reset on fixed boundaries: every hour, every 24 hours, every 30 days (aligned to UTC windows from the Unix epoch).
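Fixed epoch-aligned windows make the reset time a one-line computation — which is also where a Retry-After value comes from. A sketch of that arithmetic (the function name is illustrative):

```python
HOUR, DAY, MONTH = 3600, 86400, 30 * 86400  # window sizes in seconds

def window_reset(now: int, window_seconds: int) -> int:
    """Seconds until the current fixed window rolls over. Windows are
    aligned to the Unix epoch, so the boundary is a simple modulo."""
    return window_seconds - (now % window_seconds)
```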
Sending limits
Put the caps in the metadata object. Values are strings (OpenAI's metadata contract), parsed as non-negative integers:
{
"model": "assistant",
"safety_identifier": "alice@example.com",
"metadata": {
"tokens_per_hour": "50000",
"tokens_per_day": "500000",
"tokens_per_month": "5000000"
},
"messages": [
{"role": "user", "content": "Hi!"}
]
}
All three keys are independent and all are optional — send only the windows you care about. Omit the whole metadata object and the request is unlimited.
When a limit is hit
The server responds with:
- Status: 429 Too Many Requests
- Header: Retry-After: <seconds> — time until the offending window resets
- Body:
{
"error": {
"type": "rate_limited",
"message": "daily token limit exceeded: used 512000/500000, retry after 40213s"
}
}
The message names which window tripped (hourly, daily, monthly), how many tokens were used, the cap, and the seconds to wait.
Invalid metadata
If a metadata value isn't a valid non-negative integer, the server returns 400 Bad Request:
{
"error": {
"type": "invalid_request",
"message": "metadata key 'tokens_per_hour' must be a non-negative integer, got 'abc'"
}
}
Scope and isolation
- Per user. Each user (keyed by safety_identifier or the fallback user field) has isolated counters.
- Anonymous requests can't be rate-limited. Coulisse needs an identifier. In setups with a default_user_id (see User identification), all anonymous requests share that user's counter.
- Per process. Counters live in memory. If you run multiple Coulisse instances behind a load balancer, each has its own view — for shared quotas, limit upstream (in a gateway) instead.
- Lost on restart. Counters are not persisted. This is deliberate for now; durable accounting is on the roadmap.
Why per-request limits instead of YAML?
Quotas usually live in your user/billing system, not your model-routing config. Putting limits in the request lets the caller decide — e.g. your app looks up the user's plan, fills in the numbers, and forwards the request. Coulisse just honors what you send.
Token cost tracking
Coulisse converts each chat completion's token usage into a USD cost using a vendored snapshot of LiteLLM's model pricing table. The cost lands in the per-turn llm_call event alongside the raw token counts, so the studio UI shows it next to every model call.
There's nothing to enable. As long as a turn produces token usage and the model is in the table, you'll see a $0.0042-style badge on the corresponding llm_call row in the per-turn event tree.
How it's computed
For each completion Coulisse looks up the configured (provider, model) pair in the vendored table and multiplies:
- input_tokens × input_cost_per_token
- output_tokens × output_cost_per_token
- cache_creation_input_tokens × cache_creation_input_token_cost (Anthropic prompt-cache writes)
- cached_input_tokens × cache_read_input_token_cost (Anthropic prompt-cache reads)
Missing fields in the upstream table are treated as zero — fine for providers like Groq that don't price cache tokens. Models that don't appear in the table at all yield a null cost: the request still succeeds, the llm_call event still records the token usage, and the studio simply omits the cost badge.
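The computation above, with the missing-fields-count-as-zero rule, fits in one function. A sketch — the dict keys mirror the pricing-table field names, but the function itself is illustrative:

```python
def usd_cost(usage: dict, prices: dict) -> float:
    """Token counters × per-token prices. Absent price fields are
    treated as zero, matching the vendored-table semantics."""
    return (
        usage.get("input_tokens", 0) * prices.get("input_cost_per_token", 0.0)
        + usage.get("output_tokens", 0) * prices.get("output_cost_per_token", 0.0)
        + usage.get("cache_creation_input_tokens", 0)
          * prices.get("cache_creation_input_token_cost", 0.0)
        + usage.get("cached_input_tokens", 0)
          * prices.get("cache_read_input_token_cost", 0.0)
    )
```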
Refreshing the pricing table
The snapshot lives at crates/providers/data/model_prices.json and is checked into git. New models are added upstream regularly; refresh the snapshot with:
just refresh-prices
This downloads the latest version from LiteLLM's main branch and overwrites the local file. The diff lands in git like any other change so you can review what moved before committing.
There's no live fetching at runtime: cost lookup only ever reads from the vendored snapshot. That keeps the request path free of network dependencies and makes pricing updates an explicit, reviewable action.
What's not (yet) covered
- EUR or other currencies. Cost is stored and displayed in USD only. If there's demand for a configurable display currency (telemetry.display_currency: { code: EUR, usd_rate: 0.92 }-style), it can be added without changing the on-disk format.
- Cost-based rate limiting. Rate limits currently work on token counts. Cost is recorded but not yet enforced; a future usd_per_day: knob would consume the same data.
- Per-tool / per-MCP cost. Tool calls have their own tool_call events but don't carry a cost themselves. Costs are charged to the parent llm_call event, which is the only place tokens are spent.
- Custom or unlisted models. Self-hosted models or models that LiteLLM hasn't added yet won't have a price. There's no YAML override path today; if you need one, open an issue describing the use case.
Response language
Coulisse lets the caller pin the language the model replies in. Without it, the model infers language from the user's message — which can drift when the user switches languages mid-conversation or types short, ambiguous prompts. With it, every response comes back in the language you asked for.
Language is set per request, via the metadata object. The caller decides — Coulisse doesn't maintain a user-level language preference.
How to send it
Add a language key to metadata. The value is a BCP 47 tag (RFC 5646):
{
"model": "assistant",
"safety_identifier": "user-123",
"metadata": {
"language": "fr-FR"
},
"messages": [
{"role": "user", "content": "Hello!"}
]
}
Any valid BCP 47 tag works: en, fr, fr-FR, es-MX, zh-Hant, ja-JP. The tag is validated — malformed values come back as 400 Bad Request. Omit the key entirely to let the model pick.
How it reaches the model
Coulisse appends a short instruction to the system preamble before calling the provider — something like "Always reply in French, even when the user writes in a different language. Do not include translations in any other language." The instruction is phrased as a hard constraint so the model doesn't mirror the user's language or append a parenthetical translation. For tags in the built-in language-name table (common ISO 639-1 subtags: en, fr, es, de, it, pt, ja, zh, ko, ar, nl, pl, ru, sv, tr, hi), the instruction uses the English name. For anything else, the raw tag is passed through — frontier models understand BCP 47 directly, so cy (Welsh) works fine.
The instruction is added once per request, as the first system message. Your own system messages in the messages array still apply, and agent preambles from coulisse.yaml are preserved.
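The name-or-raw-tag fallback can be sketched as follows. The table here is a small illustrative subset, and the exact instruction wording is an assumption based on the "something like" example above:

```python
# Illustrative subset of the built-in ISO 639-1 name table.
LANGUAGE_NAMES = {"en": "English", "fr": "French", "es": "Spanish", "ja": "Japanese"}

def language_instruction(tag: str) -> str:
    """Build the appended system instruction: English name when the
    primary subtag is known, else the raw BCP 47 tag passes through."""
    primary = tag.split("-")[0].lower()
    name = LANGUAGE_NAMES.get(primary, tag)
    return (
        f"Always reply in {name}, even when the user writes in a different "
        f"language. Do not include translations in any other language."
    )
```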
Real-world example: country code to language
A common pattern is to derive the language from the caller's locale on your side — phone country code, IP-based geolocation, browser Accept-Language, a user profile setting — and forward the resulting tag:
{
"model": "assistant",
"safety_identifier": "+33612345678",
"metadata": {
"language": "fr-FR"
},
"messages": [
{"role": "user", "content": "What's the weather?"}
]
}
Coulisse doesn't do the mapping itself. It takes the tag you send and asks the model to respond in that language. That keeps the metadata format stable and the country-code-to-language table (which changes slowly but does change) out of server code.
Errors
A malformed tag returns 400 Bad Request:
{
"error": {
"type": "invalid_request",
"message": "invalid `metadata.language`: invalid language tag: ..."
}
}
Empty-string and whitespace-only values are rejected the same way.
LLM-as-judge evaluation
Coulisse can score every agent reply with a separate LLM — a judge — and persist the result so you can track quality over time. You describe what to evaluate in the YAML rubric; Coulisse handles scoring shape, format, sampling, and storage.
This is useful for watching agent drift, comparing model/preamble changes, and catching regressions without standing up a separate evaluation pipeline.
How it works
- A client sends a chat request. The agent replies as usual — the judge never blocks the response.
- After the reply is persisted, Coulisse runs each judge the agent opted in to, in a background task.
- Each judge samples according to its sampling_rate (skip entirely if the draw misses), then asks its backing model to score the assistant's reply against every rubric at once.
- The response is parsed into one score row per rubric — persisted under the same user id as the conversation.
- Failures (bad JSON, provider error, timeout) are logged at warn and swallowed — the user already got their answer.
Scores are stored in the same SQLite database as messages and memories, in a scores table keyed by message_id. Averages are computed at read time, not aggregated on write.
YAML
agents:
- name: assistant
provider: anthropic
model: claude-sonnet-4-5-20250929
preamble: You are a helpful assistant.
judges: [quality] # opt in by name
- name: translator
provider: anthropic
model: claude-sonnet-4-5-20250929
preamble: Translate into French.
judges: [fluency]
judges:
# Cheap, broad check — 100% of turns, small model.
- name: quality
provider: openai
model: gpt-4o-mini
sampling_rate: 1.0
rubrics:
accuracy: Factual accuracy. Flag hallucinations.
helpfulness: Whether the assistant answered the user's question.
tone: Politeness and tone.
# Targeted check for the translator — only 20% of turns.
- name: fluency
provider: openai
model: gpt-4o-mini
sampling_rate: 0.2
rubrics:
grammar: Grammatical correctness of the French output.
naturalness: How native the phrasing sounds.
The wiring is visible from the agent: when you read an agent block you see which judges score it, rather than having to hunt through the judge list to figure out coverage.
Rubrics
A rubric is a map from criterion name to a short description of what to assess.
rubrics:
accuracy: Factual accuracy. Flag hallucinations.
helpfulness: Whether the assistant answered the user's question.
Keep descriptions terse and assess-able. Don't write scale, format, or JSON instructions into them — Coulisse adds those internally. The description should tell the judge what matters, not how to answer.
Each criterion produces one Score row per scored turn, with its own numeric value and short reasoning. All criteria for one judge are evaluated in a single LLM call, so adding criteria to a judge doesn't multiply cost.
Scoring shape
Every score is an integer in 0..=10 with a one-sentence reasoning. Coulisse forces this shape through the preamble and parses the judge's JSON reply — you don't configure it.
If you need a different scale (e.g. boolean pass/fail, categorical), that will arrive as a future scale: field; the default stays numeric 0-10.
Sampling
sampling_rate controls what fraction of turns are scored.
| Value | Meaning |
|---|---|
| 1.0 (default) | Score every turn. |
| 0.1 | Roughly 10% of turns. |
| 0.0 | Never score (useful to park a judge without deleting it). |
The draw is independent per turn, per judge. Over many turns the scored fraction converges on the configured rate. Lower rates save tokens for expensive judges; broad cheap judges can run at 1.0.
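An independent per-turn draw is just a Bernoulli trial against the configured rate. A sketch of the mechanism (not Coulisse's internal RNG handling):

```python
import random

def should_score(sampling_rate: float, rng: random.Random) -> bool:
    """Independent Bernoulli draw per (turn, judge): score iff the
    uniform draw in [0, 1) lands below the configured rate."""
    return rng.random() < sampling_rate
```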
Choosing a judge model
Pick a model that's different from the agent being scored whenever you can. A judge scoring its own output is biased — a cheap cross-provider judge (e.g. gpt-4o-mini judging a Claude agent, or vice versa) is usually closer to neutral.
Strong, slow models make sense for low-volume deep checks (sampling_rate: 0.1). Cheap, fast models make sense for high-volume broad checks (sampling_rate: 1.0).
Multiple judges per agent
Stack judges to get different dimensions at different cost points:
agents:
- name: assistant
provider: anthropic
model: claude-sonnet-4-5-20250929
judges: [broad_check, deep_audit]
judges:
- name: broad_check
provider: openai
model: gpt-4o-mini
sampling_rate: 1.0
rubrics:
helpfulness: Whether the user's question was answered.
tone: Politeness and tone.
- name: deep_audit
provider: anthropic
model: claude-opus-4
sampling_rate: 0.05 # 5% of turns, expensive
rubrics:
accuracy: Factual accuracy, including references and claims.
safety: Harmful, biased, or unsafe content.
Each judge is independent — its own model, rate, and rubric set. A turn can end up with zero, one, or both of these judges scoring it, depending on the sampling draw.
Viewing scores
The studio UI at /admin/ now shows a Scores panel per user. It surfaces two things:
- Averages — mean score per `(judge, criterion)` across every turn the user has had, with sample count.
- Recent — the most recent individual scores with reasoning.
Validation at startup
Coulisse fails fast on:
- A judge referencing a provider that's not declared under `providers:`.
- A judge with no rubrics.
- A `sampling_rate` outside `[0.0, 1.0]`.
- An agent referencing a judge name that doesn't exist.
Any violation aborts startup with a message naming the offending judge or agent.
Cost control
Two knobs matter:
- `sampling_rate` — the easy one. Halve it, halve the judge bill.
- Judge model — the big one. A `gpt-4o-mini` judge at 100% sampling often costs less than a `gpt-4o` judge at 10%. Pick the cheapest model that gives you a stable signal.
A useful pattern is to run a cheap judge at 100% and a strong judge at a small fraction — the cheap one catches the broad signal, the strong one spot-checks the hardest cases.
Experiments (A/B testing)
Run multiple agent configurations under a single addressable name and let Coulisse pick which one serves each request. Useful for comparing models, preambles, or tool sets without changing client code.
How it works
- Define each candidate as a normal agent under `agents:`.
- Declare an `experiment` whose `name` is what clients send as `model`.
- List the candidate agents as variants and choose a strategy.
When a request arrives, the router resolves the experiment name to one variant (and optionally fires off shadow runs in the background). The variant choice is sticky-by-user by default, so the same user always lands on the same variant for a given experiment — conversation memory and persona stay consistent across turns.
Strategies
Three strategies are wired today: split, shadow, and bandit.
split
Weighted random sampling. Sticky by user when `sticky_by_user: true` (the default) — the variant is a deterministic hash of `(user_id, experiment_name)` modulo the cumulative weights, with no database writes. Adding or removing a variant reshuffles users.
agents:
- name: assistant-sonnet
provider: anthropic
model: claude-sonnet-4-5-20250929
- name: assistant-gpt
provider: openai
model: gpt-4o
experiments:
- name: assistant # what clients send as model
strategy: split
variants:
- agent: assistant-sonnet
weight: 0.5
- agent: assistant-gpt
weight: 0.5
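The sticky assignment can be pictured as a deterministic hash mapped onto the cumulative weights. This sketch uses SHA-256; Coulisse's actual hash function is an internal detail:

```python
import hashlib

def pick_variant(user_id: str, experiment: str,
                 variants: list[tuple[str, float]]) -> str:
    """Deterministically map a user onto one weighted variant.
    No database writes: the same inputs always give the same answer."""
    digest = hashlib.sha256(f"{user_id}:{experiment}".encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    total = sum(w for _, w in variants)
    acc = 0.0
    for agent, weight in variants:
        acc += weight / total
        if point < acc:
            return agent
    return variants[-1][0]  # guard against float rounding at the top edge

variants = [("assistant-sonnet", 0.5), ("assistant-gpt", 0.5)]
# Same user, same experiment: same variant, every time.
pick_variant("user-123", "assistant", variants)
```

Because nothing is stored, changing the variant list changes the cumulative boundaries, which is why adding or removing a variant reshuffles users.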
shadow
Designate one variant as primary; it serves the user normally. The other variants run in the background against the same prepared context, are scored by their judges, and never write to the user's message history. The user never waits on shadow variants.
sampling_rate (default 1.0) controls how often shadow runs fire — set it lower to cap cost.
experiments:
- name: assistant
strategy: shadow
primary: assistant-sonnet
sampling_rate: 0.25 # 25% of turns also run the shadows
variants:
- agent: assistant-sonnet
- agent: assistant-gpt
Use shadow to collect comparison data before flipping a split rollout — the primary serves all real traffic while you build up scoring evidence on the challenger.
bandit
Epsilon-greedy multi-armed bandit. Reads recent mean scores per variant from the existing scores table, picks the leader most of the time (1 - epsilon), and explores a random arm otherwise. Arms with fewer than min_samples recent scores are forced — the bandit only exploits once every arm has enough evidence.
agents:
- name: assistant-sonnet
provider: anthropic
model: claude-sonnet-4-5-20250929
judges: [quality]
- name: assistant-gpt
provider: openai
model: gpt-4o
judges: [quality]
judges:
- name: quality
provider: openai
model: gpt-4o-mini
rubrics:
helpfulness: Whether the assistant answered the user's question.
experiments:
- name: assistant
strategy: bandit
metric: quality.helpfulness # judge.criterion
epsilon: 0.1
min_samples: 30
bandit_window_seconds: 604800 # 7 days
variants:
- agent: assistant-sonnet
- agent: assistant-gpt
The configured judge (quality) and the criterion (helpfulness) must be declared on every variant agent — otherwise the bandit starves on that arm. Validation enforces this at startup.
A note on stickiness: with sticky_by_user: true (the default), the bandit decision is computed at request time via a deterministic hash of (user_id, experiment_name), so a given user typically lands on the same arm. Mean scores update as new data arrives, so a user can shift if a different arm overtakes the leader — that is the trade-off for keeping the assignment stateless.
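The selection logic amounts to epsilon-greedy with a forced-exploration floor. A minimal sketch, assuming the mean scores and sample counts have already been read from the scores table:

```python
import random

def choose_arm(mean_scores: dict[str, float], samples: dict[str, int],
               epsilon: float, min_samples: int, rng: random.Random) -> str:
    """Epsilon-greedy over variant arms, forcing under-sampled arms first."""
    # Any arm below min_samples is served before exploitation begins.
    starved = [arm for arm, n in samples.items() if n < min_samples]
    if starved:
        return rng.choice(starved)
    if rng.random() < epsilon:
        return rng.choice(list(mean_scores))       # explore a random arm
    return max(mean_scores, key=mean_scores.get)   # exploit the leader

means = {"assistant-sonnet": 7.8, "assistant-gpt": 6.9}
counts = {"assistant-sonnet": 120, "assistant-gpt": 95}
choose_arm(means, counts, epsilon=0.1, min_samples=30, rng=random.Random(0))
```

With both arms past the `min_samples: 30` floor here, the leader is served with probability `1 - epsilon`.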
Namespace and migration
Experiment names share a namespace with agent names. To A/B-test an existing agent without breaking clients:
1. Rename the agent (`assistant` → `assistant-v1`).
2. Add a sibling agent (`assistant-v2`).
3. Add an experiment named `assistant` with both as variants.
Clients keep sending `model: assistant` and it resolves transparently.
Variants stay individually addressable as agents under their own names (assistant-v1, assistant-v2) — useful for isolating one variant in tests or debugging.
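The migration steps above, sketched as one config (the provider and model values are placeholders):

```yaml
agents:
  - name: assistant-v1        # was `assistant`; renamed, otherwise unchanged
    provider: anthropic
    model: claude-sonnet-4-5-20250929
  - name: assistant-v2        # the new challenger
    provider: openai
    model: gpt-4o

experiments:
  - name: assistant           # clients keep sending model: assistant
    strategy: split
    variants:
      - agent: assistant-v1
      - agent: assistant-v2
```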
Subagents
A subagent reference can name an agent or an experiment. If orchestrator lists subagents: [assistant] and assistant is an experiment, every subagent call resolves to a variant for the calling user, the same way a top-level request would. Sticky-by-user keeps the variant consistent across the whole conversation.
Give the experiment a `purpose:` if it's exposed as a subagent — it becomes the tool description the calling agent's LLM sees:
experiments:
- name: assistant
purpose: A general-purpose chat assistant.
strategy: split
variants:
- agent: assistant-sonnet
- agent: assistant-gpt
Bandit subagents read mean scores at call time, so the same exploit/explore behavior applies inside subagent dispatch.
Telemetry
Each turn's `TurnStart` event includes `agent` (the resolved variant) and, when an experiment was hit, `experiment` (the experiment name) and `variant` (same as `agent`). Judge scores are tagged with the variant's agent name in the database, so per-variant aggregation flows through the same table without a join — used by the bandit's mean-score query and the studio's per-variant view.
Studio
The studio shows configured experiments at /admin/experiments: strategy, sticky-by-user flag, and per-variant weight + share. For bandit experiments, the page additionally shows the configured metric, epsilon, and min-samples threshold, plus per-variant sample counts and mean scores (loaded inline via htmx from the judges admin endpoints). Shadow experiments call out the primary variant.
Validation
Coulisse rejects the following at startup:
- Experiment name colliding with an agent name (rename one).
- Experiment name colliding with another experiment.
- Experiment with zero variants.
- Variant referencing an undefined agent.
- Variant weight `<= 0`.
- Duplicate variant agent within one experiment.
- Strategy-specific fields used with the wrong strategy (e.g. `primary` on a `split` experiment).
- `shadow` without a `primary`, or with a `primary` that's not one of the variants.
- `shadow` `sampling_rate` outside `[0.0, 1.0]`.
- `bandit` without a `metric`.
- `bandit` `metric` that doesn't match an existing `judge.criterion`, or a variant that doesn't opt into the metric's judge.
- `bandit` `epsilon` outside `[0.0, 1.0]`.
Smoke tests
A smoke test is a synthetic-user persona that drives a conversation against one of your agents (or experiments). Coulisse plays the user — you write a preamble describing who they are and what they want — and the assistant replies for real. Every assistant turn flows through the same judge pipeline as production traffic, so you get a transcript and scores back without writing any harness code.
Smoke tests are most useful when you're iterating on a prompt: tweak the preamble, hit "Run now" in the studio, watch the scores. Pair them with experiments and a single click exercises the variants — sticky-by-user routing samples them across repetitions, and the judge scores feed straight into bandit selection.
How it works
1. You trigger a run from the studio (`/admin/smoke/<name>`) — no client needed.
2. Coulisse opens a fresh synthetic user id and starts a loop:
   - The persona model produces a "user" message, given the conversation so far with roles flipped (so the model speaks as the user).
   - The target agent replies as it normally would, with all its real MCP tools, subagents, and preambles.
   - The reply is fanned out to every judge the target agent opts into. Scores land in the same `scores` table as production runs, keyed by the assistant turn's id.
3. The loop stops when either side emits the configured `stop_marker`, or when `max_turns` is hit.
4. The full transcript is browsable at `/admin/smoke/runs/<run_id>` — assistant in slate, persona in amber.
Smoke runs never write to the user's memory or rate-limit windows. Each repetition uses a brand-new synthetic user id, so split/bandit experiments naturally sample variants across reps.
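The role flip that lets the persona model speak as the user is simple to sketch (simplified; the real loop also threads tools and judges through each turn):

```python
def flip_roles(transcript):
    """Present the conversation from the persona's point of view, so the
    persona model generates the next *user* message as its assistant turn."""
    swap = {"user": "assistant", "assistant": "user"}
    return [(swap[role], text) for role, text in transcript]

transcript = [("user", "Hi, I'm looking for work."),
              ("assistant", "What role are you targeting?")]
flip_roles(transcript)
# [("assistant", "Hi, I'm looking for work."),
#  ("user", "What role are you targeting?")]
```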
YAML
smoke_tests:
- name: jobseeker_basic
target: tremplin # agent or experiment name
persona:
provider: anthropic
model: claude-haiku-4-5-20251001
preamble: |
You are role-playing a 28-year-old looking for a developer job in Paris.
Reply like a real human: short questions, follow-ups as the conversation goes.
When you have a satisfactory answer, finish with "[FIN]".
initial_message: "Hi, I'm looking for work."
stop_marker: "[FIN]"
max_turns: 10
repetitions: 5
| Field | Required | Default | Notes |
|---|---|---|---|
| `name` | yes | — | Unique within `smoke_tests`. Shows up at `/admin/smoke/<name>`. |
| `target` | yes | — | Agent name or experiment name. Resolved through the experiment router per run. |
| `persona` | yes | — | Provider, model, and preamble for the synthetic user. |
| `initial_message` | no | — | Hard-coded first message from the persona. Skipping this lets the persona open the conversation. |
| `stop_marker` | no | — | Substring that ends the run when emitted by either side. |
| `max_turns` | no | 10 | Cap on persona-then-agent pairs. |
| `repetitions` | no | 1 | Independent runs launched per "Run now" click. Each gets a fresh synthetic user id. |
Iterating with experiments
Define two variants of an agent (e.g. assistant-v1, assistant-v2), wrap them in a bandit experiment, and target the experiment name from a smoke test:
experiments:
- name: assistant
strategy: bandit
metric: quality.helpfulness
variants:
- agent: assistant-v1
- agent: assistant-v2
smoke_tests:
- name: convergence
target: assistant
repetitions: 50
persona: { provider: openai, model: gpt-4o-mini, preamble: "..." }
Hit "Run now" once and the bandit accumulates 50 samples per variant per turn pair. The experiment page picks the winner on its own.
Limitations (today)
- Smoke runs bypass the memory pipeline. Fact extraction and semantic recall are not exercised.
- No scheduled runs — trigger is manual via the studio.
- No tool-call assertions; assertions about what the agent did during a turn live in the judge rubrics.
Telemetry
Coulisse emits its own observability via the `tracing` crate. Every request opens a `turn` span; every tool invocation (MCP or subagent) opens a child `tool_call` span. The configured layers — fmt, SQLite, and optionally OTLP — receive those spans and route them wherever you've pointed them.
The result: the studio UI gives you an offline audit trail, and any OpenTelemetry-compatible backend (Grafana, SigNoz, Jaeger, Honeycomb, ...) gives you live traces. They're driven from the same source — there's no separate path.
Span model
| Span name | Opened when | Fields |
|---|---|---|
| `turn` | a chat completion request arrives | `agent`, `experiment` (when applicable), `turn_id`, `user_id`, `user_message` |
| `tool_call` | an MCP or subagent tool fires | `args`, `error` (on failure), `kind` (`mcp`/`subagent`), `result`, `tool_name` |
| `llm_call` | a chat completion finishes (token usage is known) | `cost_usd` (when the model is in the pricing table), `model`, `provider`, `usage` |
turn is the root; tool_call and llm_call nest under it via the tracing span tree, so OTLP backends render them as a trace tree out of the box.
Studio integration
When telemetry.sqlite.enabled is true (the default), the studio's per-turn event tree and tool-call panel render directly from the same spans. Nothing extra to wire up — open /admin/ and the tree is there.
OTLP backends
Set telemetry.otlp.endpoint to start exporting. The exporter batches spans, retries on transient failures, and shuts down cleanly on process exit so in-flight spans land before the server stops.
Tested with:
- Grafana (Tempo / Cloud) — gRPC at `4317`.
- SigNoz (self-hosted or Cloud) — gRPC; for Cloud add a `signoz-access-token` header.
- Jaeger — gRPC at `4317` (Jaeger ≥ 1.50 speaks OTLP natively).
- Honeycomb — HTTP/protobuf at `https://api.honeycomb.io/v1/traces` with an `x-honeycomb-team` header.
Tuning verbosity
The fmt layer (stderr logs) is controlled by RUST_LOG:
RUST_LOG=info,sqlx=warn coulisse # default
RUST_LOG=debug coulisse # verbose, including SQL driver
RUST_LOG=warn coulisse # quiet
RUST_LOG=coulisse=debug,agents=trace coulisse # per-crate filtering
The SQLite and OTLP layers are not affected by RUST_LOG — they capture every turn / tool_call / llm_call span regardless of log level.
Disabling layers
Each layer has its own enabled flag. Common combinations:
# Production with external observability stack
telemetry:
sqlite:
enabled: false # studio not exposed; no need to keep DB rows
otlp:
endpoint: "..."
# Local development, no external backend
telemetry:
# default fmt + sqlite
# CI / load tests — minimize logging overhead
telemetry:
fmt:
enabled: false
sqlite:
enabled: false
CLI reference
Coulisse ships as a single binary with a handful of subcommands. Every
subcommand accepts -c, --config <PATH> (default coulisse.yaml) and
honors the COULISSE_CONFIG env var as a fallback.
State files (coulisse.pid, coulisse.log) live in a .coulisse/
directory next to the config file — this keeps state co-located with
the project and makes cd && coulisse stop "just work."
coulisse init
Write a starter coulisse.yaml in the current directory.
coulisse init # minimal template (one OpenAI agent + sqlite memory)
coulisse init --from-example # full annotated example (every section, every option)
coulisse init --force # overwrite an existing coulisse.yaml
coulisse start
Start the server, detached by default. Returns once the server has written its PID file, or fails if the server hasn't booted within 5 seconds.
coulisse start # detached background server
coulisse start --foreground # attached: logs stream to the terminal
coulisse start -F # short form
A bare coulisse invocation is equivalent to coulisse start --foreground — the historical pre-subcommand behavior is preserved.
When detached, stdout/stderr are appended to .coulisse/coulisse.log.
coulisse stop
Send SIGTERM to a running detached server (PID read from
.coulisse/coulisse.pid).
coulisse stop # graceful: SIGTERM, wait up to 10s
coulisse stop --force # SIGKILL (use if the server is wedged)
Stop is a no-op if the server isn't running — stale PID files left over from crashes are detected and removed.
coulisse restart
Equivalent to coulisse stop && coulisse start.
coulisse status
Report whether the detached server is running and where its files live.
running (pid 31427)
config: ./coulisse.yaml
log: ./.coulisse/coulisse.log
coulisse check
Load and validate the YAML without starting the server. Catches schema errors and cross-reference issues (agent → provider, agent → judge, experiment variant → agent, ...) before a real start.
coulisse check
# ok — coulisse.yaml (3 agents, 1 judges, 0 experiments, 2 providers)
coulisse update
Fetch the latest release from GitHub and replace the running binary
in place. Detects the host target triple (e.g.
aarch64-apple-darwin) and downloads the matching cargo-dist
artifact. No-op if you're already on the latest version.
coulisse update
# checking for updates...
# updated to 0.2.0
The binary needs write permission to its own path — if you installed
under /usr/local/bin you may need sudo.
State directory layout
your-project/
├── coulisse.yaml
└── .coulisse/
├── coulisse.pid # written by `start`, removed on clean exit
├── coulisse.log # detached stdout/stderr
└── memory.db # if you point memory.backend.path here
.coulisse/ is the recommended target for memory.backend.path so
the whole runtime footprint of one project sits under a single
directory.
HTTP API
Coulisse listens on 0.0.0.0:8421 and exposes an OpenAI-compatible surface.
POST /v1/chat/completions
The main chat endpoint. Accepts the standard OpenAI chat completion request shape.
Request
{
"model": "assistant",
"safety_identifier": "user-123",
"messages": [
{"role": "user", "content": "Hello!"}
]
}
| Field | Required | Notes |
|---|---|---|
| `messages` | yes | The usual OpenAI message array. At least one user message is required. |
| `metadata` | no | Optional map of strings. Used for per-request rate limits — see below. |
| `model` | yes | Name of an agent from your config. |
| `safety_identifier` | yes¹ | Identifies the user. Can be any stable string. |
| `stream` | no | When `true`, the response is an SSE stream of `chat.completion.chunk` frames. See Streaming responses. |
| `stream_options` | no | Object. `include_usage: true` adds the `usage` field to the terminal stream chunk. |
| `user` | — | Deprecated OpenAI field; accepted as a fallback. |
¹ Required unless a default_user_id is set in coulisse.yaml — see User identification.
Recognized metadata keys
metadata is a passthrough map of strings. Coulisse interprets the following keys; any other keys are ignored.
| Key | Type | Meaning |
|---|---|---|
| `language` | BCP 47 tag | Forces the response language, e.g. `fr-FR`. See Response language. |
| `tokens_per_day` | integer (as string) | Max tokens per rolling day. |
| `tokens_per_hour` | integer (as string) | Max tokens per rolling hour. |
| `tokens_per_month` | integer (as string) | Max tokens per rolling 30-day window. |
All optional. See Rate limiting for the token-limit behavior.
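Rolling-window token quotas of this kind can be sketched with a timestamped deque (illustrative only; Coulisse's in-memory limiter is an internal detail):

```python
from collections import deque

class RollingWindow:
    """Track token spend over a rolling window; reject when over quota."""
    def __init__(self, limit: int, window_secs: float):
        self.limit, self.window = limit, window_secs
        self.events: deque[tuple[float, int]] = deque()  # (timestamp, tokens)

    def try_spend(self, tokens: int, now: float) -> bool:
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()          # drop events outside the window
        used = sum(t for _, t in self.events)
        if used + tokens > self.limit:
            return False                   # would exceed the quota: 429
        self.events.append((now, tokens))
        return True

w = RollingWindow(limit=1000, window_secs=3600)
w.try_spend(800, now=0)      # True: within quota
w.try_spend(300, now=10)     # False: would exceed 1000 tokens in the hour
w.try_spend(300, now=3601)   # True: the first spend has rolled off
```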
Response
Standard OpenAI chat completion response:
{
"id": "...",
"object": "chat.completion",
"created": 1714000000,
"model": "assistant",
"choices": [
{
"index": 0,
"message": {"role": "assistant", "content": "Hi!"},
"finish_reason": "stop"
}
]
}
Streaming
Set stream: true to receive chat.completion.chunk frames over Server-Sent Events instead of one JSON response. The full wire format and disconnect semantics live in Streaming responses.
Errors
Errors come back in OpenAI's error shape:
{
"error": {
"type": "invalid_request_error",
"message": "safety_identifier is required",
"code": null
}
}
Common cases:
- 400 — missing `safety_identifier` (when required), no user message, unknown agent name, unparseable `metadata` values.
- 429 — per-user token limit exceeded. Includes a `Retry-After` header with seconds until the window resets. See Rate limiting.
- 5xx — upstream provider error, MCP server failure.
GET /v1/models
Lists every agent defined in the config.
Response
{
"object": "list",
"data": [
{"id": "assistant", "object": "model", "owned_by": "coulisse"},
{"id": "code-reviewer", "object": "model", "owned_by": "coulisse"}
]
}
Useful for UI dropdowns that want to populate a model picker from the server.
Admin / config endpoints
Everything under /admin/* is a single content-negotiated surface. The same routes serve HTML pages to browsers, HTML fragments to htmx, and JSON to scripts — set Accept: application/json (or send an HX-Request header) to switch representation. Request bodies are equally tolerant: application/json, application/yaml, and application/x-www-form-urlencoded all deserialize into the same target type.
All admin routes are gated by the auth.admin scope.
Agents
| Method | Path | Body | Notes |
|---|---|---|---|
| GET | `/admin/agents` | — | List configured agents (HTML or JSON). |
| POST | `/admin/agents` | `AgentConfig` | Create a new agent. 409 if the name is taken. |
| GET | `/admin/agents/{name}` | — | Detail (HTML or JSON). |
| PUT | `/admin/agents/{name}` | `AgentConfig` | Replace the named agent. Body `name` must match the URL. |
| DELETE | `/admin/agents/{name}` | — | Remove the named agent. |
| GET | `/admin/agents/new` | — | HTML form for a new agent. |
| GET | `/admin/agents/{name}/edit` | — | HTML edit form. |
`AgentConfig` is the same shape used in `coulisse.yaml`: `name`, `provider`, `model`, `preamble`, `purpose` (optional), `judges` (list, optional), `subagents` (list, optional), `mcp_tools` (list, optional).
Judges, experiments, providers, MCP servers
Same CRUD shape as agents — list / create / one / update / delete. Adjust the path to suit:
| Path | Body | Notes |
|---|---|---|
| `/admin/judges` + `/admin/judges/{name}` | `JudgeConfig` | LLM-as-judge evaluators. |
| `/admin/experiments` + `/admin/experiments/{name}` | `ExperimentConfig` | A/B routing groups. The runtime `ExperimentRouter` rebuilds on restart; admin display reflects the file in real time. |
| `/admin/providers` + `/admin/providers/{kind}` | `ProviderConfig` (just `api_key`); POST body adds `kind` | Where `{kind}` is one of `anthropic`, `cohere`, `deepseek`, `gemini`, `groq`, `openai`. The runtime client is built at boot — restart to swap. |
| `/admin/mcp` + `/admin/mcp/{name}` | `McpServerConfig` (`transport: stdio` + `command`/`args`/`env`, or `transport: http` + `url`); POST body adds `name` | Connections open at boot — restart to attach a new server. |
Whole-file config
| Method | Path | Body | Notes |
|---|---|---|---|
| GET | `/admin/config` | — | Returns the file contents (`application/yaml` by default, JSON when `Accept: application/json`). |
| PUT | `/admin/config` | full YAML/JSON | Replaces `coulisse.yaml` atomically. Validation runs before any disk write. |
| GET | `/admin/openapi.json` | — | OpenAPI 3.1 description of every admin route, including request/response schemas. Feed it to `openapi-generator` or any client codegen for typed SDKs. |
Validation, hot reload, the file watcher
Every write — admin form save, JSON PUT, hand-edit in $EDITOR — flows through the same pipeline:
- The body is merged into the on-disk YAML (preserving sections this binary doesn't recognize).
- The full result is deserialized into a
Configand run through cross-feature validation (provider references, judge references, experiment variants, …). - Only on success does anything touch disk: a temp file is written and renamed atomically.
- The file watcher fires, the new config is reloaded, and feature crates' hot-reloadable state (agent list, judges list, experiments list, settings view) atomically swaps in.
Errors return the validator's message verbatim with a 422 Unprocessable Entity (or 400 for malformed bodies). The on-disk file is unchanged when validation rejects a write.
The studio UI is just one client of these endpoints — see Studio UI for what the rendered surface offers and authentication options.
Auth
By default Coulisse leaves /v1/* open. Configure the auth.proxy scope in YAML to require Basic credentials or OIDC for SDK clients; configure auth.admin to gate the studio. See Studio UI for the schema. Anything you don't gate is your responsibility to terminate at the infrastructure layer (reverse proxy, API gateway, VPN).
YAML schema
A complete reference for every field in coulisse.yaml.
Top-level
agents: [ ... ] # required, non-empty
auth: { ... } # optional; per-scope auth for /v1/* and /admin/*
default_user_id: <string> # optional, unset by default
experiments: [ ... ] # optional; A/B test groups over agents
judges: [ ... ] # optional; empty/omitted = no evaluation
mcp: { ... } # optional
memory: { ... } # optional; defaults to sqlite + hash embedder
providers: { ... } # required
smoke_tests: [ ... ] # optional; synthetic-user evaluation runs
telemetry: { ... } # optional; fmt + sqlite on by default, OTLP opt-in
auth
- Type: object
- Optional. Omit to leave both surfaces unauthenticated (fine for local dev, never for anything exposed beyond loopback).
Two independent scopes:
- `auth.proxy` guards the OpenAI-compatible `/v1/*` surface that SDK clients call.
- `auth.admin` guards the `/admin/*` surface (the studio UI).
Each scope is itself optional and accepts the same shape: exactly one of basic or oidc when present. They are mutually exclusive within a scope — the server rejects a scope block that has both or neither. The two scopes are independent, so you can enable Basic on one and OIDC on the other.
auth.<scope>.basic
Static HTTP Basic credentials. Best for local dev or a single-operator deployment.
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `password` | string | yes | — | Non-empty. Rotate if suspected leaked — there's no token revocation. |
| `username` | string | no | `admin` | Non-empty when set. |
auth:
admin:
basic:
password: choose-something-strong
username: admin
auth.<scope>.oidc
Authorization-code-with-PKCE login against an OIDC-compliant IdP (Authentik, Keycloak, Auth0, Google, etc.). Access control is delegated to the IdP's application policy — Coulisse accepts any successfully authenticated user.
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `client_id` | string | yes | — | Must match the client registered at the IdP. |
| `client_secret` | string | no | — | Required for confidential clients (Authentik's default); omit for public clients using PKCE only. |
| `issuer_url` | string | yes | — | IdP issuer. For Authentik: `https://<host>/application/o/<app-slug>/`. |
| `redirect_url` | string | yes | — | Public base URL inside the protected scope. Must be registered as the redirect URI at the IdP. `axum-oidc` allows every subpath of this URL as a valid redirect. |
| `scopes` | list<string> | no | `[email, profile]` | Extra OAuth2 scopes. `openid` is added automatically. |
auth:
admin:
oidc:
issuer_url: https://authentik.example.com/application/o/coulisse/
client_id: coulisse-admin
client_secret: <secret>
redirect_url: http://localhost:8421/admin/
default_user_id
- Type: string
- Default: unset
- Purpose: fallback identifier for requests that don't supply `safety_identifier` (or the deprecated `user`).
Leave it unset for multi-tenant deployments — unidentified requests will be rejected. Set it to something like "main" for local or single-user setups so memory still works whether or not the client bothers to send an id. See User identification.
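The resolution order can be sketched as follows (the function name is hypothetical; the precedence follows the API reference: `safety_identifier`, then the deprecated `user`, then `default_user_id`):

```python
def resolve_user_id(safety_identifier, user, default_user_id):
    """First non-empty candidate wins; with none set, the request
    is rejected with the 400 shown in the HTTP API section."""
    for candidate in (safety_identifier, user, default_user_id):
        if candidate:
            return candidate
    raise ValueError("safety_identifier is required")

resolve_user_id(None, None, "main")      # "main": local single-user setup
resolve_user_id("user-123", None, None)  # "user-123": explicit id wins
```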
providers
- Type: map of `provider_kind → provider_config`
- Required. At least one provider must be declared.
Supported keys
`anthropic`, `cohere`, `deepseek`, `gemini`, `groq`, `openai`.
Per-provider fields
| Field | Type | Required | Notes |
|---|---|---|---|
| `api_key` | string | yes | Provider API key. |
providers:
anthropic:
api_key: sk-ant-...
openai:
api_key: sk-...
mcp
- Type: map of `server_name → server_config`
- Optional. Omit if you don't use tools.
Server names are arbitrary — they're what agents refer to under `mcp_tools`.
Common fields
| Field | Type | Required | Notes |
|---|---|---|---|
| `transport` | enum | yes | `stdio` or `http`. |
transport: stdio
| Field | Type | Required | Notes |
|---|---|---|---|
| `command` | string | yes | Executable to run. |
| `args` | list<str> | no | Command-line arguments. |
| `env` | map<str,str> | no | Environment variables for the child. |
transport: http
| Field | Type | Required | Notes |
|---|---|---|---|
| `url` | string | yes | Streamable-HTTP MCP endpoint. |
Examples
mcp:
hello:
transport: stdio
command: uvx
args: [--from, git+https://..., hello-mcp-server]
calculator:
transport: http
url: http://localhost:8080
memory
- Type: object
- Optional. Omit for defaults (sqlite at `./coulisse-memory.db`, offline `hash` embedder, no auto-extraction).
See Memory configuration for the full walkthrough and examples.
Sub-fields
| Field | Type | Required | Default |
|---|---|---|---|
| `backend.kind` | enum | no | `sqlite` |
| `backend.path` | string | no | `./coulisse-memory.db` |
| `embedder.provider` | enum | no | `hash` |
| `embedder.model` | string | depends | required for `openai`/`voyage` |
| `embedder.api_key` | string | no | falls back to `providers.<provider>` |
| `embedder.dims` | int | no | 32 (`hash` only) |
| `extractor.provider` | string | yes* | — (* required when `extractor` is set) |
| `extractor.model` | string | yes* | — |
| `extractor.dedup_threshold` | float | no | 0.9 |
| `extractor.max_facts_per_turn` | int | no | 5 |
| `context_budget` | int | no | 8000 |
| `memory_budget_fraction` | float | no | 0.1 |
| `recall_k` | int | no | 5 |
agents
- Type: list of agent configs
- Required. At least one agent must be defined.
Per-agent fields
| Field | Type | Required | Notes |
|---|---|---|---|
| `name` | string | yes | Unique agent identifier; clients pass this as `model`. |
| `provider` | string | yes | Key under `providers`. |
| `model` | string | yes | Upstream model identifier. |
| `preamble` | string | no | System prompt. Default: empty. |
| `judges` | list<string> | no | Names of judges (from top-level `judges:`) that evaluate this agent's replies. Empty = no evaluation. |
| `mcp_tools` | list<mcp_tool_access> | no | Tools this agent may use. |
| `purpose` | string | no | Tool description when this agent is exposed via another agent's `subagents`. Omit for standalone agents; add a concrete one-line description when this agent is meant to be called as a specialist. |
| `subagents` | list<string> | no | Names of other agents exposed as callable tools. Each entry must refer to another entry under `agents`. Self-reference and duplicates are rejected at startup. |
mcp_tools entry
| Field | Type | Required | Notes |
|---|---|---|---|
| `server` | string | yes | Key under `mcp`. |
| `only` | list<str> | no | Allowed tool names. Omit for full access. |
Complete agent example
agents:
- name: code-reviewer
provider: anthropic
model: claude-sonnet-4-5-20250929
preamble: |
You are a thorough code reviewer.
mcp_tools:
- server: filesystem
only:
- read_file
- server: hello
Subagent example
agents:
- name: resume_critic
provider: anthropic
model: claude-sonnet-4-5-20250929
purpose: Critique and rewrite a resume for a target role.
preamble: |
Given a resume and a target role, return a revised resume
and a bullet list of the biggest gaps.
- name: coach
provider: anthropic
model: claude-sonnet-4-5-20250929
subagents: [resume_critic]
preamble: |
Delegate resume work to `resume_critic` when relevant.
See Multi-agent routing for the full subagent walkthrough.
experiments
- Type: list of experiment configs
- Optional. Omit (or leave empty) to skip A/B testing.
An experiment wraps two or more agents under one addressable name. Clients send the experiment's name in the model field and the router picks a variant per request. Experiment names share the agent namespace — collisions are rejected at startup.
See Experiments for the end-to-end walkthrough.
Per-experiment fields
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `bandit_window_seconds` | int | no (bandit) | 604800 (7 d) | Bandit-only. Maximum age of scores included in mean-arm computations. |
| `epsilon` | float | no (bandit) | 0.1 | Bandit-only. Probability in `[0.0, 1.0]` of routing to a random arm instead of the leader. |
| `metric` | string | yes (bandit) | — | Bandit-only. `judge.criterion` to optimise. The judge must declare the criterion in its rubrics, and every variant must opt into the judge. |
| `min_samples` | int | no (bandit) | 30 | Bandit-only. Each arm must accumulate this many scores before exploitation is allowed. |
| `name` | string | yes | — | Addressable name; must not collide with any agent name. |
| `primary` | string | yes (shadow) | — | Shadow-only. Variant agent that serves the user. Must be one of `variants`. |
| `purpose` | string | no | — | Tool description when the experiment is exposed via another agent's `subagents:`. |
| `sampling_rate` | float | no (shadow) | 1.0 | Shadow-only. Probability in `[0.0, 1.0]` that a turn also runs the non-primary variants in the background. |
| `sticky_by_user` | bool | no | `true` | When `true`, the same user always lands on the same variant (deterministic hash, no DB writes). |
| `strategy` | enum | yes | — | `split`, `shadow`, or `bandit`. |
| `variants` | list<variant> | yes | — | Non-empty. Each entry references an agent. |
variants entry
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `agent` | string | yes | — | Name of an agent declared under top-level `agents:`. Variants must reference concrete agents — nesting an experiment is rejected. |
| `weight` | float | no | 1.0 | Strictly positive. Normalised against the sum of all variant weights. |
Example
```yaml
agents:
  - name: assistant-sonnet
    provider: anthropic
    model: claude-sonnet-4-5-20250929
  - name: assistant-gpt
    provider: openai
    model: gpt-4o

experiments:
  - name: assistant
    strategy: split
    variants:
      - agent: assistant-sonnet
        weight: 0.5
      - agent: assistant-gpt
        weight: 0.5
```
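One way to picture `split` routing with `sticky_by_user: true` is a deterministic hash of the user id mapped onto the normalised weights. This is a sketch under assumptions (Coulisse's real hash function and normalisation may differ):

```python
import hashlib

def pick_variant(user_id, variants):
    """variants: list of (agent_name, weight) with strictly positive weights."""
    total = sum(w for _, w in variants)
    # Deterministic: the same user id always hashes to the same point in [0, 1),
    # so no database write is needed to remember the assignment.
    digest = hashlib.sha256(user_id.encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for agent, weight in variants:
        cumulative += weight / total  # normalise against the weight sum
        if point < cumulative:
            return agent
    return variants[-1][0]  # guard against float rounding
```

The weights need not sum to 1.0; `(3, 1)` behaves the same as `(0.75, 0.25)` after normalisation.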
judges
- Type: list of judge configs
- Optional. Omit (or leave empty) for no automatic evaluation.
Judges are background LLM-as-judge evaluators. An agent opts in by listing judge names in its own judges: field. See LLM-as-judge evaluation for the full walkthrough.
Per-judge fields
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `name` | string | yes | — | Unique judge identifier; agents refer to it here. |
| `provider` | string | yes | — | Must match a key under `providers`. |
| `model` | string | yes | — | Upstream model identifier for the judge call. |
| `rubrics` | map<string,string> | yes | — | `criterion: short description` of what to assess. One score row per criterion per scored turn. Must declare at least one entry. |
| `sampling_rate` | float | no | 1.0 | In `[0.0, 1.0]`. 1.0 = every turn, 0.1 ≈ 10%, 0.0 = never. |
Rubric descriptions should say what to evaluate — don't include scale, JSON, or format instructions. Coulisse forces the output shape internally (integer 0-10 per criterion with a one-sentence reasoning).
Example
```yaml
judges:
  - name: quality
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 1.0
    rubrics:
      accuracy: Factual accuracy. Flag hallucinations.
      helpfulness: Whether the assistant answered the user's question.
      tone: Politeness and tone.
```
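The forced output shape (an integer 0-10 score plus a one-sentence reasoning per rubric criterion) can be checked with a few lines. A hypothetical validator, not Coulisse's internal code:

```python
def validate_judge_output(rubrics, output):
    """Check a judge response: one {score, reasoning} entry per rubric
    criterion, with an integer score in 0..10. Raises ValueError otherwise."""
    for criterion in rubrics:
        if criterion not in output:
            raise ValueError(f"missing criterion: {criterion}")
        entry = output[criterion]
        score = entry.get("score")
        if not isinstance(score, int) or not 0 <= score <= 10:
            raise ValueError(f"{criterion}: score must be an integer in 0-10")
        if not isinstance(entry.get("reasoning"), str):
            raise ValueError(f"{criterion}: reasoning must be a string")
    return True
```

This is why rubric descriptions stay free of format instructions: the shape is enforced after the judge call, not negotiated in the prompt.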
smoke_tests
- Type: list of smoke test configs
- Optional. Omit (or leave empty) for no synthetic-user runs.
Each entry pairs a persona (an LLM that role-plays the user) with a target agent or experiment. Triggered from the studio at `/admin/smoke/<name>`. See Smoke tests for the workflow.
Per-test fields
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `name` | string | yes | — | Unique within `smoke_tests`. |
| `target` | string | yes | — | Agent or experiment name. Resolved per run via the experiment router. |
| `persona` | object | yes | — | `provider`, `model`, `preamble` for the role-played user. |
| `initial_message` | string | no | — | Hard-coded first persona turn. Omit to let the persona open the conversation. |
| `stop_marker` | string | no | — | Substring that ends the run when emitted by either side. |
| `max_turns` | integer | no | 10 | Cap on persona-then-agent pairs per run. |
| `repetitions` | integer | no | 1 | Independent runs launched per click. Each gets a fresh synthetic user id. |
Example
```yaml
smoke_tests:
  - name: jobseeker_basic
    target: tremplin
    persona:
      provider: anthropic
      model: claude-haiku-4-5-20251001
      preamble: |
        You are a 28-year-old looking for a developer job in Paris.
        Reply like a real human; finish with "[FIN]" once satisfied.
    initial_message: "Hi, I'm looking for work."
    stop_marker: "[FIN]"
    max_turns: 10
    repetitions: 5
```
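The persona/agent exchange amounts to a turn loop that honours `stop_marker` and `max_turns`. A sketch with stub callables (the function names here are hypothetical; the real runner lives inside Coulisse):

```python
def run_smoke_test(persona, agent, initial_message=None,
                   stop_marker=None, max_turns=10):
    """persona/agent are callables taking the transcript so far and returning
    the next message. Returns the transcript as (role, message) pairs."""
    transcript = []
    user_msg = initial_message or persona(transcript)
    for _ in range(max_turns):  # cap on persona-then-agent pairs
        transcript.append(("user", user_msg))
        if stop_marker and stop_marker in user_msg:
            break  # either side can end the run by emitting the marker
        reply = agent(transcript)
        transcript.append(("agent", reply))
        if stop_marker and stop_marker in reply:
            break
        user_msg = persona(transcript)
    return transcript
```

With `repetitions: 5`, five independent runs of this loop are launched, each against a fresh synthetic user id so memories do not bleed between runs.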
telemetry
- Type: object
- Optional. Omit it and Coulisse runs with stderr `fmt` logs at `info`, plus the SQLite mirror that drives the studio UI; no external traces.
The block has three sub-sections — fmt, sqlite, and otlp — each independently toggleable. See Telemetry configuration for the full schema and Telemetry & OpenTelemetry for span semantics and OTLP backend integration.
```yaml
telemetry:
  fmt:
    enabled: true       # default
  sqlite:
    enabled: true       # default; powers the studio UI
  otlp:                 # absent = no external traces
    endpoint: "http://localhost:4317"
    protocol: grpc      # or http_binary
    service_name: coulisse
    headers:
      authorization: "Bearer ${OTEL_API_KEY}"
```
Validation
On startup, Coulisse checks:
- Each present `auth` scope (`proxy`, `admin`) declares exactly one of `basic` or `oidc`.
- `auth.<scope>.basic.password` and `auth.<scope>.basic.username` are non-empty.
- `auth.<scope>.oidc.client_id`, `issuer_url`, and `redirect_url` are non-empty.
- There is at least one agent.
- Agent names are unique.
- Every agent's `provider` is configured.
- Every referenced MCP server is configured.
- Every name in `subagents` refers to a defined agent or experiment.
- No agent lists itself under `subagents`.
- `subagents` entries are unique within an agent (no duplicates).
- Experiment names are unique and do not collide with any agent name.
- Each experiment declares at least one variant.
- Each variant references a defined agent and has a strictly positive `weight`.
- Variant agents within an experiment are unique.
- Strategy-specific fields are only set on the matching strategy (e.g. `primary` only on `shadow`, `metric` only on `bandit`).
- For `shadow`: `primary` is set and matches one of the variants; `sampling_rate` is in `[0.0, 1.0]`.
- For `bandit`: `metric` is `judge.criterion`; the judge exists, declares the criterion in its rubrics, and every variant opts into the judge; `epsilon` is in `[0.0, 1.0]`.
- Every referenced judge exists.
- Judge names are unique.
- Every judge's `provider` is configured and supported.
- Every judge has at least one rubric.
- Every judge's `sampling_rate` is in `[0.0, 1.0]`.
Any violation fails fast with an error message that names the offending agent or judge and field.
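A couple of these checks, sketched in miniature over plain dicts (a hypothetical helper; Coulisse's real validator is in Rust and covers the full list above):

```python
def validate(agents, experiments):
    """Fail fast with a message naming the offending item and field."""
    names = [a["name"] for a in agents]
    if len(set(names)) != len(names):
        raise ValueError("agent names must be unique")
    for exp in experiments:
        if exp["name"] in names:
            raise ValueError(
                f"experiment '{exp['name']}' collides with an agent name")
        if not exp.get("variants"):
            raise ValueError(
                f"experiment '{exp['name']}' declares no variants")
        for v in exp["variants"]:
            if v["agent"] not in names:
                raise ValueError(
                    f"experiment '{exp['name']}': unknown agent '{v['agent']}'")
            if v.get("weight", 1.0) <= 0:
                raise ValueError(
                    f"experiment '{exp['name']}': weight must be strictly positive")
```

Running every check at startup, before the server binds, is what makes a typo in `coulisse.yaml` a one-line error instead of a runtime 500.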
Releasing
Coulisse follows Semantic Versioning. Pre-1.0, minor bumps may include breaking changes to the YAML schema, HTTP surface, or CLI; patch bumps will not.
Cutting a release
1. Bump the version in the workspace `Cargo.toml`:

   ```toml
   [workspace.package]
   version = "0.2.0"
   ```

   All workspace crates inherit this via `version.workspace = true`, so this is the only place to edit.

2. Update `CHANGELOG.md` — rename the `## [Unreleased]` section to `## [0.2.0] - YYYY-MM-DD` and start a fresh `## [Unreleased]` block above it.

3. Commit, tag, push:

   ```sh
   git commit -am "Release v0.2.0"
   git tag v0.2.0
   git push && git push --tags
   ```
The `v*.*.*` tag triggers two workflows:
- `release.yml` (cargo-dist) — builds binaries and installers for macOS (x86 + ARM), Linux GNU (x86 + ARM), and Windows MSVC, then publishes them as a GitHub Release with auto-generated notes.
- `docker.yml` — builds a multi-arch image and pushes to `ghcr.io/almaju/coulisse` tagged `latest`, `0.2`, and `0.2.0`.
Hotfixes
For patch releases on the latest minor, branch from the previous tag, fix forward, then tag `v0.2.1` from that branch. The same workflow handles it.
Roadmap
What's in Coulisse today, and what's coming.
Working today
- Multi-agent routing via the `model` field.
- Agents as tools — expose one agent to another under `subagents:` with a `purpose:` description. Nested invocations are bounded by a depth cap.
- Per-user conversation history with isolation.
- Long-term memory with semantic recall — persistent via SQLite and backed by a real embedder (OpenAI or Voyage AI; `hash` fallback for offline dev).
- Auto-extraction — an optional background task pulls durable facts from each exchange and deduplicates them before storing.
- Tunable memory budgets (`context_budget`, `memory_budget_fraction`, `recall_k`) in YAML.
- Multi-backend support (Anthropic, OpenAI, Gemini, Cohere, Deepseek, Groq).
- OpenAI-compatible HTTP API (`/v1/chat/completions`, `/v1/models`).
- Read-only studio UI at `/admin/` for browsing conversations, memories, and judge scores.
- LLM-as-judge evaluation — background scoring of agent replies against YAML-defined rubrics, with per-judge sampling and per-user persistence.
- Experiments (A/B testing) — wrap multiple agents under one addressable name and route traffic between them with sticky-by-user defaults. Three strategies: `split` (weighted random), `shadow` (primary serves the user, others run in the background and are scored), and `bandit` (epsilon-greedy on a single judge criterion).
- Streaming responses over SSE (`stream: true`, with `stream_options.include_usage`).
- MCP tool integration over stdio and HTTP, with per-agent filtering.
- Per-user token rate limiting (hour / day / month).
- YAML-driven config with startup validation.
- Docker image with a volume-mounted SQLite store.
Planned
Durable rate-limit state
Current rate-limit counters live in memory — they reset on restart and don't span multiple instances. A durable, shared backend is planned so quotas survive reboots and horizontal scaling.
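The current behaviour amounts to per-user counters held in process memory, which is exactly why they reset on restart and cannot be shared across instances. A sketch under assumptions (the class and field names are illustrative, not Coulisse's internals):

```python
import time
from collections import defaultdict

WINDOWS = {"hour": 3600, "day": 86_400, "month": 30 * 86_400}

class InMemoryQuota:
    """Per-user token counters. Lost on restart, invisible to other instances."""
    def __init__(self, limits):
        self.limits = limits  # e.g. {"hour": 10_000, "day": 100_000}
        self.events = defaultdict(list)  # user -> [(timestamp, tokens)]

    def allow(self, user, tokens, now=None):
        now = now if now is not None else time.time()
        for window, limit in self.limits.items():
            cutoff = now - WINDOWS[window]
            used = sum(t for ts, t in self.events[user] if ts >= cutoff)
            if used + tokens > limit:
                return False  # this window's quota would be exceeded
        self.events[user].append((now, tokens))
        return True
```

A durable backend would move `self.events` into shared storage, so the same `allow` decision holds across restarts and replicas.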
Workflow orchestration
Chaining agents into declarative pipelines (one agent's output feeds the next, with conditional routing) — all configured in YAML rather than app code.
Vector index for large memory stores
Recall currently does a linear cosine scan over all memories for the user. Fine at hundreds-to-low-thousands of memories per user, but a vector index will be needed if per-user memory counts grow into the tens of thousands.
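The linear scan described here is short to state: embed the query, score cosine similarity against every stored memory for the user, keep the top `recall_k`. A pure-Python sketch of that O(n) pass:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall(query_embedding, memories, recall_k=5):
    """memories: list of (text, embedding). Linear in the user's memory count."""
    scored = [(cosine(query_embedding, emb), text) for text, emb in memories]
    scored.sort(reverse=True)
    return [text for _, text in scored[:recall_k]]
```

At a few hundred memories per user this is microseconds; a vector index only pays off once the per-user count grows by a couple of orders of magnitude.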
Per-agent memory overrides
Today the `memory:` block is global. A future revision will allow per-agent scoping (different embedders or budgets per agent) for cases where one agent handles long-form research and another handles short user chat.
This list reflects what's on deck at the time of writing — check the repository for the current state.