Coulisse
One YAML file. An OpenAI-compatible server with memory, tools, and multi-backend routing.
Coulisse is a single Rust binary that reads a coulisse.yaml file and spins up an OpenAI-compatible HTTP server. You point your existing tools, SDKs, and projects at it like any other OpenAI endpoint — and everything configurable lives in that one YAML file.
Why Coulisse?
Every multi-agent project ends up re-implementing the same plumbing:
- Per-user conversation memory
- Routing between model providers
- Rate limits and retries
- Tool integration
- Multiple agents with different system prompts
Coulisse collapses this plumbing into one configurable server. You describe the setup in YAML and pilot the whole thing from there, instead of writing glue code for each prototype.
How it works
┌──────────────────┐        ┌──────────────────┐        ┌──────────────────┐
│  Your SDK / app  │───────▶│     Coulisse     │───────▶│    Anthropic     │
│ (OpenAI client)  │        │                  │        │    OpenAI        │
└──────────────────┘        │  coulisse.yaml   │        │    Gemini …      │
                            │                  │        └──────────────────┘
                            │  + memory        │
                            │  + MCP tools     │        ┌──────────────────┐
                            │  + per-user      │───────▶│   MCP servers    │
                            └──────────────────┘        └──────────────────┘
- Your application talks to Coulisse using any OpenAI-compatible SDK.
- Coulisse picks the agent you asked for (by model name), assembles the user's memory, and calls the right backend.
- The response flows back — and the exchange is saved to that user's memory for next time.
What's in the box
| Feature | Status |
|---|---|
| Multi-agent routing | ✅ Working |
| Per-user memory | ✅ Persistent (SQLite) with semantic recall |
| Real embedders | ✅ OpenAI + Voyage (hash fallback for offline dev) |
| Auto-extraction | ✅ Optional — pulls durable facts from each exchange |
| MCP tool integration | ✅ Working (stdio + HTTP) |
| Multi-backend support | ✅ Anthropic, OpenAI, Gemini, Cohere, Deepseek, Groq |
| OpenAI-compatible API | ✅ /v1/chat/completions, /v1/models |
| Streaming responses | ✅ Server-Sent Events |
| Rate limiting | ✅ Per-user token quotas (hour / day / month, in-memory) |
| Studio UI | ✅ /admin/ — conversations, memories, judges, live task board, admin edits |
| Triggers (cron / webhook / boot) | ✅ Start agents on a schedule or via HTTP POST |
| Async task queue | ✅ Fire-and-forget background work with dispatch_task |
| Sidecars | ✅ Long-lived helper processes managed by Coulisse |
| Config variables (vars:) | ✅ Named snippets shared across agent preambles |
| IDE schema (coulisse schema) | ✅ JSON Schema for autocompletion in VS Code, Helix, Zed… |
| Durable rate-limit state | ⏳ Planned |
Continue to Installation to get started.
Stability
Coulisse is pre-1.0. It follows Semantic Versioning, but
during the 0.x phase, minor version bumps (0.1 → 0.2) may include breaking
changes to the YAML schema, HTTP surface, or CLI. Patch bumps (0.1.0 → 0.1.1)
will not. See the Releasing chapter and
CHANGELOG.md
for the version history.
Installation
Coulisse is a single Rust binary. Install it from a prebuilt release or build from source.
Requirements
- A valid API key for at least one supported provider
Install from a release
The latest GitHub Release ships installers for macOS (x86 + ARM), Linux GNU (x86 + ARM), and Windows MSVC.
macOS / Linux:
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/Almaju/coulisse/releases/latest/download/coulisse-installer.sh | sh
Windows (PowerShell):
powershell -ExecutionPolicy Bypass -c "irm https://github.com/Almaju/coulisse/releases/latest/download/coulisse-installer.ps1 | iex"
The installer drops the coulisse binary on your PATH.
Build from source
Requires Rust (edition 2024) — install from rustup.rs.
git clone https://github.com/Almaju/coulisse.git
cd coulisse
cargo build --release
The binary lands at target/release/coulisse. Drop it on your PATH
(or alias it) so the rest of this guide can call it as coulisse.
Initialize a config
coulisse init
This writes a minimal coulisse.yaml in the current directory: one
OpenAI agent, sqlite memory, the offline hash embedder. Run
coulisse init --from-example instead for the full annotated tour
covering every section.
Edit the file to set your provider API key.
Start the server
coulisse start
start runs the server detached: it returns immediately and the
process keeps running in the background. Stop it later with
coulisse stop.
To run attached (logs streaming to your terminal), use
coulisse start --foreground — or just coulisse with no subcommand.
Either form binds port 8421.
You should see a startup banner like:
coulisse 0.1.0
Proxy → http://localhost:8421/v1
Admin → http://localhost:8421/admin
Memory sqlite at .coulisse/coulisse-memory.db; user_state: disabled (history only)
Auth proxy: open · admin: open
Agents (1)
assistant openai / gpt-4o-mini
The exact lines depend on your config — what matters is that memory, auth, and every configured agent are each acknowledged on startup.
Next: write your first config, or read the CLI reference for every subcommand.
Your first config
A minimal coulisse.yaml has two things: a provider (where to send model calls) and an agent (how to call it).
providers:
  anthropic:
    api_key: ${ANTHROPIC_API_KEY}
agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are a helpful assistant.
Save this as coulisse.yaml in your working directory, then run coulisse.
What each piece does
providers
A map of provider kind → credentials. The key must be one of the supported kinds (see Providers). You only need to list the providers you actually use.
API keys (and any other string values) can be read from environment variables using ${VAR_NAME} — Coulisse expands them before parsing the YAML. If a referenced variable is unset, the server refuses to start and names the missing variable. See the YAML reference for details.
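The expansion step works on the raw YAML text before any parsing happens. Here is a minimal Python sketch of it. It is an illustration, not Coulisse's Rust implementation. expand_env, MissingEnvVar, and the error wording are hypothetical. The pattern also assumes variable names made only of letters, digits, and underscores.

```python
import os
import re

# Sketch only: hypothetical helper, not Coulisse's implementation.
# Matches ${VAR_NAME}; assumes names use letters, digits and underscores.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


class MissingEnvVar(Exception):
    pass


def expand_env(text, env=None):
    """Replace every ${VAR} in the raw YAML text before it is parsed.

    Raises MissingEnvVar naming the first unset variable, mirroring the
    "refuse to start and name the missing variable" behaviour.
    """
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if name not in env:
            raise MissingEnvVar(f"environment variable '{name}' is not set")
        return env[name]

    return _VAR.sub(substitute, text)


raw = "providers:\n  anthropic:\n    api_key: ${ANTHROPIC_API_KEY}\n"
print(expand_env(raw, {"ANTHROPIC_API_KEY": "sk-ant-test"}))
```

Because substitution happens on text, a ${VAR} works inside any string value: keys, URLs, or OAuth client secrets.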
agents
A list of agents. Each agent is a named recipe:
- name: the identifier. Clients ask for the agent by this name via the model field in their request.
- provider: which configured provider to route to.
- model: the upstream model identifier to call (e.g. claude-sonnet-4-5-20250929, gpt-4o).
- preamble: optional system prompt prepended to every conversation.
You can define as many agents as you want — see Multi-agent routing for what that unlocks.
Adding more
Want a code reviewer, a pirate, and a tool-using agent? Just add more entries:
providers:
  anthropic:
    api_key: ${ANTHROPIC_API_KEY}
  openai:
    api_key: ${OPENAI_API_KEY}
agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are a helpful assistant.
  - name: code-reviewer
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: |
      You are a thorough code reviewer. Focus on correctness,
      clarity, and security.
  - name: gpt-assistant
    provider: openai
    model: gpt-4o
    preamble: You are a helpful assistant.
Restart the server — all three agents are now selectable by model name.
Next: make a request.
Making a request
Coulisse exposes an OpenAI-compatible API, so any OpenAI SDK works. Point the client at http://localhost:8421/v1 and set the model field to an agent name from your config.
curl
curl http://localhost:8421/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "assistant",
"safety_identifier": "user-123",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
Python (openai SDK)
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8421/v1",
api_key="not-needed", # Coulisse doesn't check this
)
response = client.chat.completions.create(
model="assistant",
messages=[{"role": "user", "content": "Hello!"}],
extra_body={"safety_identifier": "user-123"},
)
print(response.choices[0].message.content)
TypeScript / JavaScript
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8421/v1",
apiKey: "not-needed",
});
const response = await client.chat.completions.create({
model: "assistant",
messages: [{ role: "user", content: "Hello!" }],
  // @ts-ignore: passed through to Coulisse; older SDK typings may not declare this field
safety_identifier: "user-123",
});
console.log(response.choices[0].message.content);
The safety_identifier field
Coulisse identifies users through the safety_identifier field (or the deprecated user field, which works too). The identifier is what keeps each user's conversation history isolated.
You can turn this off — see User identification — but by default every request needs one.
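Below is a minimal Python sketch of how a server might pick the identifier out of a request body. Several details are assumptions, not documented behaviour: safety_identifier taking precedence over user when both are sent, the function name, and the error text.

```python
# Sketch only: hypothetical helper; the field precedence is an assumption.
class MissingUserId(ValueError):
    pass


def user_id_from_request(body, required=True):
    """Return the identifier that scopes memory for this request."""
    # Prefer safety_identifier; fall back to the deprecated user field.
    for field in ("safety_identifier", "user"):
        value = body.get(field)
        if isinstance(value, str) and value:
            return value
    if required:
        raise MissingUserId("request needs a safety_identifier (or user) field")
    return None  # identification turned off: no per-user scoping


body = {"model": "assistant", "user": "user-123", "messages": []}
print(user_id_from_request(body))
```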
Listing available agents
curl http://localhost:8421/v1/models
Returns every agent you've defined, in OpenAI's model-list format.
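OpenAI's model-list format wraps its entries in {"object": "list", "data": [...]}. This Python sketch builds that shape from agent definitions. This page does not say what Coulisse puts in created and owned_by, so the values below are placeholders, and models_response is a hypothetical name.

```python
import json

# Sketch only: created=0 and owned_by=provider are placeholder assumptions.


def models_response(agents):
    """Build an OpenAI-style model list with one entry per agent."""
    return {
        "object": "list",
        "data": [
            {
                "id": agent["name"],  # clients send this back as "model"
                "object": "model",
                "created": 0,
                "owned_by": agent["provider"],
            }
            for agent in agents
        ],
    }


agents = [
    {"name": "assistant", "provider": "anthropic"},
    {"name": "gpt-assistant", "provider": "openai"},
]
print(json.dumps(models_response(agents), indent=2))
```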
That's the whole loop. Next, dig into how to configure providers.
Providers
Providers are where your model calls actually go. Configure each provider once with its credentials; reference it by name from any number of agents.
Supported providers
| Kind | Config key |
|---|---|
| Anthropic | anthropic |
| Cohere | cohere |
| Deepseek | deepseek |
| Gemini | gemini |
| Groq | groq |
| OpenAI | openai |
Shape
providers:
  anthropic:
    api_key: sk-ant-...
  openai:
    api_key: sk-...
  gemini:
    api_key: ...
Each provider takes a single field: api_key. You only need to list the providers you plan to use — unused ones can be omitted entirely.
Validation
When Coulisse loads your config, it checks that every agent's provider field matches a key under providers. Misspell a provider and startup fails with a clear error:
agent 'assistant' references provider 'antropic' which is not configured
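The check is a set-membership test over the parsed config. Here it is as a Python sketch. check_providers is a hypothetical name, the config is assumed to be already parsed into dicts, and the message format is copied from the error above.

```python
# Sketch only: hypothetical validator over an already-parsed config dict.


def check_providers(config):
    """Return one error per agent whose provider is not configured."""
    configured = set(config.get("providers", {}))
    errors = []
    for agent in config.get("agents", []):
        provider = agent.get("provider")
        if provider not in configured:
            errors.append(
                f"agent '{agent['name']}' references provider "
                f"'{provider}' which is not configured"
            )
    return errors


config = {
    "providers": {"anthropic": {"api_key": "sk-ant-..."}},
    "agents": [{"name": "assistant", "provider": "antropic",
                "model": "claude-sonnet-4-5-20250929"}],
}
for error in check_providers(config):
    print(error)
```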
Switching providers
Because providers are referenced by name, switching an agent from one backend to another is a one-line change:
agents:
  - name: assistant
    provider: anthropic                  # ← change this …
    model: claude-sonnet-4-5-20250929    # ← … and this
    preamble: You are helpful.
No client code changes, no redeployment of downstream apps. See Multi-backend support for more on mixing providers.
Agents
Agents are the named personas clients can talk to. Each agent pins down:
- Which provider to call
- Which upstream model to ask for
- What system prompt to prepend
- Which tools (if any) to expose
Shape
agents:
  - name: code-reviewer
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: |
      You are a thorough code reviewer. Focus on correctness,
      clarity, and security. Point out subtle bugs and suggest
      concrete improvements.
    mcp_tools:
      - server: hello
        only:
          - say_hello
Fields
name (required)
The agent identifier. Clients select this agent by passing name as the model field in their request. Names must be unique across the config.
provider (required)
Must match a key under the top-level providers map. Tells Coulisse which backend to route through.
model (required)
The upstream model identifier. This is provider-specific — e.g. claude-sonnet-4-5-20250929 for Anthropic, gpt-4o for OpenAI, gemini-2.0-flash for Gemini.
preamble (optional)
A system prompt prepended to every conversation this agent handles. Use it to define tone, expertise, constraints, output format — anything you'd normally put in a system message.
Defaults to empty. YAML block scalars (|) are handy for multi-line preambles.
judges (optional)
A list of judge names (from the top-level judges: block) that evaluate this agent's replies in the background. Empty or omitted = no evaluation. See LLM-as-judge evaluation for the full story.
agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    judges: [quality, deep_audit]
mcp_tools (optional)
A list of MCP servers and tools this agent is allowed to use. See MCP tools for the full story.
mcp_tools:
  - server: hello          # all tools from "hello"
  - server: calculator     # all tools from "calculator"
    only:                  # …but only these specific ones
      - add
      - multiply
skills (optional)
Names of skills from the top-level skills: directory this agent can use. Each listed skill becomes a tool: its description is advertised to the model, and calling it returns the skill's full instructions. Names not present in the catalog are silently ignored.
skills: [resume-review, salary-negotiation]
See Skills for the full walkthrough.
max_turns (optional)
Maximum number of tool-calling rounds per turn before Coulisse returns the last response. Defaults to 8. Raise it for agents that chain many tool calls in one go (e.g. a coder reading files, editing, handing off to QA).
max_turns: 16
subagents (optional)
A list of other agent names exposed to this agent as callable tools. When the agent's model decides to invoke one, Coulisse starts a fresh conversation against that agent and returns its final message as the tool result.
subagents: [onboarder, resume_critic]
Each name must refer to another entry under agents. Self-reference and duplicates are rejected at startup. Nested invocations are capped at depth 4 to prevent runaway loops. See Multi-agent routing for the full walkthrough.
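These startup checks amount to three rules per agent. Below is a Python sketch over already-parsed agent dicts. check_subagents and its error strings are hypothetical.

```python
# Sketch only: hypothetical validator; error wording is illustrative.


def check_subagents(agents):
    """Validate every agent's subagents list: known, not self, no duplicates."""
    names = {agent["name"] for agent in agents}
    errors = []
    for agent in agents:
        seen = set()
        for sub in agent.get("subagents", []):
            if sub == agent["name"]:
                errors.append(f"agent '{sub}' lists itself as a subagent")
            elif sub in seen:
                errors.append(f"agent '{agent['name']}' lists subagent '{sub}' twice")
            elif sub not in names:
                errors.append(f"agent '{agent['name']}' references unknown subagent '{sub}'")
            seen.add(sub)
    return errors


agents = [
    {"name": "career_coach", "subagents": ["onboarder", "resume_critic"]},
    {"name": "onboarder"},
    {"name": "resume_critic"},
]
print(check_subagents(agents))
```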
purpose (optional)
A short tool description shown to other agents when this one is listed under their subagents. Keep it concrete: it's how a calling agent's model decides when to invoke this specialist. You can omit it for agents that clients call directly and that never serve as subagents. Without it, Coulisse falls back to "Invoke the '<name>' subagent.", which works, but a hand-written purpose is what makes multi-agent orchestration reliable.
purpose: Critique and rewrite a resume for a target role.
Runtime overrides
Agents can also be created, edited, and disabled at runtime through the admin UI or HTTP without touching coulisse.yaml. These runtime entries live in the SQLite database alongside conversation memory and judge scores; the YAML file is never modified by the server.
The resolution rule is simple: when a name is requested, the database is checked first. If a row exists there, it wins. Otherwise the YAML entry (if any) is used. A row can also be a tombstone — a marker that disables a YAML-declared name without removing it from the file.
Each agent carries one of these labels in the admin UI:
- yaml: the agent comes from coulisse.yaml; no database row exists.
- dynamic: created via the admin UI or HTTP; no YAML entry of this name.
- override: both YAML and the database define this name; the database version is what runs.
- tombstoned: a database row disables this name; the agent is hidden from clients even if YAML still declares it.
A "Reset to YAML" action on an override deletes the database row, letting the YAML version reassert. The same action on a tombstoned row re-enables the agent. Database edits never modify the YAML file: if you want a change to survive a database wipe, edit the YAML.
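The lookup order fits in a few lines. This Python sketch models the database as a dict whose values are either an agent definition or a tombstone sentinel. The real store is SQLite. Several names are hypothetical: TOMBSTONE, resolve_agent, reset_to_yaml, and the "unknown" label for a name found nowhere.

```python
# Sketch only: a dict stands in for the SQLite table; names are hypothetical.
TOMBSTONE = object()  # sentinel: a database row that disables a name


def resolve_agent(name, yaml_agents, db_rows):
    """Return (agent, label); the database is checked first and wins."""
    in_yaml = name in yaml_agents
    if name in db_rows:
        row = db_rows[name]
        if row is TOMBSTONE:
            return None, "tombstoned"
        return row, "override" if in_yaml else "dynamic"
    if in_yaml:
        return yaml_agents[name], "yaml"
    return None, "unknown"  # assumption: label for a name defined nowhere


def reset_to_yaml(name, db_rows):
    """Implement the "Reset to YAML" action: drop the database row."""
    db_rows.pop(name, None)


yaml_agents = {"assistant": {"model": "a"}, "pirate": {"model": "p"}}
db_rows = {"assistant": {"model": "b"}, "pirate": TOMBSTONE, "helper": {"model": "h"}}
for name in ("assistant", "pirate", "helper"):
    print(name, resolve_agent(name, yaml_agents, db_rows)[1])
```

Resetting a tombstoned name deletes its row, so the next lookup falls through to YAML and the agent reappears.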
Several agents, one config
Define as many agents as you want. A common pattern is having variants of the same model with different preambles:
agents:
  - name: friendly
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are warm and encouraging.
  - name: terse
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: Reply in one sentence. No preamble, no filler.
  - name: pirate
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: Respond exclusively as a pirate, arrr.
Clients switch between them by changing the model field — no server redeploy, no code change.
Memory
Coulisse remembers two things automatically:
- Conversation history — every message in every turn, per user. Always on.
- User state — durable facts and preferences extracted from those conversations and recalled into future prompts. Off by default; one line of YAML turns it on.
Quick start
The simplest possible memory config is no config at all. Omit the memory: block and you get:
- Conversation history kept in SQLite at .coulisse/coulisse-memory.db.
- Long-term user state off.
Turning on long-term user state takes a single line:
memory:
  user_state: true
Now Coulisse will, after each turn:
- Ask a small "haiku-tier" model what's worth remembering about the user.
- Embed those facts and store them.
- On future requests, recall the most relevant ones and inject them into the prompt as a Known about the user: block.
You don't pick the embedder or the extraction model — Coulisse derives both automatically from your providers: block. (See auto-derivation below for the rules.)
What gets injected into the prompt
When user state is on, every request to an agent gets a system message like:
Known about the user:
- [fact] lives in Paris
- [preference] prefers WhatsApp-style short answers
…inserted after your agent's preamble and before the conversation history.
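The ordering can be sketched in Python as follows. It assumes the facts travel as a separate system message placed between the preamble and the history, as the example above suggests. build_messages and the (kind, text) fact tuples are hypothetical.

```python
# Sketch only: hypothetical prompt assembly; fact storage format is assumed.


def build_messages(preamble, facts, history):
    """Order: preamble, then the recalled-facts block, then the conversation."""
    messages = []
    if preamble:
        messages.append({"role": "system", "content": preamble})
    if facts:
        lines = ["Known about the user:"]
        lines += [f"- [{kind}] {text}" for kind, text in facts]
        messages.append({"role": "system", "content": "\n".join(lines)})
    return messages + list(history)


facts = [("fact", "lives in Paris"),
         ("preference", "prefers WhatsApp-style short answers")]
history = [{"role": "user", "content": "Any plans for the weekend?"}]
for message in build_messages("You are a helpful assistant.", facts, history):
    print(message["role"], "|", message["content"])
```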
Where data lives
There is nothing to configure. The database is always .coulisse/coulisse-memory.db
— the project state directory next to your coulisse.yaml, alongside the log,
PID, MCP secrets, and uploaded files. Created on first boot if missing.
For Docker, mount the .coulisse/ directory on a volume so it survives container
restarts.
Advanced
You usually don't need any of this. Skip unless you have a specific reason — defaults are picked to "just work" for the common case.
Picking the extraction model explicitly
By default Coulisse picks a known cheap model from the first configured provider in its priority order (see Auto-derivation below). To pin one:
memory:
  user_state:
    learn_from:
      provider: anthropic
      model: claude-haiku-4-5-20251001
Picking the embedder explicitly
memory:
  user_state:
    embed_with:
      provider: voyage
      model: voyage-3.5
      api_key: pa-...   # required for Voyage
Voyage is the only embedder that needs an explicit API key here — openai reuses the key from your top-level providers.openai entry.
Recall and dedup tuning
memory:
  user_state:
    recall_k: 5               # how many facts to recall per request
    dedup_threshold: 0.9      # cosine similarity above which a "new" fact is dropped
    max_facts_per_turn: 5     # cap on facts written per exchange
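The following Python sketch, over plain lists of vectors, shows what the three knobs control. It makes two assumptions: dedup compares each candidate against every fact already stored for the same user, and recall ranks facts by cosine similarity to the query embedding. Coulisse's storage and scoring may differ, and all names are hypothetical.

```python
import math

# Sketch only: hypothetical in-memory model of dedup and recall.


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def store_new_facts(stored, candidates, dedup_threshold=0.9, max_facts_per_turn=5):
    """Append up to max_facts_per_turn candidates, skipping near-duplicates.

    stored and candidates are lists of (text, vector) pairs.
    """
    written = 0
    for text, vec in candidates:
        if written >= max_facts_per_turn:
            break
        if any(cosine(vec, old) > dedup_threshold for _, old in stored):
            continue  # too similar to something already remembered
        stored.append((text, vec))
        written += 1
    return written


def recall(stored, query_vec, recall_k=5):
    """Return the texts of the recall_k facts most similar to the query."""
    ranked = sorted(stored, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:recall_k]]
```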
Auto-derivation
When user_state: true (or when fields under user_state: are omitted):
- Embedder. If openai is in your providers:, Coulisse uses text-embedding-3-small and reuses the OpenAI key. Otherwise it falls back to the offline hash embedder (deterministic, no semantic understanding; fine for tests, never for production).
- Extraction model. Coulisse picks the first configured provider in this priority order: anthropic → openai → gemini → groq → deepseek → cohere. It then uses that provider's known cheap model (e.g. claude-haiku-4-5-20251001, gpt-4o-mini).
If user_state: true but you have no providers configured, Coulisse refuses to start with a clear error.
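Together, these rules are a priority scan plus one membership test. Below is a Python sketch; derive_user_state and the error strings are hypothetical. The cheap model is filled in only for the two providers whose defaults this page names. The others come back as None rather than a guessed model ID.

```python
# Sketch only: hypothetical derivation; error strings are illustrative.
PRIORITY = ["anthropic", "openai", "gemini", "groq", "deepseek", "cohere"]

# Only the two cheap models named on this page; the rest are left unknown.
KNOWN_CHEAP = {"anthropic": "claude-haiku-4-5-20251001", "openai": "gpt-4o-mini"}


def derive_user_state(providers):
    """Return (extraction_provider, extraction_model, embedder, embed_model)."""
    if not providers:
        raise ValueError("memory.user_state is enabled but no providers are configured")
    extractor = next((kind for kind in PRIORITY if kind in providers), None)
    if extractor is None:
        raise ValueError("no supported provider configured")
    if "openai" in providers:
        embedder = ("openai", "text-embedding-3-small")  # reuses the OpenAI key
    else:
        embedder = ("hash", None)  # offline fallback, not for production
    return (extractor, KNOWN_CHEAP.get(extractor), *embedder)


print(derive_user_state({"anthropic": {"api_key": "sk-ant-..."}}))
```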
Supported embedder models
- openai: text-embedding-3-small (1536 dims, default), text-embedding-3-large (3072 dims), text-embedding-ada-002 (1536 dims).
- voyage: voyage-3.5 (1024 dims, default), voyage-3-large (1024), voyage-3.5-lite (1024), voyage-code-3 (1024), voyage-finance-2 (1024), voyage-law-2 (1024), voyage-code-2 (1536).
- hash: any positive dims (default 32). Offline only.
Unknown model names fail at startup with a clear error.
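In effect, the list above is a lookup table checked once at startup. Below is a Python sketch of that check. embedding_dims and the error text are hypothetical, and the dims parameter for hash mirrors the "any positive dims" rule without claiming its exact YAML spelling.

```python
# Sketch only: hypothetical startup lookup built from the list above.
EMBEDDER_DIMS = {
    "openai": {
        "text-embedding-3-small": 1536,
        "text-embedding-3-large": 3072,
        "text-embedding-ada-002": 1536,
    },
    "voyage": {
        "voyage-3.5": 1024, "voyage-3-large": 1024, "voyage-3.5-lite": 1024,
        "voyage-code-3": 1024, "voyage-finance-2": 1024, "voyage-law-2": 1024,
        "voyage-code-2": 1536,
    },
}
DEFAULT_MODEL = {"openai": "text-embedding-3-small", "voyage": "voyage-3.5"}


def embedding_dims(provider, model=None, dims=None):
    """Resolve the vector size for an embedder, failing fast on unknown names."""
    if provider == "hash":
        size = 32 if dims is None else dims
        if size <= 0:
            raise ValueError("hash embedder needs a positive dims")
        return size
    if provider not in EMBEDDER_DIMS:
        raise ValueError(f"unknown embedder provider '{provider}'")
    model = model or DEFAULT_MODEL[provider]
    if model not in EMBEDDER_DIMS[provider]:
        raise ValueError(f"unknown {provider} embedding model '{model}'")
    return EMBEDDER_DIMS[provider][model]
```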
Disabling user state
Either omit the user_state: field entirely or set it to false:
memory:
  user_state: false
When disabled, Coulisse keeps conversation history but performs no extraction and no recall.
Example configs
Anthropic only — auto-everything
providers:
  anthropic:
    api_key: sk-ant-...
memory:
  user_state: true
Auto-resolution: extraction uses claude-haiku-4-5-20251001, embeddings fall back to the offline hash embedder (because Voyage needs an explicit api_key).
OpenAI end-to-end
providers:
  openai:
    api_key: sk-...
memory:
  user_state: true
Auto-resolution: extraction uses gpt-4o-mini, embeddings use text-embedding-3-small with the OpenAI key.
Anthropic completions + Voyage embeddings
providers:
  anthropic:
    api_key: sk-ant-...
memory:
  user_state:
    embed_with:
      provider: voyage
      model: voyage-3.5
      api_key: pa-...
Offline dev — no external calls
Omit the memory: block entirely (or set user_state: false): conversation
history is kept on disk under .coulisse/, with no extraction or embedding API
calls. Delete the database any time with coulisse reset.
MCP tools
Coulisse can borrow tools from Model Context Protocol servers and hand them to your agents. The config has one rule: declare what the server is, not what protocol it speaks. Coulisse infers the transport from the shape of the entry.
Declaring MCP servers
mcp:
  # Remote MCP — just paste the URL. OAuth is auto-enabled.
  todoist:
    url: https://ai.todoist.net/mcp
  # Local stdio MCP — give it a command.
  hello:
    command: uvx
    args:
      - --from
      - git+https://github.com/macsymwang/hello-mcp-server.git
      - hello-mcp-server
  # Plain HTTP MCP without auth — explicit opt-out.
  calculator:
    url: http://localhost:8080
    oauth: false
The todoist entry above is zero config, the same UX as ChatGPT: you paste the URL. On first use, Coulisse runs RFC 8414 discovery and RFC 7591 Dynamic Client Registration, mints a per-user connect link, and stores the token in the vault.
You never write oauth: for the common case — a url: server gets discover-mode OAuth on its own. Reach for the oauth: map only to override scopes or use static credentials:
mcp:
  # Override discovered scopes
  custom:
    url: https://example.com/mcp
    oauth:
      scopes: [read:items, write:items]   # mode: discover is implied
  # Pre-registered (static) OAuth credentials — for providers that
  # don't support Dynamic Client Registration
  legacy:
    url: https://internal.example.com/mcp
    oauth:
      mode: static
      authorization_url: https://auth.example.com/authorize
      token_url: https://auth.example.com/token
      client_id: my-client
      client_secret: my-secret
      redirect_uri: http://localhost:8423/mcp/legacy/oauth/callback
That's it. No transport: field, no oauth: block for the common case, no shim wrappers. Coulisse figures out:
- url: present → HTTP/SSE transport (SSE if the URL path contains /sse, otherwise streamable HTTP).
- command: present → stdio transport, with optional args: / env: for the child process.
- oauth: is the only thing you write yourself, and only to change the defaults: pin scopes, supply static credentials, or opt out with oauth: false.
Auto-detected transport
The path heuristic: if the URL has an /sse path segment (https://mcp.atlassian.com/v1/sse), Coulisse uses the older MCP-over-SSE protocol. Everything else uses streamable HTTP. URLs without /sse that turn out to be SSE-only will fail with a Missing sessionId parameter 404 on first call; switch to the explicit form below.
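The detection rules and the /sse heuristic fit in one function. Below is a Python sketch; infer_transport is hypothetical. It treats /sse as a whole path segment, so /ssext would not count. That is one reading of the two descriptions on this page, not a confirmed detail.

```python
from urllib.parse import urlparse

# Sketch only: hypothetical transport inference from an MCP entry's shape.


def infer_transport(entry):
    """Return "stdio", "sse", or "http" for a parsed mcp: entry."""
    if "transport" in entry:  # explicit legacy form overrides detection
        return entry["transport"]
    if "command" in entry:
        return "stdio"
    if "url" in entry:
        segments = urlparse(entry["url"]).path.split("/")
        return "sse" if "sse" in segments else "http"  # http = streamable HTTP
    raise ValueError("MCP entry needs either url: or command:")


print(infer_transport({"url": "https://mcp.atlassian.com/v1/sse"}))
```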
stdio config fields
- command (required): the executable to spawn (uvx, python, node, …)
- args (optional): arguments passed to the command
- env (optional): environment variables for the child process
mcp:
  my-tool:
    command: python
    args: [-m, my_mcp_server]
    env:
      DEBUG: "1"
      API_KEY: abc123
Explicit transport: (legacy / override)
The verbose form still works if you need to override the auto-detection:
mcp:
  legacy:
    transport: sse                          # one of: http, sse, stdio
    url: https://example.com/v2/endpoint    # despite no /sse segment
Existing YAMLs that use transport: continue to parse unchanged. New code should prefer the URL-only / command-only form above.
Per-user OAuth (optional)
MCP servers that require user-delegated credentials (Todoist, Atlassian, GitHub,
Google Drive, etc.) can be configured with an oauth: block. Coulisse handles
the authorization flow per-user and injects each user's token automatically at
call time — Alice's token is never reachable by Bob.
Two modes:
mode: discover (recommended for modern MCP servers)
Spec-compliant MCP servers (Todoist, Atlassian, Linear, …) advertise their OAuth
endpoints via /.well-known/oauth-authorization-server and accept Dynamic Client
Registration. Coulisse discovers + registers itself lazily, on the first user to
authorise. No credentials in YAML — and no oauth: block at all, since a
URL-based server defaults to discover:
mcp:
  todoist:
    url: https://ai.todoist.net/mcp
    # discover OAuth is automatic; add an oauth: map only to pin scopes:
    # oauth:
    #   scopes: [data:read_write]
A handful of servers only honour tokens issued to mcp-remote's grandfathered
client id and reject fresh DCR registrations (Todoist's hosted MCP is the
current example). For those, run mcp-remote yourself as a stdio child — there
is no special flag:
mcp:
  todoist:
    command: npx
    args: [-y, mcp-remote, https://ai.todoist.net/mcp]
mode: static (for non-DCR providers)
For OAuth providers that require a pre-registered app (GitHub OAuth apps, classic Atlassian Connect, etc.):
mcp:
  github:
    transport: http
    url: https://api.githubcopilot.com/mcp
    oauth:
      mode: static
      authorization_url: https://github.com/login/oauth/authorize
      client_id: "${GH_CLIENT_ID}"
      client_secret: "${GH_CLIENT_SECRET}"
      redirect_uri: https://coulisse.example.com/mcp/github/oauth/callback
      scopes: [repo, read:user]
      token_url: https://github.com/login/oauth/access_token
static requires: authorization_url, client_id, client_secret,
redirect_uri, token_url. Missing any of these at startup is a fatal error.
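Combining this rule with the defaults from Declaring MCP servers gives a small decision function. Below is a Python sketch. It assumes oauth: is absent, false, or a map, and treats stdio entries as never using OAuth. oauth_mode, check_static_oauth, and the error text are hypothetical.

```python
# Sketch only: hypothetical OAuth mode resolution and static-field check.
STATIC_REQUIRED = ("authorization_url", "client_id", "client_secret",
                   "redirect_uri", "token_url")


def oauth_mode(entry):
    """Return "discover", "static", or None (no OAuth) for an MCP entry."""
    oauth = entry.get("oauth")
    if oauth is False or "url" not in entry:
        return None  # explicit opt-out, or a stdio server
    if oauth is None:
        return "discover"  # url: servers default to discover-mode OAuth
    return oauth.get("mode", "discover")  # a map without mode: implies discover


def check_static_oauth(name, entry):
    """Fail startup if a static OAuth block is missing a required field."""
    if oauth_mode(entry) != "static":
        return None
    missing = [key for key in STATIC_REQUIRED if not entry["oauth"].get(key)]
    if missing:
        raise ValueError(f"mcp '{name}': static oauth is missing {', '.join(missing)}")
    return None
```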
Both modes share the same infrastructure secrets (vault encryption + HMAC).
Coulisse auto-generates them on first boot and persists them in
.coulisse/secrets.env — no manual setup needed for local use. Override
with COULISSE_VAULT_KEY / COULISSE_HMAC_KEY env vars for hosted
deployments. auth.mcp_consumer_secret is optional (only gates the admin
POST /connect-link endpoint). Set public_base_url: at the top level
when Coulisse runs on a public hostname; defaults to
http://localhost:{port} for local use.
See Per-user OAuth for MCP for the full flow, endpoints, secrets resolution, and the security trust-model warning.
Granting tool access to agents
An agent only sees tools you explicitly give it. Reference the server name under mcp_tools:
agents:
  - name: helper
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    mcp_tools:
      - server: hello    # all tools from "hello"
Restrict to a subset with only:
mcp_tools:
  - server: hello
    only:
      - say_hello        # only this tool, nothing else
Discovering tool names
On startup Coulisse connects to each non-OAuth MCP server and logs the tools it discovered. OAuth-enabled servers connect per-user on first use. Tool names in your only list must match what the server advertises — check the startup output or the server's own docs.
How tool calls work
When a request arrives for an agent with tools:
- Coulisse collects the agent's allowed tools from the MCP servers.
- It forwards them to the model as tool definitions.
- If the model calls a tool, Coulisse dispatches to the MCP server and feeds the result back.
- This loops until the model produces a final answer (up to 8 turns by default, configurable via the agent's max_turns field).
Your client doesn't see any of this — the tool loop is invisible, and only the final assistant message is returned.
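Below is a Python sketch of that loop, with the model and the MCP server replaced by plain callables. The message shapes loosely follow the OpenAI chat format, but arguments are a dict here rather than a JSON string. run_tool_loop, call_model, and call_tool are hypothetical, and exactly how Coulisse counts a "turn" against max_turns is an assumption.

```python
# Sketch only: hypothetical tool loop; call_model/call_tool are stand-ins.


def run_tool_loop(call_model, call_tool, messages, max_turns=8):
    """Drive model <-> tool rounds until a final answer or the turn cap.

    call_model(messages) returns {"content": ..., "tool_calls": [...]}, each
    call being {"id", "name", "arguments"}; call_tool(name, arguments) -> str.
    """
    messages = list(messages)
    reply = call_model(messages)
    for _ in range(max_turns):
        calls = reply.get("tool_calls") or []
        if not calls:
            break  # final answer: the only thing the client receives
        messages.append({"role": "assistant", "content": reply.get("content"),
                         "tool_calls": calls})
        for call in calls:
            result = call_tool(call["name"], call["arguments"])
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": result})
        reply = call_model(messages)
    return reply  # at the cap, this is simply the last response
```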
See MCP tool integration for a full walkthrough.
Multi-agent routing
Coulisse lets you define multiple agents and route between them with nothing more than the model field of a request. No extra endpoints, no custom headers, no proxy tricks.
Why it matters
Most apps end up needing more than one model configuration:
- A fast, cheap agent for classification and quick replies.
- A heavier agent for hard reasoning.
- A specialized agent (code reviewer, translator, summarizer) with a tuned preamble.
- A tool-using agent that can reach into an MCP server.
Without something like Coulisse, that means either multiple deployments or a growing pile of if (mode === ...) switches inside your app.
The pattern
Declare each variant as a separate agent:
agents:
  - name: triage
    provider: anthropic
    model: claude-haiku-4-5-20251001
    preamble: Classify the user's intent. Reply with a single word.
  - name: reasoner
    provider: anthropic
    model: claude-opus-4-7
    preamble: You are a careful reasoner. Think step by step.
  - name: translator
    provider: openai
    model: gpt-4o
    preamble: Translate the user's message into French.
Your application picks which agent to call by setting the model field:
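# client is the OpenAI client from "Making a request"
messages = [{"role": "user", "content": "Hello!"}]
user = {"safety_identifier": "user-123"}
fast = client.chat.completions.create(model="triage", messages=messages, extra_body=user)
smart = client.chat.completions.create(model="reasoner", messages=messages, extra_body=user)
fr = client.chat.completions.create(model="translator", messages=messages, extra_body=user)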
What each agent brings to the request
When a request arrives, Coulisse:
- Looks up the named agent.
- Prepends the agent's preamble as a system message.
- Resolves the agent's allowed MCP tools (if any).
- Forwards the call to the agent's configured provider and model.
- Records the exchange in the caller's per-user memory.
Changing agents is free — you don't need to redeploy anything on the client side.
Discovering agents at runtime
GET /v1/models returns every agent in the config in OpenAI's standard model-list format. Useful for UIs that want to populate a model picker from the server:
curl http://localhost:8421/v1/models
Subagents: agents as tools
Routing by model lets the client pick an agent per request. Sometimes you want one agent to call another from within a turn, so the conversation stays with the top-level agent while specialists handle focused sub-tasks. Coulisse exposes this via the subagents field.
agents:
  - name: onboarder
    provider: anthropic
    model: claude-haiku-4-5-20251001
    purpose: Collect the user's profile — first name, last name, phone, goals.
    preamble: |
      Ask the user for any missing profile field. Keep questions short.
  - name: resume_critic
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    purpose: Critique and rewrite a resume for a target role.
    preamble: |
      Given a resume and a target role, return a revised resume and
      a bullet list of the biggest gaps to address.
  - name: career_coach
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    subagents: [onboarder, resume_critic]
    preamble: |
      Guide the user. Delegate to `onboarder` if the profile is
      incomplete, and `resume_critic` when they want resume work.
When career_coach runs, the onboarder and resume_critic agents appear in its tool list alongside any MCP tools. If the model calls onboarder, Coulisse starts a fresh conversation against that agent with just the message it was given — the onboarder sees its own preamble and its own MCP tools, nothing inherited from the parent. The onboarder's final assistant message is returned to the coach as the tool result.
The purpose field
purpose is the tool description shown to the calling agent. It's how the coach's LLM decides whether this subagent is the right choice for the current turn. Keep it short and concrete — "Critique and rewrite a resume for a target role" is good; "Helpful assistant" is useless.
If purpose is absent, Coulisse falls back to "Invoke the '<name>' subagent." — functional, but a clear purpose is what makes orchestration reliable.
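Concretely, the purpose string becomes the description of a tool definition. Below is a Python sketch in OpenAI's function-tool format. subagent_tool is hypothetical. The single message parameter is also an assumption about how the parent passes its request; this page does not specify it.

```python
# Sketch only: hypothetical tool definition; the parameter schema is assumed.


def subagent_tool(agent):
    """Describe a subagent as an OpenAI-style function tool."""
    name = agent["name"]
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": agent.get("purpose") or f"Invoke the '{name}' subagent.",
            "parameters": {
                "type": "object",
                "properties": {
                    "message": {
                        "type": "string",
                        "description": "What to ask the subagent.",
                    }
                },
                "required": ["message"],
            },
        },
    }


print(subagent_tool({"name": "onboarder"})["function"]["description"])
```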
Bounded recursion
Calling a subagent is itself a tool call — the subagent can have its own subagents, which can have their own, and so on. To prevent a pathological A → B → A → … loop from burning tokens, Coulisse caps nested invocations at depth 4. Going over returns a clear error that the parent agent sees and can react to.
Fresh context
Every subagent invocation starts with a new conversation. The subagent does not see the parent's message history, the user's original request, or any other sibling subagent's output. It gets only the message the parent passed when calling it, plus its own preamble.
This isolation is deliberate. It keeps subagents focused, prevents context bloat, and makes each subagent's behavior reproducible in isolation. If you want data to flow between agents, store it in an MCP server and have both agents read it — Coulisse owns no cross-agent state.
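The depth cap and the fresh message list can be sketched together in Python. invoke_subagent and run_agent are hypothetical stand-ins. The depth-counting convention is an assumption: the top-level agent runs at depth 0, and the fifth nested call is the one refused.

```python
# Sketch only: hypothetical subagent invocation with a depth cap.
MAX_DEPTH = 4


def invoke_subagent(agents, name, message, run_agent, depth=0):
    """Start a fresh conversation with a subagent, refusing past MAX_DEPTH.

    run_agent(agent, messages, depth) stands in for a full agent turn
    (its own tool loop included) and returns the final assistant text.
    """
    if depth >= MAX_DEPTH:
        # Returned as the tool result so the parent agent can react to it.
        return f"error: subagent depth limit ({MAX_DEPTH}) reached calling '{name}'"
    agent = agents[name]
    messages = []
    if agent.get("preamble"):
        messages.append({"role": "system", "content": agent["preamble"]})
    # Only the parent's message: no parent history, no sibling outputs.
    messages.append({"role": "user", "content": message})
    return run_agent(agent, messages, depth + 1)
```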
Why subagents and MCPs live side by side
mcp_tools and subagents both appear in an agent's tool list, but they model different things:
- An MCP tool is a stateless function call against an external server: fixed schema, data in and data out.
- A subagent is another LLM conversation that happens to be kicked off by a tool call. It has its own preamble, its own tool loop, and can itself delegate further.
Reach for mcp_tools when the work is a concrete operation (save a record, search a database, send an email). Reach for subagents when the work needs its own LLM reasoning under a different preamble.
MCP tool integration
Coulisse is a client for Model Context Protocol servers. Any MCP-compliant tool — a calculator, a filesystem browser, a REST API wrapper, your in-house data fetcher — becomes usable by any agent with a one-line config change.
End-to-end example
Imagine a small MCP server that exposes a say_hello tool. Register it and hand it to an agent:
providers:
  anthropic:
    api_key: sk-ant-...
mcp:
  hello:
    transport: stdio
    command: uvx
    args:
      - --from
      - git+https://github.com/macsymwang/hello-mcp-server.git
      - hello-mcp-server
agents:
  - name: greeter
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You greet people warmly.
    mcp_tools:
      - server: hello
Start the server. On boot, Coulisse discovers the server's tools and logs them.
Now the greeter agent can call say_hello whenever the model decides it's useful. Your client makes a normal chat completion request:
curl http://localhost:8421/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "greeter",
    "safety_identifier": "user-1",
    "messages": [
      {"role": "user", "content": "Please greet Alice."}
    ]
  }'
The model may call the tool one or more times; Coulisse runs the tool loop internally and returns only the final assistant message.
Under the hood, every invocation — tool name, arguments, result (or error) — is recorded against the assistant message that produced it, so you can replay the turn in the studio UI and see which tools fired and what came back. This is tool-call capture for debugging, not an extension of the OpenAI surface: the wire response your SDK receives is unchanged.
Transports
- stdio — good for local MCP servers you spawn yourself (Python scripts, Node programs, CLI tools). Coulisse manages the child process.
- http — good for long-running MCP services, especially ones shared across multiple Coulisse instances.
Both are configured the same way conceptually; see MCP tools for fields.
Scoping tools per agent
Different agents can see different subsets of tools, even from the same server:
agents:
  - name: power-user
    mcp_tools:
      - server: filesystem   # every tool the filesystem server offers
  - name: read-only
    mcp_tools:
      - server: filesystem
        only:
          - read_file
          - list_files       # write / delete tools aren't exposed
This is Coulisse-side filtering — the model never sees the excluded tools, so it can't call them.
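A minimal sketch of that filtering, assuming an `only:` list acts as a plain allow-list over the server's advertised tool names (the helper name and the `write_file` / `delete_file` tool names are hypothetical). Filtering happens before the tool list reaches the model, which is why an excluded tool can't be called at all.

```python
from __future__ import annotations


def visible_tools(server_tools: list[str], only: list[str] | None) -> list[str]:
    """Return the tools an agent may see from one MCP server (hypothetical helper).

    `only=None` models an mcp_tools entry with no `only:` list, so every tool is
    exposed; otherwise anything outside the allow-list is dropped, preserving
    the server's original ordering.
    """
    if only is None:
        return list(server_tools)
    allowed = set(only)
    return [name for name in server_tools if name in allowed]
```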
Tool loop limits
Coulisse caps a single request at 8 tool-call turns by default. If the model hasn't produced a final answer by then, the request ends. This keeps runaway loops from billing you forever. You can raise or lower the limit per agent with the max_turns field (see the agents YAML reference).
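The turn cap can be sketched as a bounded loop. This assumes a "turn" is one model response that requests a tool; `ModelReply`, `run_tool_loop`, and the callables are hypothetical stand-ins for the provider call and MCP execution, not rig's or Coulisse's API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable

DEFAULT_MAX_TURNS = 8  # the documented default; max_turns overrides it per agent


@dataclass
class ModelReply:
    """One model response: either a final answer or a request to call a tool."""
    text: str | None = None
    tool_call: tuple[str, dict[str, Any]] | None = None


class TurnLimitExceeded(Exception):
    """Raised when the model is still calling tools after max_turns turns."""


def run_tool_loop(
    call_model: Callable[[list[str]], ModelReply],
    run_tool: Callable[[str, dict[str, Any]], str],
    max_turns: int = DEFAULT_MAX_TURNS,
) -> str:
    transcript: list[str] = []  # tool results fed back to the model
    for _ in range(max_turns):
        reply = call_model(transcript)
        if reply.tool_call is None:
            return reply.text or ""  # final answer: the only thing the client sees
        name, args = reply.tool_call
        transcript.append(f"{name} -> {run_tool(name, args)}")
    raise TurnLimitExceeded(f"no final answer after {max_turns} tool-call turns")
```

A model that keeps asking for `say_hello` runs the tool eight times and then the request ends; one that answers after two tool results returns normally.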
Capture limitations
Tool-call capture only runs on the streaming path — every OpenAI SDK uses streaming for chat completions by default, so this covers normal usage. Non-streaming requests ("stream": false) still execute tools correctly; their invocations just aren't captured for the studio trail, because rig's non-streaming API doesn't expose intermediate events.
If a client disconnects mid-stream after a tool call has fired but before the result lands, the call is persisted with result: null so the studio UI still shows that the attempt happened.
Per-user OAuth
For MCP servers that require each user to authorize access with their own account (Jira, GitHub, Google, etc.), see Per-user OAuth for MCP.
Multi-backend support
Coulisse speaks to six providers out of the box:
- Anthropic
- OpenAI
- Gemini
- Cohere
- Deepseek
- Groq
You can mix them freely in a single config.
Why mix backends?
- Cost tiering. Run quick tasks on a cheap model (Groq, Haiku, gpt-4o-mini), hard tasks on a flagship.
- Capability routing. Some tasks benefit from a specific provider's strengths — long-context summarization on Gemini, coding on Sonnet, reasoning on Opus.
- Redundancy. If one provider has an outage, flip a single provider field to route through another.
- Evaluation. A/B the same preamble on two different models without changing any client code.
One config, many backends
providers:
  anthropic:
    api_key: sk-ant-...
  openai:
    api_key: sk-...
  gemini:
    api_key: ...
  groq:
    api_key: ...
agents:
  - name: quick
    provider: groq
    model: llama-3.3-70b-versatile
    preamble: Answer briefly.
  - name: smart
    provider: anthropic
    model: claude-opus-4-7
    preamble: Think carefully.
  - name: long-context
    provider: gemini
    model: gemini-2.0-flash
    preamble: You excel at synthesizing long documents.
Your client picks one by name — everything else stays the same.
The client side is unchanged
Because Coulisse exposes an OpenAI-compatible API no matter which provider is behind an agent, your client code never has to know. You don't install the Anthropic SDK, Gemini SDK, and OpenAI SDK side by side — you just use the OpenAI SDK and change the model field.
Streaming responses
Coulisse implements OpenAI's Server-Sent Events (SSE) format for chat completions. Set stream: true in the request and the server emits incremental chat.completion.chunk frames over the wire — drop-in compatible with the OpenAI Python and JavaScript SDKs and any client that already speaks the OpenAI streaming protocol.
Asking for a stream
Add stream: true to a normal /v1/chat/completions request:
{
  "model": "assistant",
  "safety_identifier": "user-123",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": true
}
The response is text/event-stream instead of JSON. Each frame is one chat.completion.chunk.
Wire format
The first frame announces the assistant role:
data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"role":"assistant"}}]}
Then one frame per text delta:
data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"content":"Hello"}}]}
data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"content":" there"}}]}
A terminal frame sets finish_reason:
data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Including token usage
Set stream_options.include_usage: true to receive a usage field on the terminal chunk:
{
  "model": "assistant",
  "messages": [{"role": "user", "content": "Hi"}],
  "stream": true,
  "stream_options": {"include_usage": true}
}
The terminal frame then carries usage:
data: {"...":"...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"completion_tokens":3,"prompt_tokens":7,"total_tokens":10}}
When include_usage is missing or false, the field is omitted — matching OpenAI's contract.
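To make the wire format concrete, here is a minimal client-side sketch that folds the frames above back into one reply. It assumes one `data:` line per frame (as shown in this section) and uses only the standard library; `collect_stream` is a hypothetical helper, not part of any SDK.

```python
from __future__ import annotations

import json


def collect_stream(lines: list[str]) -> dict:
    """Reassemble chat.completion.chunk frames into text, finish reason and usage."""
    text: list[str] = []
    finish_reason = None
    usage = None
    for line in lines:
        if not line.startswith("data:"):
            continue  # blank separators, ": ..." comments and event: lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if "content" in delta:
                text.append(delta["content"])
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
        if chunk.get("usage") is not None:
            usage = chunk["usage"]  # only present when include_usage is true
    return {"content": "".join(text), "finish_reason": finish_reason, "usage": usage}
```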
Memory and rate limiting
Streaming responses use the same per-user memory bucket and rate-limit accounting as non-streaming requests:
- The user's message and the assistant's reply are appended to memory after the stream ends.
- Token usage is recorded against the rate-limit window when the stream ends.
- If the client disconnects mid-stream, Coulisse persists the partial assistant reply (everything received before the disconnect). This matches what the user actually saw — the next turn won't claim the model said something the user never received.
Tool-using agents
Agents with MCP tools attached stream the same way. Tool-call internals run inside the rig multi-turn loop and are not surfaced to the client; you'll see a pause while a tool runs, then the model's text continues. The delta.content field is the only delta variant Coulisse currently emits.
Subagent handoff events
When an agent delegates to a subagent, the stream doesn't go silent — Coulisse signals the handoff so your UI can show meaningful feedback instead of a frozen spinner.
handoff_started
Emitted immediately before the subagent is invoked:
event: handoff_started
data: {"agent":"resume_critic"}
Use this to update your UI: "Passing to resume_critic…" is better than a silent spinner.
Heartbeat
While a subagent is running, Coulisse emits a keep-alive comment every 20 seconds:
: heartbeat
This is a standard SSE comment (lines starting with :). Most SSE clients ignore it automatically — it exists to prevent proxies and load balancers from closing the connection during long subagent turns.
If your SSE stream goes silent for more than 20 seconds during a subagent turn, that's a bug — open an issue.
Sequence during a handoff
# Parent agent starts responding
data: {"choices":[{"delta":{"content":"Let me get the resume critic on this."}}]}
# Handoff announced
event: handoff_started
data: {"agent":"resume_critic"}
# Heartbeats while subagent works
: heartbeat
: heartbeat
# Subagent result flows back as normal content
data: {"choices":[{"delta":{"content":"Here's the revised resume…"}}]}
data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]
The subagent's internal turns are not surfaced — you only see the final result as delta.content from the parent.
Errors mid-stream
If the upstream provider fails after the stream has started, Coulisse emits one terminal frame containing an error field with the failure reason, then [DONE]. The HTTP status is already 200 by then — clients should check for the error field on the final chunk.
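A sketch of a client that handles everything in this section: named `handoff_started` events, `: heartbeat` comments, and a terminal `error` field. `parse_sse` follows the standard SSE framing rules (a blank line ends an event, `:` starts a comment); `consume` and its `on_handoff` callback are hypothetical, and the sketch assumes only that the error frame carries an `error` key, without relying on its inner shape.

```python
from __future__ import annotations

import json
from typing import Callable, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_type, data) pairs using the SSE framing rules.

    A blank line ends an event, lines starting with ':' are comments (so
    ': heartbeat' keep-alives are skipped), events without data are dropped,
    and a trailing event with no closing blank line is discarded.
    """
    event_type, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data_lines:
                yield event_type, "\n".join(data_lines)
            event_type, data_lines = "message", []
        elif line.startswith(":"):
            continue
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event_type = value
            elif field == "data":
                data_lines.append(value)


def consume(lines: Iterable[str], on_handoff: Callable[[str], None]) -> str:
    """Collect assistant text, reporting subagent handoffs and raising on errors."""
    reply: list[str] = []
    for event_type, data in parse_sse(lines):
        if event_type == "handoff_started":
            on_handoff(json.loads(data)["agent"])  # e.g. show "Passing to resume_critic..."
            continue
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        if "error" in chunk:
            # The HTTP status is already 200, so this field is the only failure signal.
            raise RuntimeError(f"stream failed: {chunk['error']}")
        for choice in chunk.get("choices", []):
            reply.append(choice.get("delta", {}).get("content", ""))
    return "".join(reply)
```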
Studio UI
Coulisse ships a studio UI for browsing the conversations and memories the server has seen, and for editing the live YAML config. It's served by the same binary, under /admin/.
Point a browser at http://localhost:8421/admin/ while the server is running, or run coulisse studio (alias coulisse admin) to open it for you.
What you can do
- List every user the server has seen, most recent activity first, with message and memory counts.
- Open a user to see their full conversation (user, assistant, and system messages) with per-message token counts and relative timestamps.
- See every tool invocation that happened during each assistant turn — rendered inline in the conversation as a collapsed block above the assistant bubble. Expand to see the args, the result (or error body), and a badge marking MCP vs subagent calls. This is the debug view for figuring out what the agent tried and what came back.
- Open the per-turn Telemetry block under any assistant message to see the full causal tree that produced it: every tool call (MCP or subagent) at every depth, with args, result, error, and duration. Unlike the inline top-level tool calls, the telemetry tree also surfaces tool calls made inside subagents — so when a subagent's MCP call fails, the real error is right there instead of being paraphrased into the assistant's text.
- See the long-term memories recalled for that user, tagged as fact or preference.
- See the LLM-as-judge scores for that user, including the mean score per (judge, criterion) and the most recent individual scores with reasoning.
- Browse configured experiments at /admin/experiments: strategy, sticky-by-user flag, per-variant weights, and bandit-strategy mean scores live-loaded from judges.
- Run smoke tests at /admin/smoke: a synthetic-user persona drives a real conversation against any agent or experiment, scores fan out through the same judge pipeline, and the run viewer shows the full transcript with persona and assistant turns side by side. Useful for iterating on agent prompts without writing test scaffolding.
- Mint, monitor, and revoke API tokens at /admin/tokens: issue sk-coulisse-… keys for the /v1/* proxy, each bound to a principal and a spend budget (unlimited, lifetime, or per-month). The list shows current-period and lifetime spend per token; the create form reveals the secret once. See API tokens.
- Edit, add, or disable agents, judges, experiments, and smoke tests at /admin/agents, /admin/judges, /admin/experiments, and /admin/smoke. Each form is a YAML textarea over the same config shape used in coulisse.yaml. Edits and creations write to the database, never to coulisse.yaml; runtime resolution checks the database first, then falls back to YAML. List views label each row as yaml, dynamic (database-only), override (database shadows YAML), or tombstoned (disabled). Override rows expose a "Reset to YAML" action that drops the database row so the YAML version reasserts. See Agents → Runtime overrides for the full semantics; judges, experiments, and smoke tests follow the same model.
- Configure infrastructure from the Settings hub at /admin/settings. Each card (providers, MCP servers, memory, telemetry, auth, storage) links to its own editor (/admin/providers, /admin/mcp, /admin/memory, /admin/telemetry, /admin/auth, /admin/storage). Unlike agents, judges, experiments, and smoke tests, these sections write straight to coulisse.yaml (there is no database shadow) and apply after restart. The whole file is validated before anything touches disk, so an invalid edit is rejected and the running config keeps serving.
- Edit the raw coulisse.yaml at /admin/config/edit: a full-file YAML textarea backed by PUT /admin/config. This is the power-user escape hatch when you want to change several sections at once or touch a field that has no dedicated card.
Editing config: admin UI = API
Every admin route is content-negotiated. The same URL serves an HTML page in a browser, an HTML fragment to htmx, and JSON to a script — whichever the client's Accept/HX-Request headers ask for. The UI is a thin representation of the API; nothing the UI can do is unavailable to a curl call.
# List agents as JSON (effective merged view: database overrides + YAML)
curl -H 'Accept: application/json' http://localhost:8421/admin/agents
# Update an agent (writes to the database, not to coulisse.yaml)
curl -X PUT http://localhost:8421/admin/agents/bob \
-H 'Content-Type: application/yaml' \
--data-binary $'name: bob\nprovider: openai\nmodel: gpt-4o\n'
# Reset an override or tombstone — drops the database row, YAML reasserts
curl -X POST http://localhost:8421/admin/agents/bob/reset
# Read one infrastructure section as JSON
curl -H 'Accept: application/json' http://localhost:8421/admin/telemetry
# Update one section in place (writes that slice back to coulisse.yaml)
curl -X PUT http://localhost:8421/admin/telemetry \
-H 'Content-Type: application/yaml' \
--data-binary $'fmt:\n enabled: true\nsqlite:\n enabled: true\n'
# Replace the whole config file in one shot (this writes to coulisse.yaml)
curl -X PUT http://localhost:8421/admin/config \
-H 'Content-Type: application/yaml' \
--data-binary @coulisse.yaml
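The content negotiation described at the top of this section can be sketched as a single decision function. The precedence below (htmx's `HX-Request: true` first, then an `Accept` header containing `application/json`, otherwise a full page) is an assumption for illustration; the helper name is hypothetical.

```python
from __future__ import annotations


def pick_representation(headers: dict[str, str]) -> str:
    """Choose what an admin route returns for a request (hypothetical helper).

    Header names are compared case-insensitively, as HTTP requires.
    """
    h = {name.lower(): value for name, value in headers.items()}
    if h.get("hx-request", "").lower() == "true":
        return "fragment"  # htmx swap: bare HTML fragment, no layout
    if "application/json" in h.get("accept", ""):
        return "json"      # scripts and curl
    return "page"          # browser: fragment wrapped in the base.html shell
```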
The single-section endpoints — /admin/auth, /admin/memory, /admin/storage, /admin/telemetry (plus the collection endpoints /admin/providers and /admin/mcp) — splice just their slice into the file and leave every other key untouched, so a partial write can't clobber an unrelated section.
Writes through /admin/agents, /admin/judges, /admin/experiments, and /admin/smoke go to the database, never to coulisse.yaml. The other sections (/admin/config, providers, MCP, auth, memory, telemetry, storage) write to YAML. The two write paths are independent: editing an agent in the database has no effect on the file you committed to git.
Secrets render in cleartext. The section editors round-trip the raw YAML slice, so provider API keys, basic-auth passwords, OIDC client secrets, and OTLP headers appear in plaintext in the textarea. The admin surface is authenticated (see below) and the values already live in coulisse.yaml, but be aware the studio is not a secrets vault — don't share your screen on the auth editor.
File watcher: hand-edits hot-reload
Coulisse watches coulisse.yaml while it runs. Edit it in your editor, save, and the live config updates without a restart. The validator runs before any reload — a broken edit is logged and the previous in-memory config keeps serving traffic until you fix the file.
What hot-reloads today: the agents list (runtime + admin display), the judges and experiments lists (admin display only — the routing tables that consume them are still rebuilt on restart). What still requires restart: providers, MCP servers, memory backend, telemetry pipeline, auth.
YAML formatting
Admin saves go through serde_yaml round-trip serialization, so comments, blank lines, and key ordering are not preserved. If you want commented config, hand-edit the file — the watcher picks the change up the same way an admin save would. Comment-preserving writes are tracked as a follow-up.
Authentication
The admin surface is gated by the auth.admin scope in coulisse.yaml. Two mutually exclusive modes: HTTP Basic auth (good for local dev) or OIDC single sign-on (appropriate for shared deployments). Exactly one belongs under auth.admin.
The /v1/chat/completions and /v1/models endpoints use the separate auth.proxy scope — they are never gated by admin auth. SDK clients stay cookie-free even when the studio runs behind OIDC.
Basic auth
auth:
  admin:
    basic:
      password: choose-something-strong
      username: admin   # optional, defaults to "admin"
Every /admin/* request must carry Authorization: Basic <base64(user:pass)>. Browsers prompt via the native login dialog and cache credentials per origin.
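For scripts calling the admin API, the header is standard HTTP Basic (RFC 7617): base64 of `user:pass`. A minimal standard-library sketch (the helper name is hypothetical):

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header value for HTTP Basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"
```

Pass the result as the `Authorization` header, alongside `Accept: application/json`, when calling routes such as /admin/agents from Python.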
OIDC (single sign-on)
Works with any OIDC-compliant IdP: Authentik, Keycloak, Auth0, Google, Microsoft, Okta.
auth:
  admin:
    oidc:
      issuer_url: https://authentik.example.com/application/o/coulisse/
      client_id: coulisse-admin
      client_secret: <confidential-client-secret>   # omit for public PKCE clients
      redirect_url: http://localhost:8421/admin/
      scopes: [email, profile]                      # optional; openid is always added
On first request, the user is redirected to the IdP to log in; afterwards an encrypted session cookie keeps them authenticated on /admin/* until it expires (8 hours of inactivity).
Access control (who may log in) is delegated to the IdP. Coulisse treats "successfully authenticated by your IdP" as "authorized admin" — configure the allow-list in the IdP's application policy, not here.
Authentik setup: create a new OAuth2/OpenID Provider and Application, set the redirect URI to the redirect_url above (Authentik allows every subpath of it by default), and point Coulisse at the issuer URL of the provider. Add the application to the groups that should have access via Authentik bindings.
Sessions are in-memory: they evaporate on restart — users re-authenticate silently if their IdP session is still valid, otherwise they see the login page again.
Leaving it open
Omit the auth.admin block to leave the admin surface unauthenticated. That's fine on a loopback-only dev box, but never expose an unauthenticated admin surface to the network. If you'd rather terminate auth at your infrastructure layer, put Coulisse behind a reverse proxy (oauth2-proxy, Cloudflare Access, Caddy's forward_auth), a VPN, or an SSH tunnel.
How it's built
The studio is composed in the cli binary. Each feature crate (memory, telemetry, judges, experiments) owns its own admin module — its routes, its askama templates, and its view models. Cli wires them together: a single base.html shell, the auth wrapping, and a tower middleware that wraps non-htmx responses in the layout so bookmarked deep URLs render with full navigation.
Cross-feature views (e.g. tool-call panels inside a conversation page) are filled in via htmx fragments — the conversation page, owned by memory, embeds hx-get requests against telemetry and judges. No feature crate depends on another for its admin surface; the browser orchestrates the composition. Tailwind (loaded via CDN) provides styling, and a small embedded app.js (served at /admin/static/app.js) highlights the active nav item and raises a toast on every save. Everything ships in the single Coulisse binary; there is no separate frontend build step.
Editing the infrastructure sections (auth, memory, storage, telemetry, plus providers and mcp) lives in the cli crate rather than in the feature crates. Those edits only need the shared ConfigPersister trait and the section's own serde shape — not the feature crate's database — so they belong at the config layer that owns coulisse.yaml, not with the runtime/data admin pages the feature crates own.
User identification
Coulisse keeps separate memory per user. To do that, it needs to know who is making each request.
How users are identified
Requests identify the user via one of these fields, in order:
- safety_identifier (preferred; matches OpenAI's recent schema)
- user (deprecated, but still accepted)
{
  "model": "assistant",
  "safety_identifier": "alice@example.com",
  "messages": [...]
}
The identifier can be anything — an email, an internal user ID, a UUID, an opaque token. Coulisse derives a stable internal UUID from it:
- If you pass a valid UUID, that's what's used.
- Otherwise, a deterministic v5 UUID is derived from the string, so the same identifier always maps to the same user.
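A sketch of that mapping with Python's standard `uuid` module. The document doesn't say which namespace Coulisse hashes identifiers into, so `COULISSE_NAMESPACE` below is a hypothetical placeholder; real IDs will differ, but the property shown (same string, same UUID) holds for any fixed namespace.

```python
import uuid

# Hypothetical namespace: the actual namespace Coulisse uses is not documented here.
COULISSE_NAMESPACE = uuid.NAMESPACE_URL


def user_uuid(identifier: str) -> uuid.UUID:
    """Map a request identifier to a stable internal UUID."""
    try:
        return uuid.UUID(identifier)  # already a valid UUID: use it as-is
    except ValueError:
        # Anything else: name-based v5 UUID (SHA-1), deterministic per string.
        return uuid.uuid5(COULISSE_NAMESPACE, identifier)
```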
Requiring identification
By default, Coulisse requires every request to carry an identifier. Unidentified requests are rejected with an error. This is the safe default: memory only works if you know who you're talking to.
default_user_id: a single-user fallback
For local development or single-user deployments, you can declare a default_user_id in coulisse.yaml. When a request arrives without safety_identifier or user, Coulisse acts as if that default had been passed.
default_user_id: main   # everyone's anonymous requests bucket here
providers:
  anthropic:
    api_key: sk-ant-...
agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
With a default_user_id set:
- Requests that omit both safety_identifier and user fall back to the default. They get memory like any other user, just scoped to that shared bucket.
- Requests that do include an identifier still get their own scope.
- All anonymous requests share one memory bucket and one rate-limit counter, because they all map to the same id.
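The resolution order can be sketched as follows, assuming an empty string counts as "missing" (an assumption) and using hypothetical names. Without a `default_user_id`, an unidentified request is an error, matching the safe default above.

```python
from __future__ import annotations


class MissingIdentifier(Exception):
    """No identifier on the request and no default_user_id configured."""


def resolve_identifier(body: dict, default_user_id: str | None) -> str:
    """Pick the user identifier for a chat request (hypothetical helper)."""
    for key in ("safety_identifier", "user"):  # documented precedence
        value = body.get(key)
        if value:
            return value
    if default_user_id is not None:
        return default_user_id  # every anonymous request lands in this one bucket
    raise MissingIdentifier("request must carry safety_identifier or user")
```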
When to set it
Good reasons:
- Local / single-user setups where you don't want to bother sending an identifier.
- Small deployments behind an auth layer that handles identity upstream but doesn't want to plumb it through.
Don't set default_user_id in multi-tenant deployments — every user would share one bucket, which defeats isolation. Leave it unset so missing identifiers are rejected.
Trust model
Everything keyed by user — conversation history, long-term memory, semantic recall, per-user MCP OAuth sessions, and rate-limit counters — is partitioned by the identifier on the request. Those partitions are airtight: a query never crosses users, and one user's handle can't reach another user's data.
But understand where the identifier comes from. By default it is asserted by the client in the request body (safety_identifier). In that mode the auth layer gates access to the proxy but does not bind the authenticated principal to the identifier, so any caller who can reach /v1/chat/completions can claim any identifier:
{ "model": "assistant", "safety_identifier": "someone-else", "messages": [...] }
This is the right default for two common shapes, and unsafe for a third:
- Single-user / local. One identity, nothing to spoof.
- Trusted first-party backend. A backend that authenticates its own users and sets safety_identifier honestly on their behalf gets full isolation. The identifier-setting boundary lives on a server you control.
- Untrusted clients calling directly. If end users hold credentials and call Coulisse themselves, each able to send arbitrary JSON, any of them can read or write another user's memory and drive any MCP server that user has authorized, simply by claiming their identifier. Body-asserted identity does not isolate these clients.
Binding identity to the credential
For the third shape, set auth.proxy.identity: from_credential. Coulisse then ignores the body's safety_identifier and derives the user from the authenticated principal — the Basic username or the OIDC sub claim. A request that claims a different identifier is rejected with 403; the front desk now checks ID against the credential.
auth:
  proxy:
    oidc:
      issuer_url: https://authentik.example.com/application/o/coulisse/
      client_id: coulisse-proxy
      client_secret: <secret>
      redirect_url: http://localhost:8421/v1/
    identity: from_credential
Two rules the server enforces at startup:
- from_credential requires auth.proxy to configure basic or oidc: you can't bind to a credential that isn't checked.
- It is mutually exclusive with default_user_id. A shared default bucket would be a silent bypass, so the combination is rejected rather than letting one quietly win.
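Both halves of credential binding, the startup checks and the per-request 403, fit in a short sketch. The config is modeled as a plain dict mirroring the YAML above; the exception names, error strings, and the choice to also check the deprecated user field are assumptions for illustration.

```python
from __future__ import annotations


class ConfigError(Exception):
    """Configuration rejected at startup."""


class Forbidden(Exception):
    """Maps to an HTTP 403 response."""


def validate_proxy_auth(proxy: dict, default_user_id: str | None) -> None:
    """Enforce the two startup rules for identity: from_credential."""
    if proxy.get("identity") != "from_credential":
        return
    if "basic" not in proxy and "oidc" not in proxy:
        raise ConfigError("identity: from_credential requires auth.proxy.basic or auth.proxy.oidc")
    if default_user_id is not None:
        raise ConfigError("identity: from_credential is mutually exclusive with default_user_id")


def request_identity(principal: str, body: dict) -> str:
    """Run the request as the authenticated principal; reject conflicting claims."""
    for key in ("safety_identifier", "user"):
        claimed = body.get(key)
        if claimed is not None and claimed != principal:
            raise Forbidden(f"{key} {claimed!r} does not match the authenticated principal")
    return principal
```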
With Basic, the username is the identity, so each distinct user needs distinct credentials — a single shared username collapses everyone into one bucket. OIDC is the natural fit for many users: each gets a distinct sub automatically. See the auth.proxy.identity reference for the field details.
API tokens
Coulisse can issue its own API keys — the same model as the OpenAI dashboard. It mints sk-coulisse-… bearer tokens, stores only their hash, gates the /v1/* proxy on them, tracks how much each token spends, and lets you cap that spend or revoke the token at any time.
This is the recommended way to expose Coulisse beyond loopback: hand each client (a teammate, a script, a deployed app) its own token instead of a shared password, and you get per-token attribution and control for free.
Enabling
Turn the scheme on under the proxy auth scope:
auth:
  proxy:
    tokens: {}   # the empty map is the switch
With this set, every /v1/* request must carry a valid token:
Authorization: Bearer sk-coulisse-…
A missing, unknown, or revoked token gets 401. Point any OpenAI SDK at Coulisse and pass the token as the API key; nothing else changes.
Until auth.proxy.tokens is set, the proxy stays open and any tokens you mint are inert (the studio notes this). The Tokens studio page and the coulisse token CLI are always available, so you can pre-mint before flipping the switch.
Identity binding
Every token binds to a principal — the user id that partitions memory, recall, and rate limits. Token auth therefore always implies credential-bound identity: the request runs as the token's principal, and a request body claiming a different safety_identifier is rejected with 403. Because the identity comes from the token, default_user_id is meaningless here and combining the two is rejected at startup.
Issue multiple tokens with the same principal to give one user several keys (laptop, CI, phone) that share one memory bucket. Issue distinct principals to keep clients fully isolated.
Budgets
Each token carries a spend budget, checked before every call — a request that would exceed it is rejected with 429 insufficient_quota (matching OpenAI's quota response) and no provider call is made:
| Budget | Behaviour |
|---|---|
| unlimited | Never blocks. Spend is still tracked for monitoring. |
| total | Lifetime cap. Blocks once cumulative spend reaches the limit. |
| monthly | Per-calendar-month cap (UTC). Resets on the first of each month. |
Spend is computed from the same pricing table the cost tracker uses, summed per token in USD. Both streaming and non-streaming turns are charged.
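A sketch of the pre-call check from the table, assuming charges are kept as `(timestamp, integer micro-USD)` pairs like the token_usage table described under "How it's stored", and that a request is blocked once spend in the relevant period has reached the limit. `Budget` and `is_blocked` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone

MICRO_USD_PER_USD = 1_000_000  # spend is summed in integer micro-USD


@dataclass
class Budget:
    kind: str                          # "unlimited", "total", or "monthly"
    limit_micro_usd: int | None = None


def month_start(now: datetime) -> datetime:
    """First instant of the current calendar month in UTC."""
    utc = now.astimezone(timezone.utc)
    return utc.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def is_blocked(charges: list[tuple[datetime, int]], budget: Budget, now: datetime) -> bool:
    """True if the next request should get 429 insufficient_quota."""
    if budget.kind == "unlimited":
        return False  # never blocks; spend is still recorded for monitoring
    if budget.limit_micro_usd is None:
        raise ValueError(f"{budget.kind} budget needs a limit")
    if budget.kind == "monthly":
        start = month_start(now)
        spent = sum(cost for at, cost in charges if at >= start)
    else:  # "total": lifetime spend
        spent = sum(cost for _, cost in charges)
    return spent >= budget.limit_micro_usd
```

Storing integer micro-USD keeps the sums exact, which matters when a cap is compared against thousands of tiny per-turn charges.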
Managing tokens
Studio
The Tokens page (under Configure in the studio nav) lists every token with its principal, budget, current-period spend, and lifetime spend. Use the form to mint a new one — the secret is shown once, immediately after creation, and never again. Each active token has a Revoke button.
CLI
# Mint an unlimited token
$ coulisse token create laptop --principal alice
created token 4f3c… for alice (unlimited)
sk-coulisse-9bQ… # the secret, on stdout only
# Mint with a $20/month cap
$ coulisse token create ci --principal alice --budget monthly --limit 20
# List tokens with spend
$ coulisse token list
4f3c… active laptop unlimited spent $1.27 [alice]
a91d… active ci $20.00 / month spent $0.04 [alice]
# Revoke
$ coulisse token revoke 4f3c…
revoked 4f3c…
The secret prints to stdout and the context to stderr, so coulisse token create … > key.txt captures only the key. The CLI talks to the same SQLite database the server uses (WAL mode), so tokens minted while the server is running are live immediately.
How it's stored
The auth crate owns two tables in the shared database: api_tokens (the hashed secret, label, principal, budget, and timestamps) and token_usage (one row per charged turn, in integer micro-USD). The plaintext secret exists only in the response to the mint call — Coulisse keeps a SHA-256 digest and nothing more, so a database leak never exposes a usable key.
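A minimal sketch of the store-only-the-hash pattern with the standard library. The sk-coulisse- prefix and SHA-256 come from this page; the 32-byte `token_urlsafe` body, hex encoding, and helper names are assumptions, not Coulisse's exact format.

```python
import hashlib
import hmac
import secrets

PREFIX = "sk-coulisse-"


def mint_token() -> tuple[str, str]:
    """Return (secret, digest): show the secret once, persist only the digest."""
    secret = PREFIX + secrets.token_urlsafe(32)  # assumed length/encoding
    digest = hashlib.sha256(secret.encode("utf-8")).hexdigest()
    return secret, digest


def verify(presented: str, stored_digest: str) -> bool:
    """Hash the presented bearer token and compare in constant time."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)
```

Because the secret is high-entropy random data, a fast unsalted hash is enough here; slow password hashes exist to protect low-entropy human passwords.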
Rate limiting
Coulisse enforces per-user token limits across three fixed windows: hour, day, and month. Limits are set by the client, per request, not in the YAML, so callers can plug Coulisse into existing quota schemes without redeploying the server.
How it works
- Each request carries optional limit hints in its metadata field: tokens_per_hour, tokens_per_day, tokens_per_month.
- Before the model is called, Coulisse looks up the user's current usage in each window. If any counter is already at its cap, the request is rejected with 429 Too Many Requests.
- If the request passes, Coulisse runs it. On success, the total tokens consumed (request + response) are added to the user's counters.
- Counters reset on fixed boundaries: every hour, every 24 hours, every 30 days (aligned to UTC windows from the Unix epoch).
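Because the windows are aligned to the Unix epoch, the current window and the Retry-After value are plain modular arithmetic. A sketch, assuming a "month" is exactly 30 days as stated above; the function names and the per-window `used` dict are hypothetical.

```python
from __future__ import annotations

# Window sizes as documented: hour, 24 hours, 30 days, aligned to the Unix epoch.
WINDOW_SECONDS = {"hourly": 3_600, "daily": 86_400, "monthly": 30 * 86_400}


def window_bounds(now_unix: int, size: int) -> tuple[int, int]:
    """Start and end (exclusive) of the fixed window containing now_unix."""
    start = now_unix - now_unix % size
    return start, start + size


def retry_after(now_unix: int, size: int) -> int:
    """Seconds until the window containing now_unix resets."""
    return window_bounds(now_unix, size)[1] - now_unix


def check_limits(used: dict[str, int], caps: dict[str, int], now_unix: int) -> int | None:
    """Return Retry-After seconds for the first window at its cap, or None to admit.

    `used` holds the user's token count in the *current* window for each name.
    """
    for window, cap in caps.items():
        if used.get(window, 0) >= cap:
            return retry_after(now_unix, WINDOW_SECONDS[window])
    return None
```

For example, at Unix time 1,700,000,000 the daily window started 80,000 seconds earlier, so a blocked daily request gets Retry-After: 6400.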
Sending limits
Put the caps in the metadata object. Values are strings (OpenAI's metadata contract), parsed as non-negative integers:
{
  "model": "assistant",
  "safety_identifier": "alice@example.com",
  "metadata": {
    "tokens_per_hour": "50000",
    "tokens_per_day": "500000",
    "tokens_per_month": "5000000"
  },
  "messages": [
    {"role": "user", "content": "Hi!"}
  ]
}
All three keys are independent and all are optional — send only the windows you care about. Omit the whole metadata object and the request is unlimited.
When a limit is hit
The server responds with:
- Status: 429 Too Many Requests
- Header: Retry-After: <seconds>, the time until the offending window resets
- Body:
{
  "error": {
    "type": "rate_limited",
    "message": "daily token limit exceeded: used 512000/500000, retry after 40213s"
  }
}
The message names which window tripped (hourly, daily, monthly), how many tokens were used, the cap, and the seconds to wait.
Invalid metadata
If a metadata value isn't a valid non-negative integer, the server returns 400 Bad Request:
{
  "error": {
    "type": "invalid_request",
    "message": "metadata key 'tokens_per_hour' must be a non-negative integer, got 'abc'"
  }
}
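A sketch of the metadata parsing, reproducing the documented 400 message. It assumes "non-negative integer" means ASCII digits only (so "-5", "1.5", and "+5" are rejected) and that unrelated metadata keys are ignored; the names are hypothetical.

```python
from __future__ import annotations

import re

LIMIT_KEYS = ("tokens_per_hour", "tokens_per_day", "tokens_per_month")
_DIGITS = re.compile(r"[0-9]+")


class InvalidRequest(Exception):
    """Maps to 400 Bad Request with type invalid_request."""


def parse_limits(metadata: dict[str, str] | None) -> dict[str, int]:
    """Extract per-window caps; an omitted key means that window is unlimited."""
    limits: dict[str, int] = {}
    for key in LIMIT_KEYS:
        if metadata is None or key not in metadata:
            continue
        raw = metadata[key]
        if not isinstance(raw, str) or not _DIGITS.fullmatch(raw):
            raise InvalidRequest(f"metadata key '{key}' must be a non-negative integer, got '{raw}'")
        limits[key] = int(raw)
    return limits
```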
Scope and isolation
- Per user. Each user (keyed by safety_identifier or the fallback user field) has isolated counters.
- Anonymous requests can't be rate-limited on their own: Coulisse needs an identifier to key the counters. In setups with a default_user_id (see User identification), all anonymous requests share that user's counter.
- Per process. Counters live in memory. If you run multiple Coulisse instances behind a load balancer, each has its own view; for shared quotas, limit upstream (in a gateway) instead.
- Lost on restart. Counters are not persisted. This is deliberate for now; durable accounting is on the roadmap.
Why per-request limits instead of YAML?
Quotas usually live in your user/billing system, not your model-routing config. Putting limits in the request lets the caller decide — e.g. your app looks up the user's plan, fills in the numbers, and forwards the request. Coulisse just honors what you send.
Per-user OAuth for MCP servers
Coulisse can authenticate each end-user independently with third-party MCP servers (Todoist, Atlassian, GitHub, Google, and others) using OAuth 2.0/2.1. When an agent calls a tool on an OAuth-enabled MCP server, Coulisse automatically uses the credentials that the requesting user has authorized.
⚠️ Trust boundary: Coulisse trusts the user_id passed in the chat request's safety_identifier field the same way Stripe trusts a customer_id: it assumes the caller is your authenticated backend, not an end-user directly. If you expose Coulisse's /v1/ endpoint directly to untrusted clients without an auth proxy, any client can claim any user_id and access another user's connected accounts. Always place an identity-binding layer (your own backend, a gateway, or Coulisse's auth.proxy scope with identity: from_credential or API tokens) between Coulisse and untrusted callers before deploying with OAuth-enabled MCP servers.
Just point at the URL
For a spec-compliant MCP server, you write nothing about OAuth at all. A remote MCP is declared with nothing but its url field:
mcp:
  todoist:
    url: https://ai.todoist.net/mcp
URL-based servers get per-user OAuth discovery + Dynamic Client Registration
automatically. Tokens land in Coulisse's per-user vault, keyed by
(server, user_id) — no Node process, no shared on-disk cache, no
browser-callback port. The transport is inferred from the path (/sse in the
path → SSE, otherwise streamable HTTP); force it with an explicit
transport: http|sse when the path doesn't make it obvious.
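As a rough illustration, the transport rule above could look like the Python sketch below. infer_transport is a hypothetical name, and treating "/sse in the path" as a plain substring test is an assumption; Coulisse's actual (Rust) logic may differ.

```python
from typing import Optional
from urllib.parse import urlparse


def infer_transport(url: str, explicit: Optional[str] = None) -> str:
    """Pick "sse" or "http" for a url: MCP entry (hypothetical helper)."""
    if explicit is not None:
        # An explicit `transport: http|sse` always overrides inference.
        if explicit not in ("http", "sse"):
            raise ValueError(f"unknown transport: {explicit!r}")
        return explicit
    # Assumption: "/sse in the path" is a plain substring test on the URL path.
    path = urlparse(url).path
    return "sse" if "/sse" in path else "http"


print(infer_transport("https://ai.todoist.net/mcp"))  # http
```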
If you need to tune the flow, the oauth: block has three uses:
- Disable auth on a public, no-auth HTTP MCP: oauth: false.
- Set scopes while keeping automatic discovery: oauth: { scopes: [a, b] } (mode defaults to discover).
- Static credentials for a provider without Dynamic Client Registration: oauth: { mode: static, ... } (see below).
Servers that only honour mcp-remote's client id
A few providers (Todoist today) haven't opened registration and only accept
the grandfathered client id baked into the mcp-remote CLI. For those, declare
the stdio command form yourself and let mcp-remote carry the token:
mcp:
todoist:
command: npx
args: [-y, mcp-remote, https://ai.todoist.net/mcp]
This runs mcp-remote as a stdio child the normal way; Coulisse doesn't
rewrite or inspect it. Use the plain url: form for any server that supports
Dynamic Client Registration.
Two flavours
oauth: blocks come in two modes, picked with the mode: discriminator
(which defaults to discover, so you only write mode: to select static):
- mode: discover (the default) uses MCP-spec OAuth 2.1 with discovery + Dynamic Client Registration. Coulisse discovers the provider's authorization-server metadata (the two-step walk through oauth-protected-resource and oauth-authorization-server is described under How it works) and registers itself as a client on first use. No credentials in YAML. This is the right choice for modern MCP servers such as Todoist, Atlassian (mcp.atlassian.com), and Linear, and it is what a bare url: uses.
- mode: static uses classic OAuth 2.0 with pre-registered app credentials. You register Coulisse as a client in the provider's developer console and paste the resulting client_id / client_secret here. Use this for providers that don't support Dynamic Client Registration.
Both modes drive the same per-user token flow: tokens are stored in the vault
keyed by (server_name, user_id), never shared across users.
How it works
- Tool call hits NotConnected: the user makes a chat request, the agent calls a tool on the MCP server, Coulisse looks up (server, user_id) in the vault, finds no token, and returns a NotConnectedTool placeholder whose tool result contains a per-user, single-use connect URL signed with the HMAC key. The LLM reads that result and relays the URL to the user.
  For agents that haven't pinned an only: list (the common case: "give the agent every tool the server exposes"), Coulisse can't know the real tool schemas until someone has authorised at least once. Until then it surfaces a single sentinel tool named connect_<server> whose description tells the LLM to call it when the user asks to use that server. Calling it returns the same per-user connect URL. Once the user authorises, the sentinel goes away and the real tool list takes its place transparently.
- User clicks the link: the browser lands on GET /mcp/{server}/connect?token=… on Coulisse. Coulisse validates the HMAC, then, for discover mode only, lazily runs discovery + Dynamic Client Registration if it hasn't yet (the result is cached in mcp_oauth_clients afterwards). Discovery is a two-step walk: first <mcp_origin>/.well-known/oauth-protected-resource (RFC 9728) to find which issuer hosts the authorization server (Todoist's MCP lives on ai.todoist.net, its auth server lives on todoist.com), then <issuer>/.well-known/oauth-authorization-server (RFC 8414) for the actual endpoints. Coulisse then 302s to the provider's authorization_endpoint.
- User authorizes: signs into their own account at the provider, sees a consent screen, and the provider redirects back to Coulisse's callback.
- Token stored: Coulisse exchanges the code for tokens and stores them encrypted in mcp_oauth_tokens under the user's id.
- Subsequent tool calls succeed: the next chat turn on the same user_id spawns a real per-user MCP session backed by the stored token.
Every user authorizes independently. Alice's token is never usable by Bob — they have separate vault rows, separate MCP sessions, and separate consent flows.
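The two discovery URLs from the "User clicks the link" step can be derived as in this Python sketch. It only builds URLs and picks an issuer from already-fetched metadata (no network calls). The function names are hypothetical, taking the first entry of authorization_servers is an assumption, and placing the well-known segment before an issuer path follows RFC 8414 (it only matters for issuers that have a path component).

```python
from urllib.parse import urlparse


def origin(url: str) -> str:
    """scheme://host[:port] of a URL."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"


def protected_resource_metadata_url(mcp_url: str) -> str:
    # Step 1 (RFC 9728): ask the MCP origin which issuer hosts its auth server.
    return origin(mcp_url) + "/.well-known/oauth-protected-resource"


def pick_issuer(resource_metadata: dict) -> str:
    # RFC 9728 lists issuers under "authorization_servers"; taking the first is an assumption.
    servers = resource_metadata.get("authorization_servers") or []
    if not servers:
        raise ValueError("protected-resource metadata lists no authorization_servers")
    return servers[0]


def authorization_server_metadata_url(issuer: str) -> str:
    # Step 2 (RFC 8414): the well-known segment goes between the host and any issuer path.
    parts = urlparse(issuer)
    path = parts.path.rstrip("/")
    return f"{parts.scheme}://{parts.netloc}/.well-known/oauth-authorization-server{path}"
```

Fed a hypothetical metadata document such as {"authorization_servers": ["https://todoist.com"]}, the second URL becomes https://todoist.com/.well-known/oauth-authorization-server.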
YAML configuration
Discover mode (recommended for spec-compliant servers)
public_base_url: http://localhost:8421 # see "Public base URL" below
mcp:
todoist:
url: https://ai.todoist.net/mcp
# oauth is implied; add a block only to override scopes:
# oauth: { scopes: [data:read_write] }
auth:
mcp_consumer_secret: "${COULISSE_MCP_SECRET}"
Nothing else to fill in — a bare url: already implies discover-mode OAuth.
Coulisse handles discovery and DCR on first use.
Static mode (for non-DCR providers)
mcp:
jira:
url: https://mcp.atlassian.example.com
oauth:
mode: static
authorization_url: https://auth.atlassian.com/authorize
client_id: "${JIRA_CLIENT_ID}"
client_secret: "${JIRA_CLIENT_SECRET}"
redirect_uri: https://coulisse.example.com/mcp/jira/oauth/callback
scopes:
- read:jira-work
- write:jira-work
token_url: https://auth.atlassian.com/oauth/token
auth:
mcp_consumer_secret: "${COULISSE_MCP_SECRET}"
oauth: block fields
| Field | Mode | Description |
|---|---|---|
| mode | both | discover or static |
| scopes | both | OAuth scopes to request (optional; discover falls back to scopes_supported) |
| authorization_url | static | Provider's OAuth authorize endpoint |
| client_id | static | OAuth application client ID |
| client_secret | static | OAuth application client secret; ${ENV} expansion supported |
| redirect_uri | static | Must match what you registered with the provider |
| token_url | static | Provider's token exchange endpoint |
For discover mode, the redirect_uri is computed automatically from
public_base_url as {public_base_url}/mcp/{server}/oauth/callback. The
authorization, token, and registration endpoints all come from discovery.
Public base URL
Coulisse needs to know its own externally reachable URL to build OAuth redirect URIs and the per-user connect links surfaced to LLMs:
public_base_url: https://coulisse.example.com # no trailing slash
If omitted, defaults to http://localhost:{port}, which is right for personal
and local-dev setups. Set it explicitly when Coulisse runs behind a tunnel,
reverse proxy, or on a public hostname — the same value must match whatever the
OAuth provider sees as the redirect URI host.
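A sketch of how the callback and connect URLs follow from public_base_url and the documented default. The helper names are hypothetical, and stripping a stray trailing slash is a defensive assumption (the docs simply ask you not to add one).

```python
from typing import Optional
from urllib.parse import quote


def base_url(public_base_url: Optional[str], port: int) -> str:
    # Documented default: http://localhost:{port}. Dropping a trailing slash is defensive.
    return (public_base_url or f"http://localhost:{port}").rstrip("/")


def redirect_uri(public_base_url: Optional[str], port: int, server: str) -> str:
    # discover mode: {public_base_url}/mcp/{server}/oauth/callback
    return f"{base_url(public_base_url, port)}/mcp/{server}/oauth/callback"


def connect_url(public_base_url: Optional[str], port: int, server: str, token: str) -> str:
    # The per-user link surfaced to the LLM; the token is percent-encoded to be safe.
    return f"{base_url(public_base_url, port)}/mcp/{server}/connect?token={quote(token, safe='')}"
```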
Secrets (zero config by default)
Coulisse needs two long-lived 32-byte secrets when an OAuth-enabled MCP server is configured:
- vault key: encrypts stored tokens (and any cached DCR client_secret) at rest with AES-256-GCM
- HMAC key: signs the per-user connect links Coulisse mints for the LLM, plus the OAuth state token
You don't have to manage these for local use. On first boot Coulisse
generates both and writes them to .coulisse/secrets.env (mode 0600,
already .gitignored), then reuses the file on every subsequent start.
Back this file up. Losing it invalidates every token in
mcp_oauth_tokens — users have to re-authorize each connected MCP server.
For deployments that source secrets from a vault/k8s/CI, set them as environment variables and Coulisse will use those instead of touching the on-disk file:
| Variable | Purpose |
|---|---|
| COULISSE_VAULT_KEY | 32 bytes, base64-encoded. Overrides the on-disk vault key. |
| COULISSE_HMAC_KEY | 32 bytes, base64-encoded. Overrides the on-disk HMAC key. |
Both are optional. Resolution order: env vars > .coulisse/secrets.env > generated on the fly.
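If you source the keys from your own secret store, any 32 random bytes, base64-encoded, will do. The Python sketch below generates such a value and mirrors the documented resolution order. resolve_key is hypothetical, it assumes secrets.env uses the same variable names as the environment, and unlike Coulisse it does not write a generated key back to disk.

```python
import base64
import secrets
from typing import Mapping


def generate_key_b64() -> str:
    """32 random bytes, base64-encoded (the COULISSE_*_KEY format)."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")


def resolve_key(name: str, env: Mapping[str, str], secrets_file: Mapping[str, str]) -> bytes:
    # Documented order: env vars > .coulisse/secrets.env > generated on the fly.
    raw = env.get(name) or secrets_file.get(name) or generate_key_b64()
    key = base64.b64decode(raw, validate=True)
    if len(key) != 32:
        raise ValueError(f"{name} must decode to 32 bytes, got {len(key)}")
    return key


print(generate_key_b64())  # paste into COULISSE_VAULT_KEY or COULISSE_HMAC_KEY
```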
One additional optional secret gates the admin endpoint only:
| Variable | Purpose |
|---|---|
| COULISSE_MCP_SECRET (via auth.mcp_consumer_secret) | Arbitrary string. When set, gates POST /mcp/{server}/connect-link. When unset, that endpoint returns 503 and the per-user GET /connect flow keeps working. |
Endpoints
Coulisse exposes three OAuth-related HTTP routes:
GET /mcp/{server}/connect
The user-facing route. The URL Coulisse mints inside NotConnectedTool looks
like this and is what the LLM hands the user:
{public_base_url}/mcp/{server}/connect?token={hmac_signed_token}
The token is HMAC-signed with COULISSE_HMAC_KEY and embeds the user_id
plus a 10-minute expiry. The handler:
- Validates the HMAC and expiry.
- For discover mode: ensures the server is registered (lazily runs discovery + DCR on the first hit; reuses the cached mcp_oauth_clients row on subsequent hits).
- 302-redirects to the provider's authorization endpoint with a fresh state token carrying the same user_id.
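To make the token mechanics concrete, here is a minimal Python sketch of an HMAC-SHA256 signed token carrying a user_id and a 10-minute expiry. The layout (base64url JSON payload, a dot, then the signature) is an assumption rather than Coulisse's wire format, and the single-use property mentioned earlier would need server-side state that this sketch does not model.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

TTL_SECONDS = 10 * 60  # documented 10-minute expiry


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_token(key: bytes, user_id: str, now: Optional[float] = None) -> str:
    now = time.time() if now is None else now
    payload = json.dumps({"user_id": user_id, "exp": int(now) + TTL_SECONDS}).encode()
    sig = hmac.new(key, payload, hashlib.sha256).digest()
    return f"{_b64url(payload)}.{_b64url(sig)}"


def verify_token(key: bytes, token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the embedded user_id, or None if the token is malformed, tampered, or expired."""
    now = time.time() if now is None else now
    try:
        payload_part, sig_part = token.split(".")
        payload, sig = _b64url_decode(payload_part), _b64url_decode(sig_part)
    except ValueError:  # wrong shape or bad base64 (binascii.Error is a ValueError)
        return None
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        return None
    claims = json.loads(payload)
    return None if now >= claims["exp"] else claims["user_id"]
```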
POST /mcp/{server}/connect-link
Admin-facing alternative. Bearer-authed with COULISSE_MCP_SECRET. Useful when
your backend wants to email a user a connect link without going through the
LLM's tool result:
POST /mcp/{server}/connect-link?user_id=<user_id>
Authorization: Bearer <mcp_consumer_secret>
Response 200:
{ "url": "https://...provider.../authorize?client_id=...&state=<signed_token>" }
Hand this URL to your end-user. Valid for 10 minutes.
Error codes:
| Code | Reason |
|---|---|
| 401 | Wrong or missing consumer secret |
| 404 | Server name not found in config |
| 422 | user_id query parameter missing, or server exists but has no oauth: block |
| 502 | Discovery or DCR failed (discover mode only — check Coulisse logs) |
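From your backend, the admin route is an ordinary authenticated POST. The sketch below builds the request with Python's standard library and maps the documented error codes; the helper names and the 10-second timeout are illustrative choices, and base_url stands for whatever your public_base_url is.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

ERRORS = {
    401: "wrong or missing consumer secret",
    404: "server name not found in config",
    422: "user_id missing, or the server has no oauth: block",
    502: "discovery or DCR failed (check Coulisse logs)",
}


def connect_link_request(base_url: str, server: str, user_id: str, secret: str) -> urllib.request.Request:
    query = urllib.parse.urlencode({"user_id": user_id})
    url = f"{base_url.rstrip('/')}/mcp/{urllib.parse.quote(server, safe='')}/connect-link?{query}"
    # Empty body: the route takes user_id from the query string and auth from the header.
    return urllib.request.Request(url, data=b"", method="POST",
                                  headers={"Authorization": f"Bearer {secret}"})


def fetch_connect_link(req: urllib.request.Request) -> str:
    """Send the request and return the provider authorize URL (valid for 10 minutes)."""
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.load(resp)["url"]
    except urllib.error.HTTPError as err:
        reason = ERRORS.get(err.code, f"unexpected status {err.code}")
        raise RuntimeError(f"connect-link failed ({err.code}): {reason}") from err
```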
GET /mcp/{server}/oauth/callback
The provider's redirect target. Coulisse validates the state HMAC, exchanges the authorization code for tokens, stores them encrypted in SQLite, and shows an HTML success page to the user.
A tampered or expired state returns HTTP 400.
Token + client storage
Two tables in the shared SQLite database, both maintained by the mcp
crate's schema migrator:
- mcp_oauth_tokens: encrypted per-user tokens keyed by (server_name, user_id). AES-256-GCM with the nonce prepended. Connecting again overwrites the previous token.
- mcp_oauth_clients: cached Dynamic Client Registration for discover mode servers. One row per server. The client_secret is encrypted when present; the metadata_json document is stored plaintext (the provider's authorization-server metadata isn't a secret). Coulisse-wide, not per-user: the client_id identifies the Coulisse instance, not the end user.
Per-user session lifecycle
stdio transport: Each (user_id, server_name) gets its own spawned process
on first use, held in an LRU cache (cap: 256 by default, idle timeout: 30
minutes). The access token is passed as the MCP_OAUTH_TOKEN environment
variable. (Most spec-compliant MCP servers use HTTP transport — the stdio path
is for servers you launch via an explicit command:, such as a self-declared
npx mcp-remote <url> shim.)
HTTP transport: A per-user connection is established with
Authorization: Bearer <token> as a default header. Same LRU cache applies.
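The cache policy can be sketched in Python as an LRU keyed by (user_id, server_name) that also evicts sessions idle past a timeout. SessionCache, spawn and close are hypothetical; in a real stdio setup, spawn would launch the child process with MCP_OAUTH_TOKEN in its environment, which is stubbed out here.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Tuple


class SessionCache:
    """LRU of per-user MCP sessions keyed by (user_id, server_name), with idle eviction."""

    def __init__(self, spawn: Callable[[str, str], Any], close: Callable[[Any], None],
                 capacity: int = 256, idle_timeout: float = 30 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self._spawn, self._close = spawn, close
        self._capacity, self._idle, self._clock = capacity, idle_timeout, clock
        # key -> (session, last_used); iteration order runs least to most recently used.
        self._entries: "OrderedDict[Tuple[str, str], Tuple[Any, float]]" = OrderedDict()

    def get(self, user_id: str, server: str) -> Any:
        now = self._clock()
        self._evict_idle(now)
        key = (user_id, server)
        if key in self._entries:
            session, _ = self._entries.pop(key)  # re-inserted below as most recent
        else:
            session = self._spawn(user_id, server)
            while len(self._entries) >= self._capacity:
                _, (oldest, _) = self._entries.popitem(last=False)
                self._close(oldest)
        self._entries[key] = (session, now)
        return session

    def _evict_idle(self, now: float) -> None:
        idle = [k for k, (_, last) in self._entries.items() if now - last >= self._idle]
        for key in idle:
            session, _ = self._entries.pop(key)
            self._close(session)

    def __len__(self) -> int:
        return len(self._entries)
```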
When a user hasn't connected yet
If an agent calls a tool on an OAuth-enabled MCP server and the calling user has no stored token (or the token is expired), the tool returns a placeholder result containing the connect URL. The LLM reads it and relays it to the user:
Not connected: the user has not authorized access to the 'todoist' MCP server. Show them this link and ask them to open it to link their account — the link is single-use and tied to their identity, do not share it with anyone else:
http://localhost:8421/mcp/todoist/connect?token=…
This is a tool result, not a 500 error. The user clicks the link, authorizes, and the next chat turn just works. No backend intervention required for the common case.
Per-user memory
Every request gets an isolated, persistent memory scope based on its user identity. In users: per-request mode, that identity comes from safety_identifier (or the deprecated user field) on each request; in the default users: shared mode, every request shares one hardcoded identity (and one memory bucket). See User identification. Coulisse tracks two kinds of memory:
- Conversation history — the running transcript of messages the user has exchanged. Always on.
- Long-term user state — durable facts and preferences, embedded for semantic recall. Off by default; opt in with
user_state: true.
You don't manage either of these by hand — both are wired into every request automatically. When user_state is on, Coulisse also decides what is worth remembering after each turn.
What happens on each request
1. Coulisse identifies the user: from safety_identifier / user in per-request mode, or from the shared identity in shared mode.
2. It pulls the user's recent messages, fitting as many as possible into the context window.
3. If long-term user state is on, it runs a semantic recall against the user's stored facts and picks the top matches.
4. It builds the final prompt: agent preamble → recalled facts (if any) → recent history → new message.
5. The model's reply is sent back and saved to the user's transcript.
6. If user_state is on, a background task asks a cheap model "any durable facts to remember from this exchange?" and stores novel ones.
Step 6 does not block the HTTP response — the user gets their answer first; long-term memory grows in the background.
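Steps 2 to 4 can be sketched as a single prompt-building function. Two assumptions are labeled in the code: tokens are approximated by word count (a real implementation would use the model's tokenizer), and recalled facts are rendered as a bulleted list under the Known about the user: title. build_prompt is a hypothetical name.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def word_count(text: str) -> int:
    # Assumption: a crude stand-in for the model's real tokenizer.
    return len(text.split())


def build_prompt(preamble: str, recalled_facts: List[str], history: List[Message],
                 new_message: str, budget_tokens: int,
                 count_tokens: Callable[[str], int] = word_count) -> List[Message]:
    messages: List[Message] = [{"role": "system", "content": preamble}]
    if recalled_facts:
        # Assumption: facts rendered as bullets under the documented title.
        block = "Known about the user:\n" + "\n".join(f"- {f}" for f in recalled_facts)
        messages.append({"role": "system", "content": block})
    used = sum(count_tokens(m["content"]) for m in messages) + count_tokens(new_message)
    kept: List[Message] = []
    for msg in reversed(history):  # walk newest first so recent turns win
        cost = count_tokens(msg["content"])
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    messages.extend(reversed(kept))  # restore chronological order
    messages.append({"role": "user", "content": new_message})
    return messages
```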
Isolation guarantees
User isolation is enforced by the API: Store::for_user(id) returns a handle scoped to a single user, and every SQL query bound through it filters on that user id. There is no code path that mixes data across users.
How long-term recall works
When user_state: true, Coulisse embeds each stored fact as a vector at write time. On every request, it embeds the incoming user message and retrieves the top-k most similar facts by cosine similarity. That's how context from a conversation two weeks ago can surface when it becomes relevant again.
The recalled facts are formatted as a system block titled Known about the user: and injected into the prompt before the conversation history.
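Top-k recall by cosine similarity, including the rule (described under What gets stored where) that only vectors from the current embedder are eligible, fits in a few lines of Python. The (text, vector, embedder_id) tuple layout and the k=3 default are assumptions for illustration; Coulisse keeps this data in SQLite.

```python
import math
from typing import List, Optional, Sequence, Tuple

Memory = Tuple[str, Sequence[float], str]  # (text, vector, embedder_id)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


def recall(query_vec: Sequence[float], memories: List[Memory], k: int = 3,
           embedder_id: Optional[str] = None) -> List[str]:
    # Vectors from another embedder live in a different space: skip them.
    candidates = [(text, vec) for text, vec, eid in memories
                  if embedder_id is None or eid == embedder_id]
    ranked = sorted(candidates, key=lambda tv: cosine(query_vec, tv[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def known_about_user(facts: List[str]) -> str:
    return "Known about the user:\n" + "\n".join(f"- {f}" for f in facts)
```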
Auto-extraction ("remember what matters")
When user_state: true, every completed exchange fires a background task that:
- Sends the last user turn + assistant turn to a cheap model with a focused prompt: "list any durable facts or preferences about the user; return [] if nothing worth keeping."
- Parses the JSON response.
- For each extracted fact, calls remember_if_novel, which embeds the fact and skips it if its cosine similarity against an existing memory exceeds dedup_threshold (default 0.9).
Failures (bad JSON, timeout, provider error) are logged at warn and swallowed — the user already got their response. Extraction is best-effort.
To disable, omit the user_state: field or set it to false. Conversation history is unaffected either way.
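The parse and dedup steps can be sketched as follows. It assumes the extraction model returns a JSON array of strings, takes a caller-supplied embed function (hypothetical), uses an in-memory list in place of the memories table, and reads "exceeds" as a strict > comparison against dedup_threshold.

```python
import json
import logging
import math
from typing import Callable, List, Sequence, Tuple

DEDUP_THRESHOLD = 0.9  # documented default


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


def parse_facts(raw_reply: str) -> List[str]:
    """Best-effort: anything unparseable yields no facts (and a warning)."""
    try:
        facts = json.loads(raw_reply)
    except json.JSONDecodeError:
        logging.warning("extraction reply was not valid JSON; skipping")
        return []
    if not isinstance(facts, list):
        return []
    return [f for f in facts if isinstance(f, str) and f.strip()]


def remember_if_novel(fact: str, embed: Callable[[str], Sequence[float]],
                      store: List[Tuple[str, Sequence[float]]],
                      threshold: float = DEDUP_THRESHOLD) -> bool:
    vec = embed(fact)
    if any(cosine(vec, existing) > threshold for _, existing in store):
        return False  # near-duplicate of something already remembered
    store.append((fact, vec))
    return True
```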
Embedders
| Provider | Supported models | Notes |
|---|---|---|
| openai | text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002 | Default pairing for OpenAI-first setups. |
| voyage | voyage-3.5, voyage-3-large, voyage-3.5-lite, voyage-code-3, voyage-finance-2, voyage-law-2, voyage-code-2 | Anthropic officially recommends Voyage for embeddings. Requires an explicit api_key. |
| hash | n/a | Deterministic bag-of-words, offline only. No semantic understanding; use only for tests and air-gapped development. |
When user_state: true and you don't pin an embedder explicitly, Coulisse picks one for you (see auto-derivation). Startup logs the chosen embedder.
What gets stored where
| Data | Scope | Storage |
|---|---|---|
| Conversation messages | Per user | SQLite (messages table) |
| Long-term memories + vectors | Per user | SQLite (memories table, BLOB embeddings) |
| Tool invocations | Per user | SQLite (tool_calls table, linked to messages.id) |
| Judge scores | Per user | SQLite (scores table, linked to messages.id) |
| User identifier → internal ID | Shared | Derived deterministically — no storage needed |
Each memory row carries the id of the embedder that produced it. If you swap the embedder, old vectors become ineligible for recall (they'd be scored in the wrong space). They stay in the database but are silently ignored until you re-embed them.
Storage location
The database lives at .coulisse/coulisse-memory.db — the project state directory next to your coulisse.yaml, shared with the log, PID, MCP secrets, and uploaded files. The path is not configurable; everything Coulisse persists stays under .coulisse/.
Docker
Mount the .coulisse/ directory so the database (and the rest of Coulisse's state) survives container restarts:
docker run \
-v $(pwd)/.coulisse:/app/.coulisse \
-v $(pwd)/coulisse.yaml:/app/coulisse.yaml:ro \
-p 8421:8421 \
coulisse
See memory configuration for the full YAML schema.
File attachments (OpenAI-compatible storage)
Coulisse exposes a /v1/files API that mirrors the shape of the OpenAI Files API. Any OpenAI-compatible SDK works without modification.
What this lets you do
- Upload a file once, reference it by file_id in any subsequent chat request.
- Pass multimodal content (images, PDFs, text) to an LLM backend that supports it; Coulisse stores the file and forwards it transparently.
- Set a storage quota so the disk never fills up (oldest files evicted first).
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/files | Upload a file (multipart/form-data) |
| GET | /v1/files | List all uploaded files |
| GET | /v1/files/:id | Get metadata for one file |
| GET | /v1/files/:id/content | Download file content |
| DELETE | /v1/files/:id | Delete a file (idempotent) |
Upload example
curl -X POST http://localhost:8421/v1/files \
-F "file=@cv.pdf;type=application/pdf" \
-F "purpose=assistants"
Response:
{
"id": "file-01j9abc...",
"object": "file",
"bytes": 42381,
"created_at": 1722000000,
"filename": "cv.pdf",
"purpose": "assistants",
"content_type": "application/pdf"
}
Then reference the file in a chat request:
{
"model": "gpt-4o",
"messages": [{
"role": "user",
"content": [
{ "type": "text", "text": "Summarise this CV in three bullet points." },
{ "type": "input_file", "file_id": "file-01j9abc..." }
]
}]
}
Configuration
Add a storage: block to coulisse.yaml. Everything has a default — if you omit the block, the filesystem backend is used with no quota. Filesystem blobs always live under .coulisse/files, next to your config; the path is not configurable.
storage:
backend: fs # "fs" (default) or "s3"
max_file_bytes: 52428800 # 50 MB per file — omit for no limit
max_total_bytes: 524288000 # 500 MB total — omit for no limit
Docker: mount the .coulisse/ directory to persist uploaded files (and the rest of Coulisse's state) across container restarts.
S3-compatible backend
Swap backend: s3 to store blobs in AWS S3, Cloudflare R2, or MinIO:
storage:
backend: s3
s3:
bucket: my-coulisse-files
region: eu-west-3
# endpoint_url: http://localhost:9000 # for MinIO / local S3
max_file_bytes: 52428800
Credentials are read from the standard AWS credential chain (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars, IAM role, ~/.aws/credentials, etc.).
Note: Set endpoint_url when using MinIO or another self-hosted S3-compatible service; path-style addressing is enabled automatically in that case.
Allowed file types
Coulisse validates file content via magic bytes (not just the declared Content-Type) and rejects anything outside this list:
- text/*
- image/*
- application/pdf
- application/json
- application/octet-stream
Attempting to upload an executable or other unsupported type returns 415 Unsupported Media Type.
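Magic-byte sniffing means inspecting the first bytes of the upload rather than trusting the declared Content-Type. The Python sketch below checks a handful of well-known signatures; the signature list, the blocklist of executable headers, and the UTF-8/JSON fallbacks are simplifying assumptions, not Coulisse's actual detector. A None result stands for the 415 response.

```python
import json
from typing import Optional

# (prefix, MIME type) pairs for a few well-known formats.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]
# Executable headers: ELF, DOS/PE ("MZ"), 64-bit Mach-O (little-endian).
EXECUTABLES = [b"\x7fELF", b"MZ", b"\xcf\xfa\xed\xfe"]


def sniff(data: bytes) -> Optional[str]:
    """Return an allowed MIME type, or None (-> 415 Unsupported Media Type)."""
    if any(data.startswith(magic) for magic in EXECUTABLES):
        return None
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "application/octet-stream"  # unknown binary, but not a known executable
    try:
        json.loads(text)
        return "application/json"
    except ValueError:
        return "text/plain"
```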
Storage limits and eviction
| Setting | Default | Effect |
|---|---|---|
| max_file_bytes | no limit | 413 Payload Too Large if exceeded |
| max_total_bytes | no limit | Oldest file is deleted to make room |
Eviction is FIFO: when a new upload would push the total over max_total_bytes, the oldest file (by created_at) is deleted first, then the next oldest, until there is room.
S3 caveat: quota accounting is best-effort under concurrent load — two simultaneous uploads might both pass the check and briefly exceed the limit. The next upload will evict back within bounds.
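The FIFO rule can be expressed as a small admission check that returns which files to evict, oldest created_at first. StoredFile, admit and QuotaError are hypothetical names; refusing a single file that is larger than max_total_bytes is an assumption, since the docs don't cover that case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoredFile:
    file_id: str
    size: int
    created_at: int


class QuotaError(Exception):
    """Upload can't be admitted (e.g. 413 Payload Too Large)."""


def admit(files: List[StoredFile], new_size: int,
          max_file_bytes: Optional[int] = None,
          max_total_bytes: Optional[int] = None) -> List[StoredFile]:
    """Return the files to evict, oldest first, so the new upload fits."""
    if max_file_bytes is not None and new_size > max_file_bytes:
        raise QuotaError("file exceeds max_file_bytes")
    if max_total_bytes is None:
        return []
    total = sum(f.size for f in files)
    evicted: List[StoredFile] = []
    for f in sorted(files, key=lambda s: s.created_at):
        if total + new_size <= max_total_bytes:
            break
        evicted.append(f)
        total -= f.size
    if total + new_size > max_total_bytes:
        # Assumption: a single file bigger than the whole quota is refused.
        raise QuotaError("file alone exceeds max_total_bytes")
    return evicted
```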
Deduplication
Coulisse computes a SHA-256 of each uploaded file. If you upload the same bytes twice, the second call returns the same file_id — no storage is consumed and no blob is written twice.
v1 limitation — deduplication is global, not per-user. If two different callers upload identical bytes, they receive the same file_id and share the underlying blob. A DELETE by either caller removes the file for both. This is safe when Coulisse runs as a single-tenant tool (one team, one trusted process). Do not expose Coulisse to mutually untrusted users until per-user deduplication is implemented (tracked in #61).
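Content-addressed deduplication amounts to keying blobs by their SHA-256 digest. This in-memory Python sketch (a hypothetical class, with uuid-based ids standing in for Coulisse's file ids) also shows why the global keying matters: both callers get the same file_id, so one DELETE removes the file for both.

```python
import hashlib
import uuid
from typing import Dict


class DedupStore:
    """Global (not per-user) content-addressed store, as in Coulisse v1."""

    def __init__(self) -> None:
        self._by_digest: Dict[str, str] = {}  # sha256 hex -> file_id
        self.blobs: Dict[str, bytes] = {}     # file_id -> bytes

    def upload(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._by_digest:
            return self._by_digest[digest]  # same bytes: reuse, write nothing
        file_id = f"file-{uuid.uuid4().hex}"  # illustrative id format
        self._by_digest[digest] = file_id
        self.blobs[file_id] = data
        return file_id

    def delete(self, file_id: str) -> None:
        # Idempotent, and global: every caller holding this file_id loses it.
        data = self.blobs.pop(file_id, None)
        if data is not None:
            del self._by_digest[hashlib.sha256(data).hexdigest()]
```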
What Coulisse does NOT do
Coulisse does not parse, extract, or summarise file content. It stores the bytes and forwards them to the LLM backend. If the model supports the file type (e.g. GPT-4o reads PDFs natively), it will process it. If it does not, the request fails at the LLM level — Coulisse surfaces the error as-is.
If you want structured extraction (e.g. parse a CV into memory facts), that is a pattern you implement with a Coulisse agent that calls memory.put — see the per-user memory chapter.
Structured outputs
Coulisse lets the caller pin the shape of the reply, not just its language. Send a JSON Schema and you get back a JSON value that conforms to it — validated server-side before it ever reaches you.
This is the same response_format field OpenAI's API exposes, so existing SDK calls work unchanged. The difference: Coulisse enforces it for every provider, including models that have no native structured-output mode. The schema is taught to the model through the system preamble and the reply is validated (and repaired) on the way out, so anthropic, gemini, groq, cohere, and deepseek behave the same as openai.
How to send it
Add a response_format object to the request. Two shapes are supported.
Any JSON object
{
"model": "assistant",
"safety_identifier": "user-123",
"messages": [{"role": "user", "content": "Give me a config skeleton"}],
"response_format": {"type": "json_object"}
}
The reply is guaranteed to be a single valid JSON value — no markdown fences, no prose.
A specific JSON Schema
{
"model": "assistant",
"safety_identifier": "user-123",
"messages": [{"role": "user", "content": "Extract the person from: Ada Lovelace, 36"}],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "person",
"description": "a single person record",
"schema": {
"type": "object",
"properties": {
"age": {"type": "integer"},
"name": {"type": "string"}
},
"required": ["age", "name"],
"additionalProperties": false
}
}
}
}
The json_schema object mirrors OpenAI's: name (required), schema (required, a standard JSON Schema), and optional description and strict. The reply is validated against schema before it's returned.
Omit response_format entirely (or send {"type": "text"}) for a normal free-form reply.
How it reaches the model
Coulisse appends a short instruction to the system preamble before calling the provider — for a json_schema request it embeds the schema, its name, and (if given) its description, and tells the model to emit only the raw JSON value. Your own system messages and the agent's coulisse.yaml preamble are preserved.
After the model replies, Coulisse:
- Extracts the JSON, tolerating a stray markdown code fence if the model added one.
- Validates it: parses it and, for json_schema, checks it against the schema.
- Returns the cleaned JSON as the reply content (re-serialized, so any surrounding prose or fences are stripped).
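Extraction that tolerates a stray fence could look like this Python sketch. The regex and the use of json.dumps defaults for re-serialization are assumptions; schema validation is left out because it would need a third-party JSON Schema library. The fence string is built indirectly so the example itself stays valid inside a markdown code block.

```python
import json
import re
from typing import Any, Tuple

FENCE = "`" * 3  # three backticks, built indirectly

# An optional "json" info string, then the fenced body (non-greedy).
_FENCED = re.compile(FENCE + r"(?:json)?[ \t]*\n(.*?)\n?[ \t]*" + FENCE, re.DOTALL)


def extract_json(reply: str) -> Tuple[Any, str]:
    """Return (parsed value, re-serialized text); raises json.JSONDecodeError if invalid."""
    text = reply.strip()
    match = _FENCED.search(text)
    if match:
        text = match.group(1).strip()  # drop the fence and any prose around it
    value = json.loads(text)
    return value, json.dumps(value)
```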
Repair on failure (non-streaming)
If validation fails, Coulisse re-prompts the same model with its own invalid reply plus the exact validation error, up to two times. Each retry is targeted ("you were missing the required field age"), not a blind re-roll. Token usage across every attempt is summed into the response's usage so billing and rate limits stay accurate.
If the reply still doesn't validate after the retries, the request fails with 502 Bad Gateway — the model couldn't comply.
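The targeted repair loop, with token usage summed across attempts, can be sketched as below. call_model and validate are hypothetical callables you would wire to a provider and a schema validator, and the wording of the corrective message is illustrative.

```python
from typing import Callable, Dict, List, Optional, Tuple

MAX_REPAIRS = 2  # documented: up to two targeted retries

Message = Dict[str, str]


class StructuredOutputError(Exception):
    """Surfaces as 502 Bad Gateway."""


def complete_structured(messages: List[Message],
                        call_model: Callable[[List[Message]], Tuple[str, int]],
                        validate: Callable[[str], Optional[str]]) -> Tuple[str, int]:
    """Return (valid reply, total tokens across all attempts)."""
    total_tokens = 0
    attempt = list(messages)
    error: Optional[str] = None
    for _ in range(1 + MAX_REPAIRS):
        reply, tokens = call_model(attempt)
        total_tokens += tokens  # every attempt counts toward usage
        error = validate(reply)
        if error is None:
            return reply, total_tokens
        # Targeted repair: show the model its own reply plus the exact error.
        attempt = attempt + [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": f"That reply was invalid: {error}. Return only the corrected JSON."},
        ]
    raise StructuredOutputError(f"response did not match the schema: {error}")
```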
Streaming
With stream: true, the instruction is injected the same way and tokens stream to you as usual. Coulisse validates the full accumulated reply once the stream ends. Because already-streamed tokens can't be retracted, a validation failure surfaces as an SSE error event rather than a repair retry — so for guaranteed-valid-or-error semantics, prefer non-streaming requests with structured output.
Errors
| Status | When |
|---|---|
| 400 | The supplied JSON Schema itself is malformed (rejected before any model call). |
| 502 | The model's reply never validated, even after repair retries. |
{
"error": {
"type": "upstream_error",
"message": "response did not match the schema: \"age\" is a required property"
}
}
Response language
Coulisse lets the caller pin the language the model replies in. Without it, the model infers language from the user's message — which can drift when the user switches languages mid-conversation or types short, ambiguous prompts. With it, every response comes back in the language you asked for.
Language is set per request, via the metadata object. The caller decides — Coulisse doesn't maintain a user-level language preference.
How to send it
Add a language key to metadata. The value is a BCP 47 tag (RFC 5646):
{
"model": "assistant",
"safety_identifier": "user-123",
"metadata": {
"language": "fr-FR"
},
"messages": [
{"role": "user", "content": "Hello!"}
]
}
Any valid BCP 47 tag works: en, fr, fr-FR, es-MX, zh-Hant, ja-JP. The tag is validated — malformed values come back as 400 Bad Request. Omit the key entirely to let the model pick.
How it reaches the model
Coulisse appends a short instruction to the system preamble before calling the provider, something like "Always reply in French, even when the user writes in a different language. Do not include translations in any other language." The instruction is phrased as a hard constraint so the model doesn't mirror the user's language or append a parenthetical translation. For tags in the built-in language-name table (common ISO 639-1 subtags: en, fr, es, de, it, pt, ja, zh, ko, ar, nl, pl, ru, sv, tr, hi), the instruction uses the English name. For anything else, the raw tag is passed through; frontier models understand BCP 47 directly, so cy (Welsh) works fine.
The instruction is added once per request, as the first system message. Your own system messages in the messages array still apply, and agent preambles from coulisse.yaml are preserved.
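A sketch of the validation and instruction-building steps. The regex is a deliberately simplified shape check rather than the full RFC 5646 grammar, looking up the name by primary subtag is an assumption based on the table described above, and the instruction text follows the example sentence.

```python
import re

LANGUAGE_NAMES = {
    "en": "English", "fr": "French", "es": "Spanish", "de": "German",
    "it": "Italian", "pt": "Portuguese", "ja": "Japanese", "zh": "Chinese",
    "ko": "Korean", "ar": "Arabic", "nl": "Dutch", "pl": "Polish",
    "ru": "Russian", "sv": "Swedish", "tr": "Turkish", "hi": "Hindi",
}

# Simplified shape check: a 2-8 letter primary subtag, then 1-8 char alphanumeric subtags.
_TAG = re.compile(r"[A-Za-z]{2,8}(?:-[A-Za-z0-9]{1,8})*")


def language_instruction(tag: str) -> str:
    if not _TAG.fullmatch(tag):  # also rejects "" and whitespace-only values
        raise ValueError(f"invalid language tag: {tag!r}")
    primary = tag.split("-")[0].lower()
    name = LANGUAGE_NAMES.get(primary, tag)  # unknown subtags pass through raw
    return (f"Always reply in {name}, even when the user writes in a different language. "
            "Do not include translations in any other language.")
```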
Real-world example: country code to language
A common pattern is to derive the language from the caller's locale on your side — phone country code, IP-based geolocation, browser Accept-Language, a user profile setting — and forward the resulting tag:
{
"model": "assistant",
"safety_identifier": "+33612345678",
"metadata": {
"language": "fr-FR"
},
"messages": [
{"role": "user", "content": "What's the weather?"}
]
}
Coulisse doesn't do the mapping itself. It takes the tag you send and asks the model to respond in that language. That keeps the metadata format stable and the country-code-to-language table (which changes slowly but does change) out of server code.
Errors
A malformed tag returns 400 Bad Request:
{
"error": {
"type": "invalid_request",
"message": "invalid `metadata.language`: invalid language tag: ..."
}
}
Empty-string and whitespace-only values are rejected the same way.
Token cost tracking
Coulisse converts each chat completion's token usage into a USD cost using a vendored snapshot of LiteLLM's model pricing table. The cost lands in the per-turn llm_call event alongside the raw token counts, so the studio UI shows it next to every model call.
There's nothing to enable. As long as a turn produces token usage and the model is in the table, you'll see a $0.0042-style badge on the corresponding llm_call row in the per-turn event tree.
How it's computed
For each completion Coulisse looks up the configured (provider, model) pair in the vendored table, multiplies each token count by its per-token price, and sums the products:
- input_tokens × input_cost_per_token
- output_tokens × output_cost_per_token
- cache_creation_input_tokens × cache_creation_input_token_cost (Anthropic prompt-cache writes)
- cached_input_tokens × cache_read_input_token_cost (Anthropic prompt-cache reads)
Missing fields in the upstream table are treated as zero — fine for providers like Groq that don't price cache tokens. Models that don't appear in the table at all yield a null cost: the request still succeeds, the llm_call event still records the token usage, and the studio simply omits the cost badge.
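The arithmetic is a sum of four products. The Python sketch below keys prices by model name only (a simplification of the (provider, model) lookup), uses made-up example prices in its usage, and treats missing price fields as zero as described.

```python
from typing import Mapping, Optional

# (usage field, LiteLLM price field) pairs that are multiplied and summed.
PRICED_FIELDS = [
    ("input_tokens", "input_cost_per_token"),
    ("output_tokens", "output_cost_per_token"),
    ("cache_creation_input_tokens", "cache_creation_input_token_cost"),
    ("cached_input_tokens", "cache_read_input_token_cost"),
]


def turn_cost_usd(prices: Mapping[str, Mapping[str, float]], model: str,
                  usage: Mapping[str, int]) -> Optional[float]:
    entry = prices.get(model)
    if entry is None:
        return None  # model not in the table: no cost badge
    # Missing or null price fields count as zero (e.g. providers without cache pricing).
    return sum(usage.get(tok, 0) * (entry.get(price) or 0.0) for tok, price in PRICED_FIELDS)
```

With hypothetical prices of $3 per million input tokens and $15 per million output tokens, 1,000 input plus 200 output tokens cost 0.003 + 0.003 = $0.006.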
Refreshing the pricing table
The snapshot lives at crates/providers/data/model_prices.json and is checked into git. New models are added upstream regularly; refresh the snapshot with:
just refresh-prices
This downloads the latest version from LiteLLM's main branch and overwrites the local file. The diff lands in git like any other change so you can review what moved before committing.
There's no live fetching at runtime: cost lookup only ever reads from the vendored snapshot. That keeps the request path free of network dependencies and makes pricing updates an explicit, reviewable action.
What's not (yet) covered
- EUR or other currencies. Cost is stored and displayed in USD only. If there's demand for a configurable display currency (telemetry.display_currency: { code: EUR, usd_rate: 0.92 }-style), it can be added without changing the on-disk format.
- Cost-based rate limiting. Rate limits currently work on token counts. Cost is recorded but not yet enforced; a future usd_per_day: knob would consume the same data.
- Per-tool / per-MCP cost. Tool calls have their own tool_call events but don't carry a cost themselves. Costs are charged to the parent llm_call event, which is the only place tokens are spent.
- Custom or unlisted models. Self-hosted models or models that LiteLLM hasn't added yet won't have a price. There's no YAML override path today; if you need one, open an issue describing the use case.
Skills
A skill is a reusable bundle of instructions you can hand an agent on demand — the same idea as skills in Claude Code or Codex. You write a folder with a SKILL.md describing how to do something ("review a resume", "negotiate a salary", "triage a bug report"), and any agent that opts in can pull those instructions in exactly when they're relevant.
Not to be confused with the coulisse skill CLI command, which installs Coulisse itself as a skill into your coding assistant. This page is about the skills: config section, a primitive alongside mcp, tools, and subagents that your own agents use.
The point is progressive disclosure. An agent's preamble is always in context and costs tokens on every turn. A skill is different: only its one-line description sits in the model's tool list, cheaply advertising that the skill exists. The full body is delivered only when the model decides to use it. You can ship a dozen detailed playbooks without bloating every request.
Writing a skill
A skill is a directory containing a SKILL.md. The file is optional YAML frontmatter followed by a markdown body:
skills/
resume-review/
SKILL.md
rubric.md
---
name: resume-review
description: Review a candidate resume against a role and produce structured feedback.
---
Score the resume on clarity, relevance, and impact. For the scoring rubric and
weights, read the `rubric.md` file bundled with this skill.
Return: a one-line verdict, three strengths, three gaps, and a hire/no-hire lean.
Frontmatter fields:
- name: how the skill is addressed in YAML and exposed to the model as a tool. Optional; defaults to the directory name. Use a tool-safe name (letters, digits, _, -).
- description: the one-line summary the model sees in its tool list. This is what it uses to decide whether to reach for the skill, so write it for the caller, not for yourself.
No frontmatter is fine too — a bare SKILL.md becomes a skill named after its directory with an empty description.
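Reading a SKILL.md with the defaults described above can be sketched in a few lines. parse_skill_md is hypothetical and only understands flat key: value frontmatter lines, not full YAML (no quoting or multi-line values).

```python
import re
from typing import Dict

TOOL_SAFE = re.compile(r"[A-Za-z0-9_-]+")


def parse_skill_md(text: str, dir_name: str) -> Dict[str, str]:
    """Return name, description and body; defaults follow the rules above."""
    name, description, body = dir_name, "", text
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        closing = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
        if closing is not None:
            for line in lines[1:closing]:
                key, sep, value = line.partition(":")
                if not sep:
                    continue
                key, value = key.strip(), value.strip()
                if key == "name" and value:
                    name = value
                elif key == "description":
                    description = value
            body = "\n".join(lines[closing + 1:]).strip("\n")
    if not TOOL_SAFE.fullmatch(name):
        raise ValueError(f"skill name {name!r} is not tool-safe")
    return {"name": name, "description": description, "body": body}
```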
Bundled resource files
Anything else in the skill's directory is a bundled resource the skill body can point at — a rubric, a template, a checklist, a reference doc. The model fetches them on demand through a built-in skill_file tool (one extra level of progressive disclosure: the body loads on use, a referenced file loads only when the model follows the pointer).
Resource access is sandboxed: only files discovered under the skill's own directory at load time are reachable. A skill cannot read outside its folder.
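Sandboxing by load-time discovery means a requested path is looked up in an index built when the skill loads, never joined onto the filesystem at request time. In this Python sketch (hypothetical helper names), anything not in the index, including ../ escapes, is refused; skipping symlinks is an extra precaution assumed here.

```python
import os
from pathlib import Path
from typing import Dict


def index_skill_files(skill_dir: str) -> Dict[str, Path]:
    """Snapshot every regular file under the skill directory at load time."""
    root = Path(skill_dir).resolve()
    index: Dict[str, Path] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            full = Path(dirpath) / fname
            if full.is_symlink():
                continue  # extra precaution: never follow links out of the folder
            index[full.relative_to(root).as_posix()] = full
    return index


def read_skill_file(index: Dict[str, Path], path: str) -> str:
    # Only exact relative paths seen at load time resolve; "../x" never matches a key.
    full = index.get(path)
    if full is None:
        raise PermissionError(f"{path!r} is not a bundled resource of this skill")
    return full.read_text(encoding="utf-8")
```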
Enabling skills
By default Coulisse scans ./skills — dropping a folder there is all it takes. Point elsewhere with the top-level block:
skills:
dir: ./playbooks
A missing directory is not an error; it simply yields no skills.
Agents opt in by name, the same way they opt into MCP tools and subagents:
agents:
  - name: recruiter
    provider: anthropic
    model: claude-sonnet-4-6
    skills: [resume-review, salary-negotiation]
Names that don't match a loaded skill are ignored. An agent with no skills: array gets none.
What the model sees
When an agent has at least one usable skill, its tool list gains:
- one tool per listed skill, named after the skill and described by its description. Calling it returns the skill's full SKILL.md body.
- skill_file: reads a bundled resource by skill name and path (relative to that skill's directory).
A typical flow: the model reads a skill's description, decides it's relevant, calls the skill tool to load the instructions, then follows any pointers to bundled files via skill_file.
Skills vs. MCP tools
Skills carry instructions; MCP servers carry capabilities and side effects. A skill tells an agent how to do something; it does not run code, touch the network, or mutate state. If a skill's procedure needs to execute something — score a document with a script, hit an API, write a file — that step belongs in an MCP tool the skill's body tells the model to call. Keeping the boundary here is deliberate: skills stay pure, inspectable text, and anything with effects goes through MCP where it's configured and observed.
Async tasks
Coulisse's primary surface is the OpenAI-compatible /v1/chat/completions endpoint — synchronous, request/response. That's the right shape for chat-driven workflows where a user is waiting on a reply.
It's the wrong shape for everything else: research that takes minutes, scheduled checks, agents that should keep running after the user closes the tab, narration emitted as work progresses. For those, Coulisse has an async lane built on top of the same agent runtime.
How it works
A tasks table stores work the system has accepted but hasn't completed:
queued → running → done | errored
When something fires off a task (the dispatch_task tool from inside an agent run, or a cron, boot, or webhook trigger), a row lands in the table. A background worker pool inside the same Coulisse process drains the queue. Each worker pulls the oldest queued task, transitions it to running, calls the same Agents::complete path the sync HTTP endpoint uses, and writes the final reply (or the error) back to the row.
Workers don't know how their task got enqueued. They just see "run agent X with prompt Y for user Z." That's deliberate. Every trigger type produces the same shape of work, so adding new triggers (MCP event subscriptions are planned next) doesn't touch the worker code.
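Here is a rough model of that lifecycle in Python, using the standard sqlite3 module and a hypothetical tasks schema. The real table lives in crates/tasks/ and may differ. enqueue is what a trigger or dispatch_task would do. worker_step is one iteration of a worker, with run_agent standing in for Agents::complete.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE tasks (
    seq     INTEGER PRIMARY KEY AUTOINCREMENT,  -- insertion order: oldest first
    id      TEXT NOT NULL UNIQUE,               -- task_id handed back to the caller
    agent   TEXT NOT NULL,
    prompt  TEXT NOT NULL,
    user_id TEXT NOT NULL,
    state   TEXT NOT NULL DEFAULT 'queued',     -- queued | running | done | errored
    result  TEXT                                -- final reply or error message
)
"""


def enqueue(db: sqlite3.Connection, agent: str, prompt: str, user_id: str) -> str:
    """Insert a queued row and return its id immediately (fire-and-forget)."""
    task_id = str(uuid.uuid4())
    with db:
        db.execute(
            "INSERT INTO tasks (id, agent, prompt, user_id) VALUES (?, ?, ?, ?)",
            (task_id, agent, prompt, user_id),
        )
    return task_id


def claim_oldest(db: sqlite3.Connection):
    """Move the oldest queued task to running; None if the queue is empty."""
    while True:
        row = db.execute(
            "SELECT id, agent, prompt, user_id FROM tasks "
            "WHERE state = 'queued' ORDER BY seq LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with db:
            cur = db.execute(
                "UPDATE tasks SET state = 'running' WHERE id = ? AND state = 'queued'",
                (row[0],),
            )
        if cur.rowcount == 1:  # we won the race for this row
            return row


def worker_step(db: sqlite3.Connection, run_agent) -> bool:
    """Run one task through run_agent(agent, prompt, user_id); False when idle."""
    task = claim_oldest(db)
    if task is None:
        return False
    task_id, agent, prompt, user_id = task
    try:
        state, result = "done", run_agent(agent, prompt, user_id)
    except Exception as exc:  # the error is recorded on the row, not raised
        state, result = "errored", str(exc)
    with db:
        db.execute(
            "UPDATE tasks SET state = ?, result = ? WHERE id = ?",
            (state, result, task_id),
        )
    return True
```

The guarded UPDATE (AND state = 'queued') keeps two workers from claiming the same row. Whichever update matches first wins, and the loser moves on to the next queued task.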
Dispatching from an agent
Any agent with a configured task queue gets a built-in dispatch_task tool:
{
"name": "dispatch_task",
"description": "Enqueue a fire-and-forget background task...",
"parameters": {
"type": "object",
"properties": {
"agent": { "type": "string" },
"prompt": { "type": "string" }
},
"required": ["agent", "prompt"]
}
}
The agent calls it with the target agent name and an initial prompt; the tool returns a task_id immediately and the worker pool runs it in the background. The dispatching agent gets back only the id — not the result. This is the difference from the synchronous subagent dispatch (subagents: [...] in YAML), which blocks until the target replies.
When to use which:
- Subagent dispatch (sync): you need the answer before you can continue. "Ask user-tester for friction analysis, then summarize."
- dispatch_task (async): the work is genuinely fire-and-forget, or it's too long to make the caller wait. "Start a research task on X. I'll narrate progress as it runs."
Inspecting from an agent
Agents that get the read side of the queue also see a tasks_status tool:
{
"name": "tasks_status",
"description": "Report recent background tasks across every agent...",
"parameters": {
"type": "object",
"properties": {
"limit": { "type": "integer", "minimum": 1, "maximum": 100 },
"state": { "type": "string", "enum": ["queued", "running", "done", "errored"] }
},
"required": []
}
}
The tool returns a JSON {"tasks": [...]} array, newest first. Each entry carries the agent name, state, a truncated prompt, and the timestamps — enough for an orchestrator to answer "what's going on right now?" from chat, without you having to open /admin/live.
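The sketch below illustrates the read side by answering a tasks_status call from a hypothetical tasks table. The column names, the 80-character prompt preview, the default limit of 20, and clamping limit into 1..100 (rather than rejecting out-of-range values) are all assumptions made for the example.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE tasks (
    agent      TEXT NOT NULL,
    state      TEXT NOT NULL,
    prompt     TEXT NOT NULL,
    created_at INTEGER NOT NULL,  -- unix seconds
    updated_at INTEGER NOT NULL
)
"""
STATES = ("queued", "running", "done", "errored")
PREVIEW_CHARS = 80  # hypothetical truncation length


def tasks_status(db: sqlite3.Connection, limit: int = 20, state=None) -> str:
    limit = max(1, min(int(limit), 100))  # clamp to the schema's 1..100 bounds
    sql = "SELECT agent, state, prompt, created_at, updated_at FROM tasks"
    params = []
    if state is not None:
        if state not in STATES:
            raise ValueError(f"unknown state {state!r}")
        sql += " WHERE state = ?"
        params.append(state)
    sql += " ORDER BY created_at DESC, rowid DESC LIMIT ?"  # newest first
    params.append(limit)
    tasks = []
    for agent, st, prompt, created_at, updated_at in db.execute(sql, params):
        if len(prompt) > PREVIEW_CHARS:
            prompt = prompt[:PREVIEW_CHARS] + "…"
        tasks.append({"agent": agent, "state": st, "prompt": prompt,
                      "created_at": created_at, "updated_at": updated_at})
    return json.dumps({"tasks": tasks})
```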
Boot-time reaping
When Coulisse stops mid-task, the worker dies with it and the row is left at running, with no one to mark it done or errored. On the next coulisse start, before any worker spawns, the queue is swept: every task still in running becomes errored with the reason "process restarted before task completed". This pairs naturally with a boot trigger. The wake-up agent sees the reaped rows via tasks_status (filtered by state=errored) and can decide whether to re-dispatch them, escalate, or move on.
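The sweep itself is a single statement. Here is a minimal sketch, assuming a hypothetical tasks table that stores the error text in a result column:

```python
import sqlite3

REAP_REASON = "process restarted before task completed"


def reap_orphans(db: sqlite3.Connection) -> int:
    """Mark every task still in 'running' as 'errored'; return how many were reaped."""
    with db:  # commits on success
        cur = db.execute(
            "UPDATE tasks SET state = 'errored', result = ? WHERE state = 'running'",
            (REAP_REASON,),
        )
    return cur.rowcount
```

It matters that this runs before the pool starts. Once workers exist, a row in running might belong to a live worker rather than a dead one.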
Configuration
There's no tasks: YAML section yet — the queue is always on, with four workers by default. A future tasks: block will let you tune worker count and disable the queue entirely if you don't want async work running in your deployment.
Architecture notes
- Lives in crates/tasks/. Owns the tasks SQLite table; no other crate touches it.
- The TaskQueue and TaskStatus traits live in coulisse-core, so agents can build the dispatch_task and tasks_status tools without depending on tasks directly. This mirrors the existing ScoreLookup / OneShotPrompt / AgentResolver pattern.
- Workers run in cli/src/workers.rs, spawned alongside the HTTP server. They share the same Agents runtime, so a background task can call MCP tools and dispatch subagents exactly like a sync request.
- No special shutdown handling yet. Workers die with the process. A graceful drain that lets in-flight tasks finish before exit is on the roadmap.
Triggers
A trigger is a way to start an agent without anyone making an HTTP request. Cron fires on a schedule; webhooks fire on an inbound POST; boot triggers fire once when Coulisse starts. All three convert to the same shape — a task enqueued via the queue — so the agent runtime doesn't know or care how it was summoned.
This is the primitive that makes Coulisse feel like an office instead of a request handler: agents wake up because something happened, not because someone is waiting.
Why this is platform-agnostic
There's no chat-platform-specific code in Coulisse. The webhook trigger accepts JSON POSTs from anything that can speak HTTP. Connecting Slack means pointing Slack's built-in outgoing webhooks at Coulisse. Connecting GitHub means setting up a webhook on the repo. Anything else that can POST JSON can summon an agent the same way. Coulisse doesn't know the source; it only sees an HTTP request.
The cron trigger is purely internal — zero external dependencies.
Cron triggers
Configure under the top-level triggers: list in coulisse.yaml:
triggers:
  - name: daily-standup
    type: cron
    schedule: "0 9 * * *"  # every day at 09:00
    agent: pm
    prompt: "Morning standup: summarize yesterday's activity in 5 bullet points."
  - name: hourly-watch
    type: cron
    schedule: "0 * * * *"  # every hour at :00
    agent: user-tester
    prompt: "One sentence on how things feel right now."
Fields:
- name: stable identifier used in logs and admin views. Must be unique within the file.
- type: cron is the discriminator. The other trigger types are boot and webhook, described below.
- schedule: cron expression, either 5-field (min hour day-of-month month day-of-week) or 6-field with leading seconds (sec min …). The 5-field form is normalised to 6-field by prepending 0 for the seconds field. Schedules are validated at startup; bad expressions refuse to boot.
- agent: name of the agent (or experiment) to invoke. Must exist in agents: or experiments:.
- prompt: static user message passed to the agent on each fire. Templating from the trigger payload is only available on webhook triggers.
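The field-count rule for schedule can be sketched in a few lines of Python. normalize_cron is a hypothetical helper. It only checks how many fields there are, whereas the startup validation described above also rejects malformed fields.

```python
def normalize_cron(expr: str) -> str:
    """Return a 6-field (seconds-first) cron expression.

    Only the field count is checked here; a real loader would also parse each
    field and refuse to boot on errors.
    """
    fields = expr.split()
    if len(fields) == 5:  # min hour dom month dow -> prepend "0" seconds
        return " ".join(["0", *fields])
    if len(fields) == 6:  # already sec min hour dom month dow
        return " ".join(fields)
    raise ValueError(f"expected 5 or 6 cron fields, got {len(fields)}: {expr!r}")
```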
When the trigger fires, Coulisse enqueues a task and a worker runs the agent through the same handler the sync /v1/chat/completions endpoint uses. The agent gets its full preamble, MCP tools, subagent dispatch, and narration — nothing about background runs is different. Watch them in /admin/live.
User identity
Cron-triggered tasks run as default_user_id (from the top of coulisse.yaml). If unset, they run as a synthetic cron user. Memory partitions are honoured: if daily-standup calls pm with default_user_id: main, it sees the same memory bucket as a human who sends a chat request as main.
Watching cron fire
Tail the log; you'll see one line per arm and one per fire:
INFO cron trigger armed trigger=daily-standup agent=pm
INFO cron trigger fired trigger=daily-standup agent=pm task_id=…
Or open /admin/live — tasks created by triggers appear in the Tasks panel the same way dispatch_task tasks do, with the trigger's prompt as the initial message and the agent name as written in YAML.
Boot triggers
A type: boot trigger fires exactly once when Coulisse starts. Use it for "wake up and decide what to do" prompts that should run on every coulisse start — e.g. asking an orchestrator agent to read the queue's leftovers and decide whether a standup is warranted, without forcing a ritual on every restart.
triggers:
  - name: wakeup
    type: boot
    agent: pm
    prompt: |
      You just came back online. Check `tasks_status` for what was running
      before the stop, look at recent commits, and decide whether to post
      a standup. Silence is fine when nothing demands attention.
Fields:
- type: boot — discriminator.
- agent, prompt — same as cron: which agent runs, with what initial message.
The task is enqueued during coulisse start, after the worker pool is up. Combined with the boot-time reaper that marks orphaned running tasks as errored, this gives the wake-up agent everything it needs to assess state and resume work — see Async tasks for the queue semantics.
Webhook triggers
A type: webhook trigger declares an HTTP path; Coulisse exposes POST <path> and fires the trigger on each request. This is the universal connector for outside systems — anything that can POST JSON can summon an agent. No chat-platform code in Coulisse.
triggers:
  - name: chat-mention
    type: webhook
    path: /hooks/chat-mention  # must start with /hooks/
    agent: pm
    prompt: "Message from {{sender}} in {{room_name}}: {{body}}"
Fields beyond the cron shape:
- type: webhook is the discriminator.
- path: the HTTP path Coulisse exposes. Must start with /hooks/ to stay clear of the proxy (/v1/*), studio (/admin/*), and OAuth callbacks (/mcp/*). Must be unique across all webhook triggers.
- agent: name of the agent (or experiment) to invoke. Accepts the same {{a.b.c}} templating as prompt, so one webhook can route to different agents based on the inbound payload (see Templated agent below).
- prompt: a template. {{a.b.c}} placeholders pull values from the JSON payload by dot-path. Missing paths render as the literal {{ a.b.c }}, so debugging is obvious. Static prompts (no placeholders) work too and pass through unchanged.
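Here is a minimal Python sketch of the prompt templating rule. It assumes keys are word characters and that non-string values are JSON-encoded; the docs don't say how numbers or objects render. A missing path leaves the placeholder text exactly as written.

```python
import json
import re

# {{a.b.c}} with optional inner whitespace; keys are word characters.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+(?:\.\w+)*)\s*\}\}")
_MISSING = object()


def lookup(payload, dotted: str):
    """Walk nested JSON objects by dot-path; _MISSING if any step is absent."""
    value = payload
    for key in dotted.split("."):
        if not isinstance(value, dict) or key not in value:
            return _MISSING
        value = value[key]
    return value


def render(template: str, payload: dict) -> str:
    def substitute(match):
        value = lookup(payload, match.group(1))
        if value is _MISSING:
            return match.group(0)  # leave the placeholder visible for debugging
        # Strings go in as-is; numbers, booleans, and objects are JSON-encoded.
        return value if isinstance(value, str) else json.dumps(value)

    return PLACEHOLDER.sub(substitute, template)
```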
Fire it with curl:
curl -X POST http://localhost:8421/hooks/chat-mention \
-H 'Content-Type: application/json' \
-d '{"sender":"alice","room_name":"engineering","body":"@coulisse what is the state of the build?"}'
Response:
{ "ok": true, "task_id": "cb9b91c4-54db-4b8c-a564-08282e643c25" }
The task appears in /admin/live like any other.
Templated agent
The agent field accepts the same {{a.b.c}} templating as prompt. This lets one webhook fan out to different agents based on whatever the inbound payload carries — useful when a bridge POSTs one event per mentioned agent:
triggers:
  - name: chat-mention
    type: webhook
    path: /hooks/chat-mention
    agent: "{{agent}}"
    prompt: "@{{sender}} in #{{room}}: {{body}}"
The bridge does the iteration on its side and calls the same webhook N times, once per mentioned agent:
curl -X POST http://localhost:8421/hooks/chat-mention \
  -H 'Content-Type: application/json' \
  -d '{"agent":"pm","sender":"almaju","room":"standup","body":"any release blockers?"}'
curl -X POST http://localhost:8421/hooks/chat-mention \
  -H 'Content-Type: application/json' \
  -d '{"agent":"coder","sender":"almaju","room":"standup","body":"any release blockers?"}'
Two tasks land on the queue, one per agent.
A templated agent field is not cross-validated at config load, because the value isn't known until a request arrives. If the resolved name doesn't match any agent, the worker errors the task with an "unknown agent" message, which you'll see in /admin/live. If the agent placeholder fails to resolve at all (the path is missing from the payload), the webhook returns 400 Bad Request and nothing is enqueued.
What's not here yet
- Per-trigger user_id. Today every trigger fires as the same default_user_id. A future field will let triggers run as different synthetic users, which is useful for partitioning memory between scheduled jobs.
- Skip-on-overlap. If a cron fires while the previous run is still going, both queue up. A skip_if_running: true field would let users opt into "only one at a time."
- Signature verification on webhooks. Anyone who can reach /hooks/<path> can fire the trigger. For Internet-facing deployments you'd want a shared secret or HMAC check, configurable per trigger. Today the assumption is a loopback or trusted network.
Sidecars
A sidecar is a long-lived external process Coulisse spawns alongside itself: a Slack listener, a custom metrics exporter, a bridge to whatever chat platform you use — anything you'd otherwise launch in a separate terminal.
The point is not to add new agent capabilities — agents already get the world via MCP. The point is to keep "one YAML, one start command" honest. If running Coulisse needs you to remember to also run a bridge script, that property has quietly broken.
Coulisse stays platform-agnostic. The sidecars mechanism only knows how to spawn a command, capture its output, and restart it on crash.
Declaring sidecars
sidecars:
  - name: chat-bridge
    command: chat-bridge/.venv/bin/python
    args: [chat-bridge/bridge.py]
    env:
      BOT_PASSWORD: coulisse-dev
    restart: on-failure
  - name: heartbeat
    command: /bin/sh
    args: ["-c", "while true; do echo alive; sleep 60; done"]
    restart: always
Fields:
- name: stable identifier; appears in every log line emitted by or about the sidecar. Must be unique.
- command: the executable, as an absolute path or anything on PATH. No shell expansion; quote inside YAML if you need spaces.
- args: argv entries, one per list item.
- env: environment variables merged on top of Coulisse's own env. ${VAR} placeholders expand the same way the rest of coulisse.yaml expands them, so secrets don't have to be inlined.
- cwd: working directory. Defaults to wherever you ran coulisse start.
- restart: always / on-failure (default) / never. on-failure does not restart after a clean exit (status code 0); the other two are self-explanatory.
What happens when a sidecar runs
- Coulisse spawns the command in a tokio task at startup.
- The sidecar's stdout and stderr are routed line-by-line into Coulisse's own tracing log, tagged with sidecar=<name> and stream=stdout|stderr. You'll see them next to MCP messages and request logs.
- When the process exits, Coulisse evaluates the restart policy. It either backs off for two seconds and respawns, or stops watching the sidecar.
- There's no health check beyond "is the process still alive." If your sidecar hangs without exiting, Coulisse won't notice.
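The restart behaviour above reduces to a small decision function plus a loop. This Python sketch departs from Coulisse in three ways. It uses subprocess instead of tokio. It merges stderr into stdout instead of tagging each stream. It adds a max_spawns cap purely so it can terminate in a test. supervise and should_restart are hypothetical names.

```python
import subprocess
import time


def should_restart(policy: str, exit_code: int) -> bool:
    if policy == "always":
        return True
    if policy == "on-failure":  # the default: a clean exit (0) is left alone
        return exit_code != 0
    if policy == "never":
        return False
    raise ValueError(f"unknown restart policy {policy!r}")


def supervise(name, argv, policy="on-failure", backoff=2.0, max_spawns=None, log=print):
    """Spawn argv, forward its output line by line, respawn per policy.

    max_spawns exists only so the sketch can terminate in tests.
    Returns the last exit code once the loop stops.
    """
    spawns = 0
    while True:
        spawns += 1
        # stderr is merged into stdout to keep the sketch single-threaded.
        with subprocess.Popen(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, text=True) as proc:
            for line in proc.stdout:
                log(f"sidecar={name} {line.rstrip()}")
            code = proc.wait()
        if not should_restart(policy, code):
            return code
        if max_spawns is not None and spawns >= max_spawns:
            return code
        time.sleep(backoff)  # fixed delay, like the two-second crash-loop pause
```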
When not to use a sidecar
- If the work is part of the agent flow, expose it as an MCP server instead — that's the abstraction agents actually use.
- If the work is short-lived (a one-shot script), schedule it as a cron trigger that runs a small agent prompt instead.
- If the work needs to outlive Coulisse (database, message broker, homeserver), don't manage it as a sidecar — run it under your real init system (systemd, docker, supervisord). Sidecars die with Coulisse.
Known limitations
- Orphan processes on abrupt shutdown. Tokio's kill_on_drop sends SIGKILL to children when their Child handle drops. If Coulisse itself is killed before the runtime can run those destructors, however, children get reparented to PID 1 and keep running. coulisse stop sends a clean SIGTERM; in practice you may need pkill -f <command> to clean up orphans. A graceful-shutdown pass that explicitly sends SIGTERM to sidecars first is on the roadmap.
- No escalating backoff. The crash-loop delay is fixed at two seconds. A sidecar that's permanently broken (typo in command, missing dependency) will respawn every two seconds forever.
- No health checks. A hung sidecar that doesn't exit looks alive forever.
- No admin surface. Sidecar state lives only in the log. A future /admin/sidecars page would show running / restart-count / last-output.
LLM-as-judge evaluation
Coulisse can score every agent reply with a separate LLM — a judge — and persist the result so you can track quality over time. You describe what to evaluate in the YAML rubric; Coulisse handles scoring shape, format, sampling, and storage.
This is useful for watching agent drift, comparing model/preamble changes, and catching regressions without standing up a separate evaluation pipeline.
How it works
- A client sends a chat request. The agent replies as usual — the judge never blocks the response.
- After the reply is persisted, Coulisse runs each judge the agent opted in to, in a background task.
- Each judge samples according to its sampling_rate and skips the turn entirely if the draw misses. Otherwise it asks its backing model to score the assistant's reply against every rubric criterion at once.
- The response is parsed into one score row per criterion, persisted under the same user id as the conversation.
- Failures (bad JSON, provider error, timeout) are logged at warn and swallowed, since the user already got their answer.
Scores are stored in the same SQLite database as messages and memories, in a scores table keyed by message_id. Averages are computed at read time, not aggregated on write.
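The following Python sketch makes the pipeline concrete for one judge run: the sampling draw, a single model call, and parsing into one row per criterion. It expects a reply shaped like {"accuracy": {"score": 8, "reasoning": "..."}}. That shape is a hypothetical stand-in for Coulisse's internal format, and call_model represents the provider call.

```python
import json
import logging
import random

log = logging.getLogger("judge")


def parse_scores(raw: str, rubrics: dict) -> list:
    """Turn the judge's JSON reply into one (criterion, score, reasoning) per criterion."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    rows = []
    for criterion in rubrics:
        entry = data.get(criterion)
        if not isinstance(entry, dict):
            raise ValueError(f"missing criterion {criterion!r}")
        score = entry.get("score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
            raise ValueError(f"score for {criterion!r} not an integer in 0..=10: {score!r}")
        rows.append((criterion, score, str(entry.get("reasoning", ""))))
    return rows


def run_judge(judge: dict, message_id: str, reply: str, call_model, rng=random) -> list:
    """Background scoring for one turn; never raises, returns rows to persist."""
    if rng.random() >= judge.get("sampling_rate", 1.0):
        return []  # draw missed: this judge skips the turn entirely
    try:
        rows = parse_scores(call_model(judge, reply), judge["rubrics"])
    except Exception as exc:  # bad JSON, provider error, timeout...
        log.warning("judge %s failed on message %s: %s", judge["name"], message_id, exc)
        return []
    return [
        {"message_id": message_id, "judge": judge["name"],
         "criterion": c, "score": s, "reasoning": r}
        for c, s, r in rows
    ]
```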
YAML
agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are a helpful assistant.
    judges: [quality]  # opt in by name
  - name: translator
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: Translate into French.
    judges: [fluency]
judges:
  # Cheap, broad check: 100% of turns, small model.
  - name: quality
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 1.0
    rubrics:
      accuracy: Factual accuracy. Flag hallucinations.
      helpfulness: Whether the assistant answered the user's question.
      tone: Politeness and tone.
  # Targeted check for the translator: only 20% of turns.
  - name: fluency
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 0.2
    rubrics:
      grammar: Grammatical correctness of the French output.
      naturalness: How native the phrasing sounds.
The wiring is visible from the agent: when you read an agent block you see which judges score it, rather than having to hunt through the judge list to figure out coverage.
Rubrics
A rubric is a map from criterion name to a short description of what to assess.
rubrics:
  accuracy: Factual accuracy. Flag hallucinations.
  helpfulness: Whether the assistant answered the user's question.
Keep descriptions terse and assess-able. Don't write scale, format, or JSON instructions into them — Coulisse adds those internally. The description should tell the judge what matters, not how to answer.
Each criterion produces one Score row per scored turn, with its own numeric value and short reasoning. All criteria for one judge are evaluated in a single LLM call, so adding criteria to a judge doesn't multiply cost.
Scoring shape
Every score is an integer in 0..=10 with a one-sentence reasoning. Coulisse forces this shape through the preamble and parses the judge's JSON reply — you don't configure it.
If you need a different scale (e.g. boolean pass/fail, categorical), that will arrive as a future scale: field; the default stays numeric 0-10.
Sampling
sampling_rate controls what fraction of turns are scored.
| Value | Meaning |
|---|---|
| 1.0 (default) | Score every turn. |
| 0.1 | Roughly 10% of turns. |
| 0.0 | Never score (useful to park a judge without deleting it). |
The draw is independent per turn, per judge. Over many turns the scored fraction converges on the configured rate. Lower rates save tokens for expensive judges; broad cheap judges can run at 1.0.
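A quick simulation shows both properties. Each judge gets its own draw per turn, and the scored fraction settles near the configured rate. The simulation uses Python's random.Random with a fixed seed; how Coulisse draws its random numbers internally isn't specified here.

```python
import random


def judges_that_fire(rates: dict, rng: random.Random) -> list:
    """One independent draw per judge for a single turn."""
    return [name for name, rate in rates.items() if rng.random() < rate]


def scored_fraction(rate: float, turns: int, seed: int = 0) -> float:
    """Fraction of simulated turns that a judge with this rate would score."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(turns) if rng.random() < rate)
    return hits / turns
```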
Choosing a judge model
Pick a model that's different from the agent being scored whenever you can. A judge scoring its own output is biased — a cheap cross-provider judge (e.g. gpt-4o-mini judging a Claude agent, or vice versa) is usually closer to neutral.
Strong, slow models make sense for low-volume deep checks (sampling_rate: 0.1). Cheap, fast models make sense for high-volume broad checks (sampling_rate: 1.0).
Multiple judges per agent
Stack judges to get different dimensions at different cost points:
agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    judges: [broad_check, deep_audit]
judges:
  - name: broad_check
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 1.0
    rubrics:
      helpfulness: Whether the user's question was answered.
      tone: Politeness and tone.
  - name: deep_audit
    provider: anthropic
    model: claude-opus-4
    sampling_rate: 0.05  # 5% of turns, expensive
    rubrics:
      accuracy: Factual accuracy, including references and claims.
      safety: Harmful, biased, or unsafe content.
Each judge is independent — its own model, rate, and rubric set. A turn can end up with zero, one, or both of these judges scoring it, depending on the sampling draw.
Viewing scores
The studio UI at /admin/ shows a Scores panel per user. It surfaces two things:
- Averages: mean score per (judge, criterion) across every turn the user has had, with sample count.
- Recent: the most recent individual scores with reasoning.
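Because averages are computed at read time, both panels can come straight from SQL over the scores table. The sketch below uses Python's sqlite3. Its schema is hypothetical; only the message_id keying and per-user storage come from the docs.

```python
import sqlite3

SCHEMA = """
CREATE TABLE scores (
    message_id TEXT NOT NULL,
    user_id    TEXT NOT NULL,
    judge      TEXT NOT NULL,
    criterion  TEXT NOT NULL,
    score      INTEGER NOT NULL,
    reasoning  TEXT NOT NULL,
    created_at INTEGER NOT NULL
)
"""


def averages(db: sqlite3.Connection, user_id: str) -> list:
    """Mean and sample count per (judge, criterion), computed at read time."""
    return db.execute(
        "SELECT judge, criterion, AVG(score), COUNT(*) FROM scores "
        "WHERE user_id = ? GROUP BY judge, criterion ORDER BY judge, criterion",
        (user_id,),
    ).fetchall()


def recent(db: sqlite3.Connection, user_id: str, limit: int = 10) -> list:
    """Most recent individual scores, newest first."""
    return db.execute(
        "SELECT judge, criterion, score, reasoning FROM scores "
        "WHERE user_id = ? ORDER BY created_at DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
```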
Validation at startup
Coulisse fails fast on:
- A judge referencing a provider that's not declared under providers:.
- A judge with no rubrics.
- A sampling_rate outside [0.0, 1.0].
- An agent referencing a judge name that doesn't exist.
Any violation aborts startup with a message naming the offending judge or agent.
Cost control
Two knobs matter:
- sampling_rate: the easy one. Halve it, halve the judge bill.
- Judge model: the big one. A gpt-4o-mini judge at 100% sampling often costs less than a gpt-4o judge at 10%. Pick the cheapest model that gives you a stable signal.
A useful pattern is to run a cheap judge at 100% and a strong judge at a small fraction — the cheap one catches the broad signal, the strong one spot-checks the hardest cases.
Experiments (A/B testing)
Run multiple agent configurations under a single addressable name and let Coulisse pick which one serves each request. Useful for comparing models, preambles, or tool sets without changing client code.
How it works
- Define each candidate as a normal agent under agents:.
- Declare an experiment whose name is what clients send as model.
- List the candidate agents as variants and choose a strategy.
When a request arrives, the router resolves the experiment name to one variant (and optionally fires off shadow runs in the background). The variant choice is sticky-by-user by default, so the same user always lands on the same variant for a given experiment — conversation memory and persona stay consistent across turns.
Strategies
Three strategies are wired today: split, shadow, and bandit.
split
Weighted random sampling. When sticky_by_user: true (the default), assignment is sticky by user: the variant is picked by deterministically hashing (user_id, experiment_name) and mapping the result onto the cumulative weights, with no database writes. Because that mapping depends on the variant list, adding or removing a variant reshuffles users.
agents:
  - name: assistant-sonnet
    provider: anthropic
    model: claude-sonnet-4-5-20250929
  - name: assistant-gpt
    provider: openai
    model: gpt-4o
experiments:
  - name: assistant  # what clients send as model
    strategy: split
    variants:
      - agent: assistant-sonnet
        weight: 0.5
      - agent: assistant-gpt
        weight: 0.5
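To illustrate sticky weighted assignment, here is a Python sketch that hashes (user_id, experiment_name) to a point in [0, 1) and walks the cumulative weights. SHA-256 and the 53-bit conversion are choices made for the example; the docs don't name the hash Coulisse uses.

```python
import hashlib


def unit_hash(user_id: str, experiment: str) -> float:
    """Deterministic point in [0, 1) for a (user, experiment) pair."""
    digest = hashlib.sha256(f"{user_id}\x00{experiment}".encode()).digest()
    # Keep 53 bits so the division is exact and can never reach 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def pick_variant(user_id: str, experiment: str, variants: list) -> str:
    """variants: [(agent_name, weight), ...] with every weight > 0."""
    total = sum(weight for _, weight in variants)
    point = unit_hash(user_id, experiment) * total
    cumulative = 0.0
    for agent, weight in variants:
        cumulative += weight
        if point < cumulative:
            return agent
    return variants[-1][0]  # float rounding can leave point == total
```

The point depends only on the user and experiment names, so the same user always lands on the same variant. Inserting a variant shifts the cumulative boundaries, which is why changing the variant list moves users around.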
shadow
Designate one variant as primary; it serves the user normally. The other variants run in the background against the same prepared context, are scored by their judges, and never write to the user's message history. The user never waits on shadow variants.
sampling_rate (default 1.0) controls how often shadow runs fire — set it lower to cap cost.
experiments:
  - name: assistant
    strategy: shadow
    primary: assistant-sonnet
    sampling_rate: 0.25  # 25% of turns also run the shadows
    variants:
      - agent: assistant-sonnet
      - agent: assistant-gpt
Use shadow to collect comparison data before flipping a split rollout — the primary serves all real traffic while you build up scoring evidence on the challenger.
bandit
Epsilon-greedy multi-armed bandit. It reads recent mean scores per variant from the existing scores table, picks the leader most of the time (with probability 1 - epsilon), and explores a random arm otherwise. Arms with fewer than min_samples recent scores are served regardless of their mean (forced exploration). The bandit therefore only exploits once every arm has enough evidence.
agents:
  - name: assistant-sonnet
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    judges: [quality]
  - name: assistant-gpt
    provider: openai
    model: gpt-4o
    judges: [quality]
judges:
  - name: quality
    provider: openai
    model: gpt-4o-mini
    rubrics:
      helpfulness: Whether the assistant answered the user's question.
experiments:
  - name: assistant
    strategy: bandit
    metric: quality.helpfulness  # judge.criterion
    epsilon: 0.1
    min_samples: 30
    bandit_window_seconds: 604800  # 7 days
    variants:
      - agent: assistant-sonnet
      - agent: assistant-gpt
Every variant agent must opt into the configured judge (quality), and that judge must define the criterion (helpfulness). Otherwise the bandit starves on that arm. Validation enforces this at startup.
A note on stickiness: with sticky_by_user: true (the default), the bandit decision is computed at request time via a deterministic hash of (user_id, experiment_name), so a given user typically lands on the same arm. Mean scores update as new data arrives, so a user can shift if a different arm overtakes the leader — that is the trade-off for keeping the assignment stateless.
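The selection rule can be sketched in Python as below. The sketch assumes stats already holds each arm's mean and sample count over bandit_window_seconds. It derives the explore/exploit draw from a hash of (user_id, experiment_name), mirroring the stickiness note above. How forced arms are chosen, and the function names, are assumptions.

```python
import hashlib


def unit_draw(*parts: str) -> float:
    """Deterministic point in [0, 1) derived from the given strings."""
    digest = hashlib.sha256("\x00".join(parts).encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def _index(u: float, n: int) -> int:
    return min(int(u * n), n - 1)  # guard against float rounding at the top edge


def pick_arm(user_id: str, experiment: str, stats: dict, epsilon: float, min_samples: int) -> str:
    """Epsilon-greedy choice over stats = {agent: (mean_score, sample_count)}.

    Draws are hashed from (user_id, experiment) instead of random, so the same
    user keeps getting the same answer until the stats change.
    """
    arms = list(stats)
    # Forced exploration: arms without enough evidence are served first.
    starved = [arm for arm in arms if stats[arm][1] < min_samples]
    if starved:
        return starved[_index(unit_draw(user_id, experiment, "forced"), len(starved))]
    # Explore with probability epsilon...
    if unit_draw(user_id, experiment, "explore") < epsilon:
        return arms[_index(unit_draw(user_id, experiment, "arm"), len(arms))]
    # ...otherwise exploit the arm with the best mean score.
    return max(arms, key=lambda arm: stats[arm][0])
```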
Namespace and migration
Experiment names share a namespace with agent names. To A/B-test an existing agent without breaking clients:
- Rename the agent (assistant → assistant-v1).
- Add a sibling agent (assistant-v2).
- Add an experiment named assistant with both as variants.
Clients keep sending model: assistant and it resolves transparently.
Variants stay individually addressable as agents under their own names (assistant-v1, assistant-v2) — useful for isolating one variant in tests or debugging.
Subagents
A subagent reference can name an agent or an experiment. If orchestrator lists subagents: [assistant] and assistant is an experiment, every subagent call resolves to a variant for the calling user, the same way a top-level request would. Sticky-by-user keeps the variant consistent across the whole conversation.
If the experiment is exposed as a subagent, give it a purpose. It becomes the tool description that the calling agent's LLM sees:
experiments:
  - name: assistant
    purpose: A general-purpose chat assistant.
    strategy: split
    variants:
      - agent: assistant-sonnet
      - agent: assistant-gpt
Bandit subagents read mean scores at call time, so the same exploit/explore behaviour applies inside subagent dispatch.
Telemetry
Each turn's TurnStart event includes agent (the resolved variant), and when an experiment was hit, experiment (the experiment name) and variant (same as agent). Judge scores are tagged with the variant's agent name in the database, so per-variant aggregation flows through the same table without a join — used by the bandit's mean-score query and the studio's per-variant view.
Studio
The studio shows configured experiments at /admin/experiments: strategy, sticky-by-user flag, and per-variant weight + share. For bandit experiments, the page additionally shows the configured metric, epsilon, and min-samples threshold, plus per-variant sample counts and mean scores (loaded inline via htmx from the judges admin endpoints). Shadow experiments call out the primary variant.
Validation
Coulisse rejects the following at startup:
- Experiment name colliding with an agent name (rename one).
- Experiment name colliding with another experiment.
- Experiment with zero variants.
- Variant referencing an undefined agent.
- Variant weight <= 0.
- Duplicate variant agent within one experiment.
- Strategy-specific fields used with the wrong strategy (e.g. primary on a split experiment).
- shadow without a primary, or with a primary that's not one of the variants.
- shadow sampling_rate outside [0.0, 1.0].
- bandit without a metric.
- bandit metric that doesn't match an existing judge.criterion, or a variant that doesn't opt into the metric's judge.
- bandit epsilon outside [0.0, 1.0].
Smoke tests
A smoke test is a synthetic-user persona that drives a conversation against one of your agents (or experiments). Coulisse plays the user — you write a preamble describing who they are and what they want — and the assistant replies for real. Every assistant turn flows through the same judge pipeline as production traffic, so you get a transcript and scores back without writing any harness code.
Smoke tests are most useful when you're iterating on a prompt: tweak the preamble, hit "Run now" in the studio, and watch the scores. Pair them with an experiment and one click with several repetitions samples the variants. Each repetition gets a fresh synthetic user id, so sticky-by-user routing spreads the runs across variants, and the judge scores feed straight into bandit selection.
How it works
- You trigger a run from the studio (/admin/smoke/<name>); no client needed.
- Coulisse opens a fresh synthetic user id and starts a loop:
  - The persona model produces a "user" message, given the conversation so far with roles flipped (so the model speaks as the user).
  - The target agent replies as it normally would, with all its real MCP tools, subagents, and preambles.
  - The reply is fanned out to every judge the target agent opts into. Scores land in the same scores table as production runs, keyed by the assistant turn's id.
- The loop stops when either side emits the configured stop_marker, or when max_turns is hit.
- The full transcript is browsable at /admin/smoke/runs/<run_id>, with the assistant in slate and the persona in amber.
Smoke runs never write to the user's memory or rate-limit windows. Each repetition uses a brand-new synthetic user id, so split/bandit experiments naturally sample variants across reps.
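The loop can be modelled in a few lines of Python. persona, agent, and judge are stand-in callables, not Coulisse APIs. The sketch also assumes an ordering rule: it stops as soon as the persona emits the marker, before the agent replies.

```python
import uuid


def flip_roles(history: list) -> list:
    """Show the conversation from the persona's side: user and assistant swap."""
    swap = {"user": "assistant", "assistant": "user"}
    return [{"role": swap[m["role"]], "content": m["content"]} for m in history]


def run_smoke(persona, agent, judge, initial_message=None, stop_marker=None, max_turns=10):
    """One repetition: persona-then-agent pairs until a stop marker or max_turns."""
    user_id = f"smoke-{uuid.uuid4()}"  # fresh synthetic user per repetition
    history = []
    for turn in range(max_turns):
        if turn == 0 and initial_message is not None:
            user_msg = initial_message
        else:
            user_msg = persona(flip_roles(history))
        history.append({"role": "user", "content": user_msg})
        if stop_marker and stop_marker in user_msg:
            break
        reply = agent(user_id, history)
        history.append({"role": "assistant", "content": reply})
        judge(user_id, len(history) - 1, reply)  # score the assistant turn
        if stop_marker and stop_marker in reply:
            break
    return user_id, history
```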
YAML
smoke_tests:
  - name: jobseeker_basic
    target: tremplin  # agent or experiment name
    persona:
      provider: anthropic
      model: claude-haiku-4-5-20251001
      preamble: |
        You are role-playing a 28-year-old looking for a developer job in Paris.
        Reply like a real human: short questions, follow-ups as the conversation goes.
        When you have a satisfactory answer, finish with "[FIN]".
    initial_message: "Hi, I'm looking for work."
    stop_marker: "[FIN]"
    max_turns: 10
    repetitions: 5
| Field | Required | Default | Notes |
|---|---|---|---|
| name | yes | | Unique within smoke_tests. Shows up at /admin/smoke/<name>. |
| target | yes | | Agent name or experiment name. Resolved through the experiment router per run. |
| persona | yes | | Provider, model, and preamble for the synthetic user. |
| initial_message | no | | Hard-coded first message from the persona. Skipping this lets the persona open the conversation. |
| stop_marker | no | | Substring that ends the run when emitted by either side. |
| max_turns | no | 10 | Cap on persona-then-agent pairs. |
| repetitions | no | 1 | Independent runs launched per "Run now" click. Each gets a fresh synthetic user id. |
Iterating with experiments
Define two variants of an agent (e.g. assistant-v1, assistant-v2), wrap them in a bandit experiment, and target the experiment name from a smoke test:
experiments:
  - name: assistant
    strategy: bandit
    metric: quality.helpfulness
    variants:
      - agent: assistant-v1
      - agent: assistant-v2
smoke_tests:
  - name: convergence
    target: assistant
    repetitions: 50
    persona: { provider: openai, model: gpt-4o-mini, preamble: "..." }
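Hit "Run now" once and the bandit collects scores from 50 independent conversations, spread across the variants according to its exploration rules. As scores accumulate, it shifts traffic toward the leader on its own, and the experiment page shows per-variant sample counts and mean scores.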
Limitations (today)
- Smoke runs bypass the memory pipeline. Fact extraction and semantic recall are not exercised.
- No scheduled runs — trigger is manual via the studio.
- No tool-call assertions; assertions about what the agent did during a turn live in the judge rubrics.
Telemetry
The telemetry: block controls observability — what Coulisse logs to stderr, what it persists to SQLite for the studio UI, and whether it ships traces to your own OpenTelemetry backend.
Every field has a sensible default. Omit the block and you get stderr logs at info plus the studio's per-turn event tree, with no external traces.
Shape
telemetry:
  fmt:
    enabled: true # stderr logs; default on
  sqlite:
    enabled: true # mirrors spans into the studio's tables; default on
  otlp: # absent = disabled (default)
    endpoint: "http://localhost:4317"
    protocol: grpc # or http_binary
    service_name: coulisse
    headers:
      authorization: "Bearer ${OTEL_API_KEY}"
All three layers compose. Turn sqlite off if you don't need the studio. Add otlp to ship the same traces to Grafana, SigNoz, Jaeger, Honeycomb, or any OTLP-compatible backend.
telemetry.fmt
| Field | Type | Required | Notes |
|---|---|---|---|
enabled | bool | no | Default true. |
Writes structured logs to stderr. The level is controlled by the RUST_LOG environment variable; without it, the default is info,sqlx=warn (info from Coulisse, warnings only from the SQL driver). To see internal SQL traffic, run with RUST_LOG=debug. To show only errors, set RUST_LOG=error.
telemetry.sqlite
| Field | Type | Required | Notes |
|---|---|---|---|
enabled | bool | no | Default true. |
Mirrors turn and tool_call tracing spans into the events and tool_calls tables that the studio UI reads. Without this layer, the studio loses its per-turn event tree and tool-call panel.
The schema is part of the same SQLite file the rest of Coulisse persists into (.coulisse/coulisse-memory.db).
telemetry.otlp
Absent (the default) means Coulisse does not export traces externally. To plug Coulisse into your own observability stack, set the block:
| Field | Type | Required | Notes |
|---|---|---|---|
endpoint | string | yes | Collector URL. |
protocol | enum | no | grpc (default) or http_binary. |
service_name | string | no | OpenTelemetry resource attribute service.name. Default coulisse. |
headers | map | no | Static HTTP/gRPC headers attached to every export. |
Endpoint defaults
- gRPC (the default): port 4317, e.g. http://localhost:4317.
- HTTP/protobuf: port 4318, e.g. http://localhost:4318/v1/traces.
The collector you point at decides the rest. Coulisse ships traces with service.name taken from service_name (coulisse by default) and span names turn, tool_call, and llm_call. Span fields carry user_id, turn_id, agent, tool_name, kind, and the rest documented in the features chapter.
Headers
Useful for managed backends:
telemetry:
  otlp:
    endpoint: "https://ingest.us.signoz.cloud:443"
    protocol: grpc
    headers:
      "signoz-access-token": "${SIGNOZ_TOKEN}"
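Coulisse expands ${...} placeholders from the environment before parsing the config (see Environment variables), so SIGNOZ_TOKEN only needs to be set in the server's environment; it never has to appear in the file.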
How the layers compose
The CLI installs a single tracing_subscriber registry with the layers your config asked for, in order:
- fmt → stderr, filtered by the RUST_LOG env filter (when fmt.enabled)
- sqlite → events + tool_calls tables (when sqlite.enabled)
- otlp → external collector (when otlp is set)
Every span emitted by the running server fans out to all enabled layers. There is no priority or fallback — the SQLite layer keeps full payloads (full prompts, args, results), the OTLP layer ships the same span attributes to your collector. If your backend chokes on multi-megabyte attributes, drop those fields in your collector pipeline rather than at the source.
Telemetry
Coulisse emits its own observability via the tracing crate. Every request opens a turn span; every tool invocation (MCP or subagent) opens a child tool_call span. The configured layers (fmt, SQLite, and optionally OTLP) receive those spans and route them to the destinations you've configured.
The result: the studio UI gives you an offline audit trail, and any OpenTelemetry-compatible backend (Grafana, SigNoz, Jaeger, Honeycomb, ...) gives you live traces. They're driven from the same source — there's no separate path.
Span model
| Span name | Opened when | Fields |
|---|---|---|
turn | a chat completion request arrives | agent, experiment (when applicable), turn_id, user_id, user_message |
tool_call | an MCP or subagent tool fires | args, error (on failure), kind (mcp/subagent), result, tool_name |
llm_call | a chat completion finishes (token usage is known) | cost_usd (when the model is in the pricing table), model, provider, usage |
turn is the root; tool_call and llm_call nest under it via the tracing span tree, so OTLP backends render them as a trace tree out of the box.
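The nesting can be illustrated by rebuilding the tree from flat span records. The record shape here (`span_id`, `parent_id`, `name`) is a simplification assumed for illustration; it is not the exact OTLP wire format or the studio's SQLite schema.

```python
from collections import defaultdict


def build_tree(spans):
    """Group flat span records under their parents; return an indented outline."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span["parent_id"] is None:
            roots.append(span)  # a turn span has no parent
        else:
            children[span["parent_id"]].append(span)

    lines = []

    def walk(span, depth):
        lines.append("  " * depth + span["name"])
        for child in children[span["span_id"]]:
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines
```

For a turn that called one tool and one model, the outline is `turn` with `tool_call` and `llm_call` indented beneath it, which is the shape an OTLP backend draws as a trace.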
Studio integration
When telemetry.sqlite.enabled is true (the default), the studio's per-turn event tree and tool-call panel render directly from the same spans. Nothing extra to wire up — open /admin/ and the tree is there.
OTLP backends
Set telemetry.otlp.endpoint to start exporting. The exporter batches spans, retries on transient failures, and shuts down cleanly on process exit so in-flight spans land before the server stops.
Tested with:
- Grafana (Tempo / Cloud): gRPC at 4317.
- SigNoz (self-hosted or Cloud): gRPC; for Cloud, add a signoz-access-token header.
- Jaeger: gRPC at 4317 (Jaeger ≥ 1.50 speaks OTLP natively).
- Honeycomb: HTTP/protobuf at https://api.honeycomb.io/v1/traces with an x-honeycomb-team header.
Tuning verbosity
The fmt layer (stderr logs) is controlled by RUST_LOG:
RUST_LOG=info,sqlx=warn coulisse # default
RUST_LOG=debug coulisse # verbose, including SQL driver
RUST_LOG=warn coulisse # quiet
RUST_LOG=coulisse=debug,agents=trace coulisse # per-crate filtering
The SQLite and OTLP layers are not affected by RUST_LOG — they capture every turn / tool_call / llm_call span regardless of log level.
Disabling layers
Each layer has its own enabled flag. Common combinations:
# Production with external observability stack
telemetry:
  sqlite:
    enabled: false # studio not exposed; no need to keep DB rows
  otlp:
    endpoint: "..."

# Local development, no external backend
telemetry: {} # default fmt + sqlite (same as omitting the block)

# CI / load tests: minimize logging overhead
telemetry:
  fmt:
    enabled: false
  sqlite:
    enabled: false
CLI reference
Coulisse ships as a single binary with a handful of subcommands. Every
subcommand accepts -c, --config <PATH> (default coulisse.yaml) and
honors the COULISSE_CONFIG env var as a fallback.
State files (coulisse.pid, coulisse.log) live in a .coulisse/
directory next to the config file — this keeps state co-located with
the project and makes cd && coulisse stop "just work."
coulisse init
Write a starter coulisse.yaml in the current directory.
coulisse init # minimal template (one OpenAI agent + sqlite memory)
coulisse init --from-example # full annotated example (every section, every option)
coulisse init --force # overwrite an existing coulisse.yaml
coulisse start
Start the server, detached by default. The command returns once the server has written its PID file, or fails if boot doesn't finish within 5 seconds.
coulisse start # detached background server
coulisse start --foreground # attached: logs stream to the terminal
coulisse start -F # short form
A bare coulisse invocation is equivalent to coulisse start --foreground — the historical pre-subcommand behavior is preserved.
When detached, stdout/stderr are appended to .coulisse/coulisse.log.
coulisse stop
Send SIGTERM to a running detached server (PID read from
.coulisse/coulisse.pid).
coulisse stop # graceful: SIGTERM, wait up to 10s
coulisse stop --force # SIGKILL (use if the server is wedged)
Stop is a no-op if the server isn't running — stale PID files left over from crashes are detected and removed.
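A common way to detect a stale PID file on POSIX is to send signal 0, which checks that a process exists without delivering anything. The sketch below illustrates that technique; `is_running` is a hypothetical helper, and Coulisse's own check may differ.

```python
import os
from pathlib import Path


def is_running(pid_file: Path) -> bool:
    """Return True if pid_file names a live process; delete it if stale."""
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no PID file, or unreadable contents
    if pid <= 0:
        return False  # 0 and negatives address process groups, not a process
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        pid_file.unlink(missing_ok=True)  # stale file left over from a crash
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```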
coulisse restart
Equivalent to coulisse stop && coulisse start.
coulisse reset
Delete the SQLite database, wiping all stored state — conversation
memory, long-term memories, telemetry, judge scores, rate-limit windows,
background tasks, and API tokens. Your coulisse.yaml is never touched.
Destructive and irreversible, so it refuses to run while a server holds the
database open (stop it first), and prompts for confirmation unless -y is
passed. Removes the database file (.coulisse/coulisse-memory.db) plus its
-wal/-shm sidecars.
coulisse reset # warns, lists the files, asks to confirm
coulisse reset -y # skip the prompt (for scripts / fresh starts)
coulisse status
Report whether the detached server is running and where its files live.
running (pid 31427)
config: ./coulisse.yaml
log: ./.coulisse/coulisse.log
coulisse studio
Open the studio UI (/admin/) in the default web browser. Requires
the server to be running — start it first with coulisse start.
coulisse studio # also: coulisse admin
# opening http://localhost:8421/admin/
The URL honors server.port from coulisse.yaml, so multiple Coulisse
instances on different ports each open their own studio.
coulisse token
Mint, list, and revoke the self-issued API tokens that gate /v1/* when
auth.proxy.tokens is enabled. Operates on
the same database the running server uses, so changes are live immediately.
coulisse token create laptop --principal alice # unlimited
coulisse token create ci --principal alice \
--budget monthly --limit 20 # $20 / month cap
coulisse token list # tokens + spend
coulisse token revoke <id> # immediate 401 for clients
create prints the secret (sk-coulisse-…) to stdout — shown only once —
and the id/context to stderr, so coulisse token create … > key.txt
captures just the key.
coulisse check
Load and validate the YAML without starting the server. Catches schema errors and cross-reference issues (agent → provider, agent → judge, experiment variant → agent, ...) before a real start.
coulisse check
# ok — coulisse.yaml (3 agents, 1 judges, 0 experiments, 2 providers)
coulisse schema
Emit the JSON Schema for coulisse.yaml to stdout. Redirect to a file
next to your config and reference it for IDE autocompletion and
validation:
coulisse schema > coulisse.schema.json
# yaml-language-server: $schema=./coulisse.schema.json
Picked up by the VS Code YAML extension, Helix, Neovim, Zed, JetBrains — anything that speaks the yaml-language-server directive. The schema is generated from the same Rust types that parse the config, so it never drifts.
coulisse update
Fetch the latest release from GitHub and replace the running binary
in place. Detects the host target triple (e.g.
aarch64-apple-darwin) and downloads the matching cargo-dist
artifact. No-op if you're already on the latest version.
coulisse update
# checking for updates...
# updated to 0.2.0
The binary needs write permission to its own path — if you installed
under /usr/local/bin you may need sudo.
State directory layout
your-project/
├── coulisse.yaml
└── .coulisse/
├── coulisse.pid # written by `start`, removed on clean exit
├── coulisse.log # detached stdout/stderr
├── secrets.env # MCP OAuth encryption keys (when configured)
├── files/ # uploaded file blobs (fs storage backend)
└── coulisse-memory.db # SQLite database
.coulisse/ holds the whole runtime footprint of one project under a
single directory: the SQLite database, uploaded files, logs, PID, and
secrets all land here, and the paths are not configurable. Mount this one
directory to persist Coulisse's state in Docker.
HTTP API
By default, Coulisse listens on 0.0.0.0:8421 (configurable under server:) and exposes an OpenAI-compatible surface.
POST /v1/chat/completions
The main chat endpoint. Accepts the standard OpenAI chat completion request shape.
Request
{
"model": "assistant",
"safety_identifier": "user-123",
"messages": [
{"role": "user", "content": "Hello!"}
]
}
| Field | Required | Notes |
|---|---|---|
messages | yes | The usual OpenAI message array. At least one user message is required. |
metadata | no | Optional map of strings. Used for per-request rate limits — see below. |
model | yes | Name of an agent from your config. |
response_format | no | Pin the reply shape: {"type": "json_object"} or {"type": "json_schema", "json_schema": {…}}. Validated and enforced for every provider — see Structured outputs. |
safety_identifier | yes¹ | Identifies the user. Can be any stable string. |
stream | no | When true, the response is an SSE stream of chat.completion.chunk frames. See Streaming responses. |
stream_options | no | Object. include_usage: true adds the usage field to the terminal stream chunk. |
user | — | Deprecated OpenAI field; accepted as a fallback. |
¹ Required unless a default_user_id is set in coulisse.yaml — see User identification.
Recognized metadata keys
metadata is a passthrough map of strings. Coulisse interprets the following keys; any other keys are ignored.
| Key | Type | Meaning |
|---|---|---|
language | BCP 47 tag | Forces the response language, e.g. fr-FR. See Response language. |
tokens_per_day | integer (as string) | Max tokens per rolling day. |
tokens_per_hour | integer (as string) | Max tokens per rolling hour. |
tokens_per_month | integer (as string) | Max tokens per rolling 30-day window. |
All optional. See Rate limiting for the token-limit behavior.
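For a client without an OpenAI SDK, the request above can be assembled with Python's standard library. This is a sketch: `build_chat_request` is a hypothetical helper, the base URL assumes the default port, and integer quotas are converted to strings because metadata is a map of strings.

```python
import json
import urllib.request


def build_chat_request(agent, user_id, text, quotas=None, language=None,
                       base_url="http://localhost:8421"):
    """Prepare (but don't send) a POST /v1/chat/completions request."""
    metadata = {}
    for key, value in (quotas or {}).items():
        metadata[key] = str(int(value))  # e.g. tokens_per_day -> "50000"
    if language:
        metadata["language"] = language  # BCP 47 tag such as "fr-FR"
    body = {
        "model": agent,                # an agent (or experiment) name
        "safety_identifier": user_id,  # stable per-user id
        "messages": [{"role": "user", "content": text}],
    }
    if metadata:
        body["metadata"] = metadata
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Send it with `urllib.request.urlopen(req)` and read `choices[0]["message"]["content"]` from the decoded JSON response.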
Response
Standard OpenAI chat completion response:
{
"id": "...",
"object": "chat.completion",
"created": 1714000000,
"model": "assistant",
"choices": [
{
"index": 0,
"message": {"role": "assistant", "content": "Hi!"},
"finish_reason": "stop"
}
]
}
Streaming
Set stream: true to receive chat.completion.chunk frames over Server-Sent Events instead of one JSON response. The full wire format and disconnect semantics live in Streaming responses.
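Without an SDK, a client has to pull the text out of the event stream itself. This sketch assumes OpenAI's framing (one `data:` line per chunk, ending with `data: [DONE]`); Streaming responses is the authority on what Coulisse actually emits.

```python
import json


def collect_stream(lines):
    """Join delta content from chat.completion.chunk SSE lines.

    Returns (text, usage); usage is None unless a chunk carried it.
    """
    parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
        if chunk.get("usage"):
            usage = chunk["usage"]  # sent when stream_options.include_usage is true
    return "".join(parts), usage
```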
Errors
Errors come back in OpenAI's error shape:
{
"error": {
"type": "invalid_request_error",
"message": "safety_identifier is required",
"code": null
}
}
Common cases:
- 400: missing safety_identifier (when required), no user message, unknown agent name, unparseable metadata values, or a malformed response_format JSON Schema.
- 429: per-user token limit exceeded. Includes a Retry-After header with the seconds until the window resets. See Rate limiting.
- 5xx: upstream provider error, MCP server failure, or a response_format reply that never validated after repair retries. See Structured outputs.
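A client can honor Retry-After on a 429 before trying again, and surface other errors from the error object. A minimal sketch: `send` is a hypothetical callable returning `(status, headers, body)` so the retry logic can be shown without a network, and many HTTP clients and SDKs already ship their own retry policy.

```python
import json
import time


def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send(), retrying 429s after the delay Retry-After asks for."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            break
        # Retry-After carries the seconds until the rate-limit window resets.
        sleep(int(headers.get("Retry-After", "1")))
    if status >= 400:
        error = json.loads(body)["error"]  # OpenAI-style error object
        raise RuntimeError(f"{status} {error['type']}: {error['message']}")
    return json.loads(body)
```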
GET /v1/models
Lists every agent defined in the config.
Response
{
"object": "list",
"data": [
{"id": "assistant", "object": "model", "owned_by": "coulisse"},
{"id": "code-reviewer", "object": "model", "owned_by": "coulisse"}
]
}
Useful for UI dropdowns that want to populate a model picker from the server.
Admin / config endpoints
Everything under /admin/* is a single content-negotiated surface. The same routes serve HTML pages to browsers, HTML fragments to htmx, and JSON to scripts — set Accept: application/json (or send an HX-Request header) to switch representation. Request bodies are equally tolerant: application/json, application/yaml, and application/x-www-form-urlencoded all deserialize into the same target type.
All admin routes are gated by the auth.admin scope.
Agents
| Method | Path | Body | Notes |
|---|---|---|---|
GET | /admin/agents | — | List configured agents (HTML or JSON). |
POST | /admin/agents | AgentConfig | Create a new agent. 409 if the name is taken. |
GET | /admin/agents/{name} | — | Detail (HTML or JSON). |
PUT | /admin/agents/{name} | AgentConfig | Replace the named agent. Body name must match URL. |
DELETE | /admin/agents/{name} | — | Remove the named agent. |
GET | /admin/agents/new | — | HTML form for a new agent. |
GET | /admin/agents/{name}/edit | — | HTML edit form. |
AgentConfig is the same shape used in coulisse.yaml: name, provider, model, preamble, purpose (optional), judges (list, optional), subagents (list, optional), mcp_tools (list, optional).
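As an example of the JSON representation, the sketch below prepares a POST that creates an agent. It assumes the admin scope uses Basic credentials (auth.admin.basic) and the default port; `create_agent_request` is a hypothetical helper.

```python
import base64
import json
import urllib.request


def create_agent_request(agent, username, password,
                         base_url="http://localhost:8421"):
    """Prepare a POST /admin/agents request with a JSON AgentConfig body."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/admin/agents",
        data=json.dumps(agent).encode(),
        headers={
            "Accept": "application/json",  # ask for JSON instead of HTML
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

A 409 for a taken name surfaces from `urllib.request.urlopen` as `urllib.error.HTTPError` with `code == 409`.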
Judges, experiments, providers, MCP servers
Same CRUD shape as agents — list / create / one / update / delete. Adjust the path to suit:
| Path | Body | Notes |
|---|---|---|
/admin/judges + /admin/judges/{name} | JudgeConfig | LLM-as-judge evaluators. |
/admin/experiments + /admin/experiments/{name} | ExperimentConfig | A/B routing groups. The runtime ExperimentRouter rebuilds on restart; admin display reflects the file in real time. |
/admin/providers + /admin/providers/{kind} | ProviderConfig (just api_key); POST body adds kind | Where {kind} is one of anthropic, cohere, deepseek, gemini, groq, openai. The runtime client is built at boot — restart to swap. |
/admin/mcp + /admin/mcp/{name} | McpServerConfig (transport: stdio + command/args/env, or transport: http + url); POST body adds name | Connections open at boot — restart to attach a new server. |
Whole-file config
| Method | Path | Body | Notes |
|---|---|---|---|
GET | /admin/config | — | Returns the file contents (application/yaml by default, JSON when Accept: application/json). |
PUT | /admin/config | full YAML/JSON | Replaces coulisse.yaml atomically. Validation runs before any disk write. |
GET | /admin/openapi.json | — | OpenAPI 3.1 description of every admin route, including request/response schemas. Feed it to openapi-generator or any client codegen for typed SDKs. |
Validation, hot reload, the file watcher
Every write — admin form save, JSON PUT, hand-edit in $EDITOR — flows through the same pipeline:
- The body is merged into the on-disk YAML (preserving sections this binary doesn't recognize).
- The full result is deserialized into a Config and run through cross-feature validation (provider references, judge references, experiment variants, …).
- Only on success does anything touch disk: a temp file is written and renamed atomically.
- The file watcher fires, the new config is reloaded, and the feature crates' hot-reloadable state (agent list, judges list, experiments list, settings view) atomically swaps in.
Errors return the validator's message verbatim with a 422 Unprocessable Entity (or 400 for malformed bodies). The on-disk file is unchanged when validation rejects a write.
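The validate-then-rename step follows a standard pattern for atomic file replacement, sketched here in Python. `validate` stands in for Coulisse's config validation and the YAML merge step is left out; neither is Coulisse's actual code.

```python
import os
import tempfile
from pathlib import Path


def write_config_atomically(path: Path, text: str, validate) -> None:
    """Validate text, then replace path with one atomic rename.

    If validate raises, nothing on disk changes.
    """
    validate(text)  # reject before touching disk
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=".coulisse-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # make the bytes durable before the swap
        os.replace(tmp, path)  # readers see the old file or the new one, never half
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # don't leave a half-written temp file behind
        raise
```

Writing the temp file in the target directory matters: os.replace is only atomic within a single filesystem.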
The studio UI is just one client of these endpoints — see Studio UI for what the rendered surface offers and authentication options.
Auth
By default Coulisse leaves /v1/* open. Configure the auth.proxy scope in YAML to require Basic credentials, OIDC, or self-issued API tokens for SDK clients; configure auth.admin to gate the studio. See Studio UI for the schema. Anything you don't gate is your responsibility to protect at the infrastructure layer (reverse proxy, API gateway, VPN).
YAML schema
A complete reference for every field in coulisse.yaml.
IDE autocompletion and validation
Coulisse derives a JSON Schema from the Rust types that parse the YAML, so your editor can autocomplete and lint the config live. Generate the schema next to your config:
coulisse schema > coulisse.schema.json
Then reference it from the top of coulisse.yaml with the yaml-language-server directive (recognised by the VS Code YAML extension, Helix, Neovim, Zed, JetBrains, etc.):
# yaml-language-server: $schema=./coulisse.schema.json
The schema is also shipped at the repo root as coulisse.schema.json and is the single source of truth for the field tables below — they describe the same shape in prose.
Environment variables
Any string value in coulisse.yaml can reference an environment variable with ${VAR_NAME}:
providers:
  anthropic:
    api_key: ${ANTHROPIC_API_KEY}
  openai:
    api_key: ${OPENAI_API_KEY}
Coulisse expands all ${...} placeholders before parsing the YAML, so substitution works in any field — API keys, URLs, tokens, passwords, MCP env blocks, etc.
If a referenced variable is not set in the environment, the server refuses to start and prints an error naming the missing variable. An unclosed ${ with no matching } is also rejected at startup.
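The rules above (an unset variable and an unclosed `${` both fail) can be sketched as one left-to-right scan. This is an illustration rather than Coulisse's parser. One assumption is flagged in the code: placeholders starting with `vars.` are passed through untouched so the config-variable pass in Config variables can resolve them.

```python
def expand_env(text: str, env: dict) -> str:
    """Replace each ${NAME} with env[NAME]; fail on unset names or an unclosed ${."""
    out = []
    i = 0
    while True:
        start = text.find("${", i)
        if start == -1:
            out.append(text[i:])
            return "".join(out)
        end = text.find("}", start + 2)
        if end == -1:
            raise ValueError(f"unclosed '${{' at offset {start}")
        name = text[start + 2:end]
        out.append(text[i:start])
        if name.startswith("vars."):
            # Assumption: config variables are left for the second pass.
            out.append(text[start:end + 1])
        elif name in env:
            out.append(env[name])
        else:
            raise ValueError(f"environment variable {name} is not set")
        i = end + 1
```

Usage would look like `expand_env(Path("coulisse.yaml").read_text(), dict(os.environ))` before handing the result to a YAML parser.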
Config variables
Named text snippets declared under a top-level vars: block and spliced into other string fields with ${vars.<name>}. Useful for sharing a preamble footer across agents, repeating a path, or factoring any string that would otherwise duplicate.
vars:
  team_footer: |
    Team: @pm, @coder, @qa
    Rooms: #standup, #engineering, #worklog
agents:
  - name: pm
    provider: anthropic
    model: claude-opus-4-7
    preamble: |
      You are the PM.
      ${vars.team_footer}
  - name: coder
    provider: anthropic
    model: claude-sonnet-4-6
    preamble: |
      You are the coder.
      ${vars.team_footer}
${vars.<name>} is resolved after environment-variable expansion, so a var's value can itself contain ${VAR} references. Substitution is single-pass: a substituted value containing ${vars.x} is not re-expanded. Unknown ${vars.x} references abort startup with the offending line.
Multi-line var values inherit the placeholder's leading indent — every line after the first gets prefixed with the same whitespace as the line containing ${vars.x}. This lets a snippet splice cleanly into a YAML block scalar (preamble: |) without breaking the indentation contract.
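The config-variable pass, including the indent inheritance, can be sketched with one regular-expression substitution per line. This is an assumption-laden illustration: `expand_vars` is a hypothetical name, and dropping the trailing newline that a `|` block scalar leaves on a value is an assumption made so the splice doesn't add a blank indented line.

```python
import re

PLACEHOLDER = re.compile(r"\$\{vars\.([^}]+)\}")


def expand_vars(text: str, variables: dict) -> str:
    """Splice ${vars.<name>} values in, indenting continuation lines."""
    out_lines = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        indent = line[: len(line) - len(line.lstrip(" \t"))]

        def substitute(match):
            name = match.group(1)
            if name not in variables:
                raise ValueError(f"line {lineno}: unknown ${{vars.{name}}}")
            value = variables[name].rstrip("\n")  # assumption: drop trailing newline
            # Every line after the first inherits the placeholder line's indent.
            return value.replace("\n", "\n" + indent)

        # re.sub scans the original line once, so a substituted value that
        # contains ${vars.x} is not expanded again (single pass).
        out_lines.append(PLACEHOLDER.sub(substitute, line))
    return "\n".join(out_lines)
```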
Top-level
agents: [ ... ] # required, non-empty
auth: { ... } # optional; per-scope auth for /v1/* and /admin/*
default_user_id: <string> # optional, unset by default
experiments: [ ... ] # optional; A/B test groups over agents
judges: [ ... ] # optional; empty/omitted = no evaluation
mcp: { ... } # optional
memory: { ... } # optional; defaults to sqlite history, no long-term memory
providers: { ... } # required
public_base_url: <string> # optional; used for MCP OAuth redirect URIs (default: http://localhost:{port})
server: { ... } # optional; bind/port/threads/body-limit (defaults to 0.0.0.0:8421)
sidecars: [ ... ] # optional; long-lived helper processes Coulisse spawns alongside itself
skills: { ... } # optional; skill directory (defaults to ./skills)
smoke_tests: [ ... ] # optional; synthetic-user evaluation runs
storage: { ... } # optional; file upload backend (default: fs, no quota)
telemetry: { ... } # optional; fmt + sqlite on by default, OTLP opt-in
triggers: [ ... ] # optional; cron / webhook / boot
vars: { name: value, ... } # optional; named snippets referenced via ${vars.<name>}
auth
- Type: object
- Optional. Omit to leave both surfaces unauthenticated (fine for local dev, never for anything exposed beyond loopback).
Two independent scopes:
- auth.proxy guards the OpenAI-compatible /v1/* surface that SDK clients call.
- auth.admin guards the /admin/* surface (the studio UI).
Each scope is itself optional and accepts the same shape: exactly one of basic, oidc, or tokens when present (tokens on the proxy scope only). They are mutually exclusive within a scope — the server rejects a scope block that has more than one or none. The two scopes are independent, so you can enable Basic on one and OIDC on the other.
auth.<scope>.basic
Static HTTP Basic credentials. Best for local dev or a single-operator deployment.
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
password | string | yes | — | Non-empty. Rotate if suspected leaked — there's no token revocation. |
username | string | no | admin | Non-empty when set. |
auth:
  admin:
    basic:
      password: choose-something-strong
      username: admin
auth.<scope>.oidc
Authorization-code-with-PKCE login against an OIDC-compliant IdP (Authentik, Keycloak, Auth0, Google, etc.). Access control is delegated to the IdP's application policy — Coulisse accepts any successfully authenticated user.
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
client_id | string | yes | — | Must match the client registered at the IdP. |
client_secret | string | no | — | Required for confidential clients (Authentik's default); omit for public clients using PKCE only. |
issuer_url | string | yes | — | IdP issuer. For Authentik: https://<host>/application/o/<app-slug>/. |
redirect_url | string | yes | — | Public base URL inside the protected scope. Must be registered as the redirect URI at the IdP. axum-oidc allows every subpath of this URL as a valid redirect. |
scopes | list<string> | no | [email, profile] | Extra OAuth2 scopes. openid is added automatically. |
auth:
  admin:
    oidc:
      issuer_url: https://authentik.example.com/application/o/coulisse/
      client_id: coulisse-admin
      client_secret: <secret>
      redirect_url: http://localhost:8421/admin/
auth.proxy.identity
How the per-user identity that partitions memory, recall, MCP sessions, and rate limits is derived. Only valid on the proxy scope — the admin surface has no per-user partitioning, so from_credential there is rejected at startup.
| Value | Behavior |
|---|---|
from_request | Default. Trust the safety_identifier (or deprecated user) field in the request body. Correct for single-user setups and trusted first-party backends that set the identifier on behalf of their own authenticated users. |
from_credential | Derive the identity from the authenticated principal — the Basic username or the OIDC sub claim. A request body claiming a different safety_identifier is rejected with 403. Use this for adversarial multi-tenant serving, where clients cannot be trusted to declare their own identity. |
from_credential requires auth.proxy to declare basic or oidc (you can't bind to a credential that isn't checked), and is mutually exclusive with default_user_id — a shared default bucket would bypass the binding. With Basic, every distinct user needs distinct credentials, since the username is the identity; OIDC gives each user a distinct sub automatically.
auth:
  proxy:
    oidc:
      issuer_url: https://authentik.example.com/application/o/coulisse/
      client_id: coulisse-proxy
      client_secret: <secret>
      redirect_url: http://localhost:8421/v1/
    identity: from_credential # user = the OIDC subject, not the request body
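The identity rules, including the default_user_id fallback, reduce to a small decision function. This is a sketch under assumptions: the principal is taken as already verified by the Basic or OIDC layer, and the exception names are hypothetical stand-ins for the 400 and 403 responses.

```python
class BadRequest(Exception):
    """Stands in for a 400 response."""


class Forbidden(Exception):
    """Stands in for a 403 response."""


def resolve_user(identity, body, principal=None, default_user_id=None):
    """Pick the user id that partitions memory, recall, and rate limits."""
    # `user` is the deprecated OpenAI field, accepted as a fallback.
    claimed = body.get("safety_identifier") or body.get("user")
    if identity == "from_credential":
        # principal: the Basic username or the OIDC `sub` claim.
        if claimed is not None and claimed != principal:
            raise Forbidden("safety_identifier does not match the credential")
        return principal
    if identity == "from_request":
        if claimed:
            return claimed
        if default_user_id:
            return default_user_id
        raise BadRequest("safety_identifier is required")
    raise ValueError(f"unknown identity mode: {identity}")
```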
auth.proxy.tokens
Self-issued API tokens — Coulisse mints sk-coulisse-… bearer keys, stores only their hash, and gates /v1/* on them. Set the (currently empty) block to turn the scheme on; tokens are then created at runtime, never in YAML:
auth:
  proxy:
    tokens: {} # enable bearer-token auth on /v1/*
Clients authenticate exactly like the OpenAI API: Authorization: Bearer sk-coulisse-…. Each token binds to a principal (the user id that partitions memory, recall, and rate limits), so token auth always implies credential-bound identity — a request body claiming a different safety_identifier is rejected with 403, and default_user_id does not apply.
Mint, monitor spend on, and revoke tokens from the studio's Tokens page or the coulisse token CLI. Each token carries a budget — unlimited, a lifetime cap, or a per-calendar-month cap; a request that would exceed it is rejected with 429 insufficient_quota before any provider call. See API tokens.
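Storing only a hash lets the server recognize a presented key without being able to recover it from the database. The sketch below illustrates that idea and the pre-call budget check; SHA-256, the key length, the in-memory store, and the per-request cost estimate are all assumptions for illustration, not Coulisse's storage format or accounting.

```python
import hashlib
import hmac
import secrets


def mint_token():
    """Return (secret shown once to the user, digest to store)."""
    secret = "sk-coulisse-" + secrets.token_urlsafe(32)
    return secret, hashlib.sha256(secret.encode()).hexdigest()


def authenticate(presented, stored):
    """Map a presented bearer key to its principal, or None (a 401)."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    for stored_digest, principal in stored.items():
        if hmac.compare_digest(digest, stored_digest):  # constant-time compare
            return principal
    return None


def within_budget(spent_usd, limit_usd, estimated_usd):
    """False means reject with 429 before calling the provider.

    limit_usd=None models an unlimited token.
    """
    return limit_usd is None or spent_usd + estimated_usd <= limit_usd
```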
default_user_id
- Type: string
- Default: unset
- Purpose: fallback identifier for requests that don't supply safety_identifier (or the deprecated user).
Leave it unset for multi-tenant deployments — unidentified requests will be rejected. Set it to something like "main" for local or single-user setups so memory still works whether or not the client bothers to send an id. See User identification.
providers
- Type: map of provider_kind → provider_config
- Required. At least one provider must be declared.
Supported keys
anthropic, cohere, deepseek, gemini, groq, openai.
Per-provider fields
| Field | Type | Required | Notes |
|---|---|---|---|
api_key | string | yes | Provider API key. |
providers:
  anthropic:
    api_key: ${ANTHROPIC_API_KEY}
  openai:
    api_key: ${OPENAI_API_KEY}
mcp
- Type: map of server_name → server_config
- Optional. Omit if you don't use tools.
Server names are arbitrary — they're what agents refer to under mcp_tools.
A server is either remote (declare a url:) or local (declare a
command:). The transport is inferred — a url: is HTTP, or SSE if the path
contains /sse; a command: is stdio — but you can pin it with an explicit
transport:.
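The inference rule reads naturally as a small function. A sketch of the documented behavior: `infer_transport` is a hypothetical name, and it checks the URL's path (not the host) for /sse, as the text describes.

```python
from urllib.parse import urlparse


def infer_transport(server: dict) -> str:
    """Return http, sse, or stdio for one MCP server entry."""
    if "transport" in server:
        return server["transport"]  # an explicit setting always wins
    if "url" in server:
        path = urlparse(server["url"]).path
        return "sse" if "/sse" in path else "http"
    if "command" in server:
        return "stdio"
    raise ValueError("an MCP server needs either url: or command:")
```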
Common fields
| Field | Type | Required | Notes |
|---|---|---|---|
transport | enum | no | http, sse, or stdio. Inferred from url/command when omitted; set it to force a transport (e.g. sse on a URL without /sse). |
Remote (url)
| Field | Type | Required | Notes |
|---|---|---|---|
url | string | yes | MCP endpoint. HTTP, or SSE when the path contains /sse. |
oauth | varies | no | Per-user OAuth is on by default for URL-based servers (discover mode). Set false to disable on a no-auth server, { scopes: [...] } to override scopes, or a full { mode: static, ... } block for providers without Dynamic Client Registration. See Per-user OAuth for MCP servers. |
Local (command)
| Field | Type | Required | Notes |
|---|---|---|---|
command | string | yes | Executable to run (stdio transport). |
args | list<str> | no | Command-line arguments. |
env | map<str,str> | no | Environment variables for the child. |
Examples
mcp:
  hello:
    command: uvx
    args: [--from, git+https://..., hello-mcp-server]
  calculator:
    url: http://localhost:8080
    oauth: false # no-auth server, skip the connect flow
  todoist:
    url: https://ai.todoist.net/mcp # per-user OAuth implied
memory
- Type: object
- Optional. Omit for defaults: SQLite at
.coulisse/coulisse-memory.db, history-only (no long-term user state).
See Memory configuration for the full walkthrough and examples.
Sub-fields
The database always lives at .coulisse/coulisse-memory.db; its location is
not configurable. The only sub-field is user_state.
| Field | Type | Required | Default |
|---|---|---|---|
user_state | bool or object | no | false |
user_state.embed_with | object | no | auto-picked from providers: |
user_state.learn_from | object | no | auto-picked from providers: |
user_state.dedup_threshold | float | no | 0.9 |
user_state.max_facts_per_turn | int | no | 5 |
user_state.recall_k | int | no | 5 |
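How dedup_threshold, max_facts_per_turn, and recall_k plausibly interact can be sketched with cosine similarity over embedding vectors. The text doesn't specify the similarity measure or where the per-turn cap applies, so both are assumptions here, and the two-dimensional vectors stand in for real embedder output.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def store_facts(stored, candidates, dedup_threshold=0.9, max_facts_per_turn=5):
    """Add (text, vector) candidates to stored, skipping near-duplicates.

    Assumption: the per-turn cap applies to the extracted candidates.
    """
    added = 0
    for text, vec in candidates[:max_facts_per_turn]:
        # Assumption: similarity >= dedup_threshold means "same fact".
        if any(cosine(vec, old) >= dedup_threshold for _, old in stored):
            continue
        stored.append((text, vec))
        added += 1
    return added


def recall(stored, query_vec, recall_k=5):
    """Return the recall_k stored facts most similar to the query."""
    ranked = sorted(stored, key=lambda fact: cosine(query_vec, fact[1]), reverse=True)
    return [text for text, _ in ranked[:recall_k]]
```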
agents
- Type: list of agent configs
- Required. At least one agent must be defined.
Per-agent fields
| Field | Type | Required | Notes |
|---|---|---|---|
name | string | yes | Unique agent identifier; clients pass this as model. |
provider | string | yes | Key under providers. |
model | string | yes | Upstream model identifier. |
preamble | string | no | System prompt. Default: empty. |
judges | list<string> | no | Names of judges (from top-level judges:) that evaluate this agent's replies. Empty = no evaluation. |
max_turns | integer | no | Maximum tool-calling rounds per turn. Default: 8. Raise for agents that chain many tool calls (e.g. a coder that reads files, edits, and dispatches to QA in one go). |
mcp_tools | list<mcp_tool_access> | no | Tools this agent may use. |
purpose | string | no | Tool description when this agent is exposed via another agent's subagents. Omit for standalone agents; add a concrete one-line description when this agent is meant to be called as a specialist. |
skills | list<string> | no | Names of skills (from the top-level skills: directory) this agent may use. Each becomes a tool advertised by its description; calling it returns the skill's instructions. Unknown names are ignored. See Skills. |
subagents | list<string> | no | Names of other agents exposed as callable tools. Each entry must refer to another entry under agents. Self-reference and duplicates are rejected at startup. |
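The startup checks on subagents can be pictured as a pass over the agent list. This is a sketch of the documented rules (unknown names, self-reference, duplicates); `validate_subagents` is hypothetical, and the optional experiment names reflect that experiments may also be exposed through subagents:.

```python
def validate_subagents(agents, experiment_names=()):
    """Collect startup errors for subagent references (empty list = valid)."""
    names = {agent["name"] for agent in agents} | set(experiment_names)
    errors = []
    for agent in agents:
        seen = set()
        for sub in agent.get("subagents", []):
            if sub == agent["name"]:
                errors.append(f"{agent['name']}: cannot list itself as a subagent")
            elif sub not in names:
                errors.append(f"{agent['name']}: unknown subagent '{sub}'")
            elif sub in seen:
                errors.append(f"{agent['name']}: duplicate subagent '{sub}'")
            seen.add(sub)
    return errors
```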
mcp_tools entry
| Field | Type | Required | Notes |
|---|---|---|---|
server | string | yes | Key under mcp. |
only | list<str> | no | Allowed tool names. Omit for full access. |
Complete agent example
agents:
  - name: code-reviewer
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: |
      You are a thorough code reviewer.
    mcp_tools:
      - server: filesystem
        only:
          - read_file
      - server: hello
Subagent example
agents:
  - name: resume_critic
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    purpose: Critique and rewrite a resume for a target role.
    preamble: |
      Given a resume and a target role, return a revised resume
      and a bullet list of the biggest gaps.
  - name: coach
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    subagents: [resume_critic]
    preamble: |
      Delegate resume work to `resume_critic` when relevant.
See Multi-agent routing for the full subagent walkthrough.
experiments
- Type: list of experiment configs
- Optional. Omit (or leave empty) to skip A/B testing.
An experiment wraps two or more agents under one addressable name. Clients send the experiment's name in the model field and the router picks a variant per request. Experiment names share the agent namespace — collisions are rejected at startup.
See Experiments for the end-to-end walkthrough.
Per-experiment fields
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
bandit_window_seconds | int | no (bandit) | 604800 (7 d) | Bandit-only. Maximum age of scores included in mean-arm computations. |
epsilon | float | no (bandit) | 0.1 | Bandit-only. Probability in [0.0, 1.0] of routing to a random arm instead of the leader. |
metric | string | yes (bandit) | — | Bandit-only. judge.criterion to optimise. The judge must declare the criterion in its rubrics, and every variant must opt into the judge. |
min_samples | int | no (bandit) | 30 | Bandit-only. Each arm must accumulate this many scores before exploitation is allowed. |
name | string | yes | — | Addressable name; must not collide with any agent name. |
primary | string | yes (shadow) | — | Shadow-only. Variant agent that serves the user. Must be one of variants. |
purpose | string | no | — | Tool description when the experiment is exposed via another agent's subagents:. |
sampling_rate | float | no (shadow) | 1.0 | Shadow-only. Probability in [0.0, 1.0] that a turn also runs the non-primary variants in the background. |
sticky_by_user | bool | no | true | When true, the same user always lands on the same variant (deterministic hash, no DB writes). |
strategy | enum | yes | — | split, shadow, or bandit. |
variants | list<variant> | yes | — | Non-empty. Each entry references an agent. |
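Together, the bandit fields describe an epsilon-greedy loop. The Python sketch below is minimal and makes two assumptions: exploration is uniform-random across arms, and an arm's value is the plain mean of its in-window scores. `pick_arm` and the `(timestamp, score)` tuples are hypothetical, not Coulisse internals.

```python
import random
import time
from typing import Optional

def pick_arm(
    scores: dict[str, list[tuple[float, float]]],  # arm -> [(unix_timestamp, judge_score)]
    epsilon: float = 0.1,
    min_samples: int = 30,
    window_seconds: float = 604_800,
    now: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Epsilon-greedy choice between experiment arms (variants)."""
    now = time.time() if now is None else now
    rng = rng or random.Random()
    arms = sorted(scores)
    # Only scores younger than the window count toward an arm's mean.
    recent = {
        arm: [score for ts, score in scores[arm] if now - ts <= window_seconds]
        for arm in arms
    }
    # Explore while any arm is under-sampled, and with probability epsilon afterwards.
    under_sampled = any(len(v) < max(min_samples, 1) for v in recent.values())
    if under_sampled or rng.random() < epsilon:
        return rng.choice(arms)
    # Exploit: send the request to the arm with the best mean score.
    return max(arms, key=lambda arm: sum(recent[arm]) / len(recent[arm]))
```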
variants entry
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
agent | string | yes | — | Name of an agent declared under top-level agents:. Variants must reference concrete agents — nesting an experiment is rejected. |
weight | float | no | 1.0 | Strictly positive. Normalised against the sum of all variant weights. |
Example
```yaml
agents:
  - name: assistant-sonnet
    provider: anthropic
    model: claude-sonnet-4-5-20250929
  - name: assistant-gpt
    provider: openai
    model: gpt-4o
experiments:
  - name: assistant
    strategy: split
    variants:
      - agent: assistant-sonnet
        weight: 0.5
      - agent: assistant-gpt
        weight: 0.5
```
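With `strategy: split`, each variant gets a share equal to its weight divided by the sum of all weights. With `sticky_by_user: true`, a hash of the user picks the variant deterministically, so no database writes are needed. The Python sketch below illustrates this under one assumption: hashing the experiment name plus user id with SHA-256 is a stand-in, not Coulisse's actual hashing scheme. `pick_variant` is hypothetical.

```python
import hashlib
import random
from typing import Optional

def pick_variant(
    experiment: str,
    variants: list[tuple[str, float]],  # (agent name, strictly positive weight)
    user_id: Optional[str] = None,
    sticky_by_user: bool = True,
) -> str:
    total = sum(weight for _, weight in variants)
    if sticky_by_user and user_id is not None:
        # Deterministic point in [0, 1]: same user + experiment -> same variant, nothing stored.
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
        point = int.from_bytes(digest[:8], "big") / 2**64
    else:
        point = random.random()
    # Walk the cumulative normalised weights until the point falls inside a bucket.
    cumulative = 0.0
    for agent, weight in variants:
        cumulative += weight / total
        if point < cumulative:
            return agent
    return variants[-1][0]  # guard against floating-point round-off at the top end
```

With the 0.5 / 0.5 weights from the example, roughly half of all users land on each agent, and any given user keeps getting the same one.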
judges
- Type: list of judge configs
- Optional. Omit (or leave empty) for no automatic evaluation.
Judges are background LLM-as-judge evaluators. An agent opts in by listing judge names in its own judges: field. See LLM-as-judge evaluation for the full walkthrough.
Per-judge fields
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
name | string | yes | — | Unique judge identifier; agents reference it in their own judges: list. |
provider | string | yes | — | Must match a key under providers. |
model | string | yes | — | Upstream model identifier for the judge call. |
rubrics | map<string,string> | yes | — | criterion: short description of what to assess. One score row per criterion per scored turn. Must declare at least one entry. |
sampling_rate | float | no | 1.0 | In [0.0, 1.0]. 1.0 = every turn, 0.1 ≈ 10%, 0.0 = never. |
Rubric descriptions should say what to evaluate; don't include scale, JSON, or format instructions. Coulisse enforces the output shape internally: an integer score from 0 to 10 per criterion, with a one-sentence justification.
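Two mechanics are at work here: sampling decides whether a turn is judged at all, and the reply is validated into one score row per criterion. Coulisse's internal reply format isn't documented here. The Python sketch below therefore assumes a hypothetical JSON shape, `{"<criterion>": {"score": int, "reasoning": str}}`, and both helper names are hypothetical.

```python
import json
import random

def should_judge(sampling_rate: float, rng: random.Random) -> bool:
    """Decide whether this turn gets scored: 1.0 always, 0.0 never, 0.1 ~ 1 in 10."""
    return rng.random() < sampling_rate

def parse_judge_reply(raw: str, rubrics: dict[str, str]) -> dict[str, tuple[int, str]]:
    """Check a judge reply against the rubrics; return one (score, reasoning) per criterion.

    Assumes a HYPOTHETICAL reply shape: {"<criterion>": {"score": int, "reasoning": str}}.
    """
    data = json.loads(raw)
    rows = {}
    for criterion in rubrics:
        entry = data[criterion]  # KeyError if the judge skipped a criterion
        score = entry["score"]
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
            raise ValueError(f"{criterion}: expected an integer 0-10, got {score!r}")
        rows[criterion] = (score, str(entry["reasoning"]))
    return rows
```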
Example
```yaml
judges:
  - name: quality
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 1.0
    rubrics:
      accuracy: Factual accuracy. Flag hallucinations.
      helpfulness: Whether the assistant answered the user's question.
      tone: Politeness and tone.
```
server
- Type: object
- Optional. Omit the whole block for the defaults below.
- Purpose: how the process binds and listens.
| Field | Type | Default | Purpose |
|---|---|---|---|
bind | string (IP) | 0.0.0.0 | Interface to bind. Set 127.0.0.1 to accept loopback only (behind a reverse proxy or tunnel). |
port | integer (u16) | 8421 | TCP port. Give each coulisse.yaml its own port when running multiple instances on one machine. |
worker_threads | integer | CPU count | Number of tokio worker threads. Read once at startup; changing it requires a restart. |
max_body_bytes | integer | 2 MiB (axum default) | Largest accepted request body, in bytes. Raise for big attachment uploads; lower to harden a public endpoint. |
```yaml
server:
  bind: 0.0.0.0
  port: 8421
  worker_threads: 4
  max_body_bytes: 8388608 # 8 MiB
```
The `port` field moved here from the top level in this release. A bare top-level `port:` is no longer read; nest it under `server:`.
skills
- Type: object
- Optional. Omit the whole block to scan the default `./skills` directory.
- Purpose: where reusable skill bundles are loaded from.
| Field | Type | Default | Purpose |
|---|---|---|---|
dir | string | ./skills | Directory holding one subdirectory per skill, each with a SKILL.md. A missing directory yields no skills (not an error). |
```yaml
skills:
  dir: ./skills
```
Each subdirectory with a SKILL.md becomes a skill; agents opt in by listing skill names under their own skills: array. See Skills for the SKILL.md format, bundled resource files, and the skill_file tool.
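The discovery rule ("each subdirectory with a SKILL.md becomes a skill; a missing directory yields no skills") fits in a few lines. This is a minimal Python sketch that assumes a skill's name is its subdirectory name; the real loader may take the name from SKILL.md itself (see Skills). `discover_skills` is hypothetical.

```python
from pathlib import Path

def discover_skills(skills_dir: str = "./skills") -> dict[str, Path]:
    """Map skill name (assumed: subdirectory name) -> path of its SKILL.md."""
    root = Path(skills_dir)
    if not root.is_dir():
        # A missing directory yields no skills rather than an error.
        return {}
    return {
        child.name: child / "SKILL.md"
        for child in sorted(root.iterdir())
        # Subdirectories without a SKILL.md are ignored.
        if child.is_dir() and (child / "SKILL.md").is_file()
    }
```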
smoke_tests
- Type: list of smoke test configs
- Optional. Omit (or leave empty) for no synthetic-user runs.
Each entry pairs a persona (an LLM that role-plays the user) with a target agent or experiment. Runs are triggered from the studio at /admin/smoke/<name>. See Smoke tests for the workflow.
Per-test fields
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
name | string | yes | — | Unique within smoke_tests. |
target | string | yes | — | Agent or experiment name. Resolved per run via the experiment router. |
persona | object | yes | — | provider, model, preamble for the role-played user. |
initial_message | string | no | — | Hard-coded first persona turn. Omit to let the persona open the conversation. |
stop_marker | string | no | — | Substring that ends the run when emitted by either side. |
max_turns | integer | no | 10 | Cap on persona-then-agent pairs per run. |
repetitions | integer | no | 1 | Independent runs launched per click. Each gets a fresh synthetic user id. |
Example
```yaml
smoke_tests:
  - name: jobseeker_basic
    target: tremplin
    persona:
      provider: anthropic
      model: claude-haiku-4-5-20251001
      preamble: |
        You are a 28-year-old looking for a developer job in Paris.
        Reply like a real human; finish with "[FIN]" once satisfied.
    initial_message: "Hi, I'm looking for work."
    stop_marker: "[FIN]"
    max_turns: 10
    repetitions: 5
```
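The per-test fields describe an alternating loop. The Python sketch below is minimal and hypothetical: `persona` and `agent` are stand-ins for the two LLM calls, and the run is assumed to stop right after whichever message contains `stop_marker`.

```python
from typing import Callable, Optional

Turn = tuple[str, str]  # (speaker, text)

def run_smoke_test(
    persona: Callable[[list[Turn]], str],  # hypothetical: role-played user LLM
    agent: Callable[[list[Turn]], str],    # hypothetical: target agent or experiment
    initial_message: Optional[str] = None,
    stop_marker: Optional[str] = None,
    max_turns: int = 10,
) -> list[Turn]:
    """Alternate persona and agent messages until stop_marker or max_turns pairs."""
    transcript: list[Turn] = []
    for turn in range(max_turns):
        # A hard-coded opener replaces the persona's first message if provided.
        if turn == 0 and initial_message is not None:
            user_text = initial_message
        else:
            user_text = persona(transcript)
        transcript.append(("persona", user_text))
        if stop_marker and stop_marker in user_text:
            break
        reply = agent(transcript)
        transcript.append(("agent", reply))
        if stop_marker and stop_marker in reply:
            break
    return transcript
```

`repetitions: 5` would correspond to calling this five times, each run under a fresh synthetic user id.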
telemetry
- Type: object
- Optional. If omitted, Coulisse writes fmt logs to stderr at `info` level and keeps the SQLite mirror that drives the studio UI; no external traces are exported.
The block has three sub-sections — fmt, sqlite, and otlp — each independently toggleable. See Telemetry configuration for the full schema and Telemetry & OpenTelemetry for span semantics and OTLP backend integration.
```yaml
telemetry:
  fmt:
    enabled: true # default
  sqlite:
    enabled: true # default; powers the studio UI
  otlp: # absent = no external traces
    endpoint: "http://localhost:4317"
    protocol: grpc # or http_binary
    service_name: coulisse
    headers:
      authorization: "Bearer ${OTEL_API_KEY}"
```
Validation
On startup, Coulisse checks:
- All `${VAR_NAME}` placeholders resolve to set environment variables.
- Each present `auth` scope (`proxy`, `admin`) declares exactly one of `basic` or `oidc`.
- `auth.<scope>.basic.password` and `auth.<scope>.basic.username` are non-empty.
- `auth.<scope>.oidc.client_id`, `issuer_url`, and `redirect_url` are non-empty.
- There is at least one agent.
- Agent names are unique.
- Every agent's `provider` is configured.
- Every referenced MCP server is configured.
- Every name in `subagents` refers to a defined agent or experiment.
- No agent lists itself under `subagents`.
- `subagents` entries are unique within an agent (no duplicates).
- Experiment names are unique and do not collide with any agent name.
- Each experiment declares at least one variant.
- Each variant references a defined agent and has a strictly positive `weight`.
- Variant agents within an experiment are unique.
- Strategy-specific fields are only set on the matching strategy (e.g. `primary` only on `shadow`, `metric` only on `bandit`).
- For `shadow`: `primary` is set and matches one of the variants; `sampling_rate` is in `[0.0, 1.0]`.
- For `bandit`: `metric` is `judge.criterion`; the judge exists, declares the criterion in its rubrics, and every variant opts into the judge; `epsilon` is in `[0.0, 1.0]`.
- Every referenced judge exists.
- Judge names are unique.
- Every judge's `provider` is configured and supported.
- Every judge has at least one rubric.
- Every judge's `sampling_rate` is in `[0.0, 1.0]`.
Any violation fails fast with an error message that names the offending agent or judge and field.
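The first check, placeholder resolution, can be sketched as a substitution that fails fast and lists every unset variable at once. This Python sketch assumes the syntax is exactly `${NAME}` with shell-style names and that substitution happens on raw config text. Both assumptions may differ from Coulisse's parser, and `expand_placeholders` is hypothetical.

```python
import os
import re
from collections.abc import Mapping
from typing import Optional

# Shell-style names inside ${...}; the exact accepted syntax is an assumption.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_placeholders(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Substitute every ${VAR_NAME}; fail fast, listing all unset variables at once."""
    env = os.environ if env is None else env
    missing = sorted({m.group(1) for m in PLACEHOLDER.finditer(text)} - set(env))
    if missing:
        raise ValueError(f"unset environment variables: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda m: env[m.group(1)], text)
```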
Releasing
Coulisse follows Semantic Versioning. Pre-1.0, minor bumps may include breaking changes to the YAML schema, HTTP surface, or CLI; patch bumps will not.
Cutting a release
1. Bump the version in the workspace `Cargo.toml`:

   ```toml
   [workspace.package]
   version = "0.2.0"
   ```

   All workspace crates inherit this via `version.workspace = true`, so this is the only place to edit.

2. Update `CHANGELOG.md`: rename the `## [Unreleased]` section to `## [0.2.0] - YYYY-MM-DD` and start a fresh `## [Unreleased]` block above it.

3. Commit, tag, push:

   ```sh
   git commit -am "Release v0.2.0"
   git tag v0.2.0
   git push && git push --tags
   ```

The `v*.*.*` tag triggers two workflows:

- `release.yml` (cargo-dist): builds binaries and installers for macOS (x86 + ARM), Linux GNU (x86 + ARM), and Windows MSVC, then publishes them as a GitHub Release with auto-generated notes.
- `docker.yml`: builds a multi-arch image and pushes to `ghcr.io/almaju/coulisse` tagged `latest`, `0.2`, and `0.2.0`.
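The image-tag mapping (`v0.2.0` becomes `latest`, `0.2`, and `0.2.0`) can be expressed as a function. This Python sketch only restates that example and is not the workflow's actual code. It assumes plain `vMAJOR.MINOR.PATCH` tags with no pre-release suffix, and `image_tags` is hypothetical.

```python
import re

def image_tags(git_tag: str) -> list[str]:
    """Derive container image tags from a release tag such as v0.2.0."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", git_tag)
    if match is None:
        raise ValueError(f"not a plain vMAJOR.MINOR.PATCH tag: {git_tag!r}")
    major, minor, patch = match.groups()
    # Floating tags (latest, MAJOR.MINOR) move forward; the full version is pinned.
    return ["latest", f"{major}.{minor}", f"{major}.{minor}.{patch}"]
```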
Hotfixes
For a patch release on the latest minor, branch from the previous tag (for example `v0.2.0`), fix forward, then tag `v0.2.1` from that branch. The same tag-triggered workflows handle it.
Roadmap
What's in Coulisse today, and what's coming.
Working today
- Multi-agent routing via the `model` field.
- Agents as tools: expose one agent to another under `subagents:` with a `purpose:` description. Nested invocations are bounded by a depth cap.
- Skills: reusable instruction bundles (Claude Code / Codex style). Drop a `SKILL.md` folder under `./skills`; agents opt in by name and get one progressive-disclosure tool per skill, plus a `skill_file` reader for bundled resources.
- Per-user conversation history with isolation.
- Long-term memory with semantic recall, persistent via SQLite and backed by a real embedder (OpenAI or Voyage AI; `hash` fallback for offline dev).
- Long-term user state: opt-in `user_state: true` enables a background extractor that pulls durable facts from each exchange and deduplicates them before storing. The embedder and extraction model are auto-derived from your configured providers.
- Multi-backend support (Anthropic, OpenAI, Gemini, Cohere, Deepseek, Groq).
- OpenAI-compatible HTTP API (`/v1/chat/completions`, `/v1/models`).
- Studio UI at `/admin/`: browse conversations, memories, and judge scores; edit agents, judges, experiments, and smoke tests live; watch the real-time task board at `/admin/live`.
- LLM-as-judge evaluation: background scoring of agent replies against YAML-defined rubrics, with per-judge sampling and per-user persistence.
- Experiments (A/B testing): wrap multiple agents under one addressable name and route traffic between them with sticky-by-user defaults. Three strategies: `split` (weighted random), `shadow` (primary serves the user, others run in the background and are scored), and `bandit` (epsilon-greedy on a single judge criterion).
- Streaming responses over SSE (`stream: true`, with `stream_options.include_usage`).
- MCP tool integration over stdio and HTTP, with per-agent filtering.
- Per-user OAuth 2.0 for MCP servers (token vault, connect-link flow, per-user session pool).
- Per-user token rate limiting (hour / day / month).
- Triggers: start agents on a schedule (`cron`), via HTTP POST (`webhook`), or on server boot (`boot`).
- Async task queue: `dispatch_task` enqueues background work; `tasks_status` inspects the queue from chat; `/admin/live` shows it in real time.
- Sidecars: long-lived helper processes (bridges, exporters) spawned and supervised by Coulisse.
- Config variables (`vars:`): named string snippets shared across agent preambles.
- JSON Schema generation (`coulisse schema`) for IDE autocompletion and live validation.
- YAML-driven config with startup validation.
- Docker image with a volume-mounted SQLite store.
- Credential-bound identity: `auth.proxy.identity: from_credential` derives the per-user identity from the authenticated principal (Basic username or OIDC `sub`) instead of trusting the request body, and rejects a mismatched `safety_identifier`. This makes adversarial multi-tenant serving safe; it is mutually exclusive with `default_user_id`. See User identification.
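The core of credential-bound identity is that the authenticated principal decides who the user is, and the request body can only agree with it. This minimal Python sketch covers only the `from_credential` mode. `resolve_user_id` and `IdentityError` are hypothetical, and the mutual exclusion with `default_user_id` is checked here only to keep the sketch self-contained; a real server would reject that combination when loading the config.

```python
from typing import Optional

class IdentityError(Exception):
    """Raised when a request's claimed identity cannot be trusted."""

def resolve_user_id(
    principal: Optional[str],          # Basic username or OIDC `sub`; None if unauthenticated
    safety_identifier: Optional[str],  # identifier the client sent in the request body, if any
    default_user_id: Optional[str] = None,
) -> str:
    if default_user_id is not None:
        raise IdentityError("from_credential cannot be combined with default_user_id")
    if principal is None:
        raise IdentityError("request is not authenticated")
    if safety_identifier is not None and safety_identifier != principal:
        # A client cannot claim to be someone else: the body must agree with the credential.
        raise IdentityError("safety_identifier does not match the authenticated principal")
    # The credential, not the body, decides whose memory and quotas this request uses.
    return principal
```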
Planned
Durable rate-limit state
Current rate-limit counters live in memory — they reset on restart and don't span multiple instances. A durable, shared backend is planned so quotas survive reboots and horizontal scaling.
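The restart and multi-instance limits follow directly from keeping counters in a process-local map. This Python sketch shows the idea under stated assumptions, none of which are documented Coulisse behavior. Windows are fixed buckets (timestamp divided by window length), a month is approximated as 30 days, and usage is checked and recorded in one step. `TokenQuota` is hypothetical, and it never evicts expired buckets.

```python
import time
from collections import defaultdict
from typing import Optional

# Fixed window lengths in seconds; "month" approximated as 30 days (assumption).
WINDOWS = {"hour": 3_600, "day": 86_400, "month": 30 * 86_400}

class TokenQuota:
    """In-memory per-user token counters; everything is lost when the process exits."""

    def __init__(self, limits: dict[str, int]):
        self.limits = limits  # e.g. {"hour": 50_000, "day": 200_000}
        # (user, window name, window index) -> tokens used in that window
        self.used: dict[tuple[str, str, int], int] = defaultdict(int)

    def try_consume(self, user: str, tokens: int, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        keys = [(user, name, int(now // WINDOWS[name])) for name in self.limits]
        # Reject if any window would exceed its limit; only then record usage.
        for key in keys:
            if self.used[key] + tokens > self.limits[key[1]]:
                return False
        for key in keys:
            self.used[key] += tokens
        return True
```

A second instance would have its own `used` map, which is why quotas don't span instances today.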
Vector index for large memory stores
Recall currently does a linear cosine scan over all memories for the user. Fine at hundreds-to-low-thousands of memories per user, but a vector index will be needed if per-user memory counts grow into the tens of thousands.
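A linear scan like this computes cosine similarity against every stored vector and then keeps the top k, so its cost grows with the number of memories times the embedding dimension. This minimal Python sketch uses plain lists in place of whatever vector storage Coulisse uses, and `recall` is hypothetical.

```python
import heapq
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # zero vectors match nothing

def recall(query: list[float], memories: list[tuple[str, list[float]]], k: int = 5) -> list[str]:
    """Score every stored memory against the query (O(n * d)) and keep the best k."""
    scored = ((cosine(query, vec), text) for text, vec in memories)
    return [text for _, text in heapq.nlargest(k, scored)]
```

A vector index replaces the full scan with an approximate nearest-neighbour lookup, which is what the roadmap item above anticipates.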
Per-agent memory overrides
Today the memory: block is global. A future revision will allow per-agent scoping (different embedders or budgets per agent) for cases where one agent handles long-form research and another handles short user chat.
This list reflects what's on deck at the time of writing — check the repository for the current state.