Coulisse

One YAML file. An OpenAI-compatible server with memory, tools, and multi-backend routing.

Coulisse is a single Rust binary that reads a coulisse.yaml file and spins up an OpenAI-compatible HTTP server. You point your existing tools, SDKs, and projects at it like any other OpenAI endpoint — and everything configurable lives in that one YAML file.

Why Coulisse?

Every multi-agent project ends up re-implementing the same plumbing:

  • Per-user conversation memory
  • Routing between model providers
  • Rate limits and retries
  • Tool integration
  • Multiple agents with different system prompts

Coulisse collapses this plumbing into one configurable server. You describe the setup in YAML and pilot the whole thing from there, instead of writing glue code for each prototype.

How it works

┌──────────────────┐        ┌──────────────────┐        ┌──────────────────┐
│  Your SDK / app  │───────▶│     Coulisse     │───────▶│   Anthropic      │
│  (OpenAI client) │        │                  │        │   OpenAI         │
└──────────────────┘        │   coulisse.yaml  │        │   Gemini …       │
                            │                  │        └──────────────────┘
                            │   + memory       │
                            │   + MCP tools    │        ┌──────────────────┐
                            │   + per-user     │───────▶│   MCP servers    │
                            └──────────────────┘        └──────────────────┘
  1. Your application talks to Coulisse using any OpenAI-compatible SDK.
  2. Coulisse picks the agent you asked for (by model name), assembles the user's memory, and calls the right backend.
  3. The response flows back — and the exchange is saved to that user's memory for next time.

What's in the box

Feature                           Status
Multi-agent routing               ✅ Working
Per-user memory                   ✅ Persistent (SQLite) with semantic recall
Real embedders                    ✅ OpenAI + Voyage (hash fallback for offline dev)
Auto-extraction                   ✅ Optional: pulls durable facts from each exchange
MCP tool integration              ✅ Working (stdio + HTTP)
Multi-backend support             ✅ Anthropic, OpenAI, Gemini, Cohere, Deepseek, Groq
OpenAI-compatible API             /v1/chat/completions, /v1/models
Streaming responses               ✅ Server-Sent Events
Rate limiting                     ✅ Per-user token quotas (hour / day / month, in-memory)
Studio UI                         /admin/ (conversations, memories, judges, live task board, admin edits)
Triggers (cron / webhook / boot)  ✅ Start agents on a schedule or via HTTP POST
Async task queue                  ✅ Fire-and-forget background work with dispatch_task
Sidecars                          ✅ Long-lived helper processes managed by Coulisse
Config variables (vars:)          ✅ Named snippets shared across agent preambles
IDE schema (coulisse schema)      ✅ JSON Schema for autocompletion in VS Code, Helix, Zed…
Durable rate-limit state          ⏳ Planned

Continue to Installation to get started.

Stability

Coulisse is pre-1.0. It follows Semantic Versioning, but during the 0.x phase, minor version bumps (0.1 → 0.2) may include breaking changes to the YAML schema, HTTP surface, or CLI. Patch bumps (0.1.0 → 0.1.1) will not. See the Releasing chapter and CHANGELOG.md for the version history.
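
If you pin Coulisse in CI, the 0.x rule above can be encoded as a small check. This is only a sketch of the stated policy; the helper names parse and may_break are hypothetical and not part of Coulisse.

```python
def parse(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def may_break(old: str, new: str) -> bool:
    """Return True if upgrading old -> new may include breaking changes."""
    old_major, old_minor, _ = parse(old)
    new_major, new_minor, _ = parse(new)
    if new_major != old_major:
        return True  # a major bump can always break
    if old_major == 0:
        return new_minor != old_minor  # during 0.x, minor bumps may break
    return False  # from 1.0 on, minor and patch bumps stay compatible


print(may_break("0.1.0", "0.2.0"))  # minor bump during 0.x
print(may_break("0.1.0", "0.1.1"))  # patch bump
```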

Installation

Coulisse is a single Rust binary. Install it from a prebuilt release or build from source.

Requirements

  • A valid API key for at least one supported provider

Install from a release

The latest GitHub Release ships installers for macOS (x86 + ARM), Linux GNU (x86 + ARM), and Windows MSVC.

macOS / Linux:

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/Almaju/coulisse/releases/latest/download/coulisse-installer.sh | sh

Windows (PowerShell):

powershell -ExecutionPolicy Bypass -c "irm https://github.com/Almaju/coulisse/releases/latest/download/coulisse-installer.ps1 | iex"

The installer drops the coulisse binary on your PATH.

Build from source

Requires a Rust toolchain with edition 2024 support (Rust 1.85 or newer). Install it from rustup.rs.

git clone https://github.com/Almaju/coulisse.git
cd coulisse
cargo build --release

The binary lands at target/release/coulisse. Drop it on your PATH (or alias it) so the rest of this guide can call it as coulisse.

Initialize a config

coulisse init

This writes a minimal coulisse.yaml in the current directory: one OpenAI agent, SQLite memory, and the offline hash embedder. Run coulisse init --from-example instead for the full annotated tour covering every section.

Edit the file to set your provider API key.

Start the server

coulisse start

start runs the server detached: it returns immediately and the process keeps running in the background. Stop it later with coulisse stop.

To run attached (logs streaming to your terminal), use coulisse start --foreground — or just coulisse with no subcommand. Either form binds port 8421.

You should see a startup banner like:

  coulisse 0.1.0

  Proxy   →  http://localhost:8421/v1
  Admin   →  http://localhost:8421/admin

  Memory     sqlite at .coulisse/coulisse-memory.db; user_state: disabled (history only)
  Auth       proxy: open · admin: open

  Agents (1)
    assistant  openai / gpt-4o-mini

The exact lines depend on your config — what matters is that memory, auth, and every configured agent are each acknowledged on startup.

Next: write your first config, or read the CLI reference for every subcommand.

Your first config

A minimal coulisse.yaml has two things: a provider (where to send model calls) and an agent (how to call it).

providers:
  anthropic:
    api_key: ${ANTHROPIC_API_KEY}

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are a helpful assistant.

Save this as coulisse.yaml in your working directory, then run coulisse.

What each piece does

providers

A map of provider kind → credentials. The key must be one of the supported kinds (see Providers). You only need to list the providers you actually use.

API keys (and any other string values) can be read from environment variables using ${VAR_NAME} — Coulisse expands them before parsing the YAML. If a referenced variable is unset, the server refuses to start and names the missing variable. See the YAML reference for details.
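
To make the expansion step concrete, here is a rough Python sketch of the behaviour described above: substitute ${VAR_NAME} in the raw text before any YAML parsing, and fail while naming whatever is unset. The function name expand_env and the exact variable-name pattern are assumptions for illustration, not Coulisse's actual implementation.

```python
import os
import re
from typing import Mapping, Optional

# Assumed pattern: ${NAME} with a conventional shell-style variable name.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace every ${NAME} in text, or raise naming the unset variables."""
    env = os.environ if env is None else env
    missing = sorted({name for name in _VAR.findall(text) if name not in env})
    if missing:
        raise KeyError("unset environment variable(s): " + ", ".join(missing))
    return _VAR.sub(lambda match: env[match.group(1)], text)


raw = "providers:\n  anthropic:\n    api_key: ${ANTHROPIC_API_KEY}\n"
print(expand_env(raw, {"ANTHROPIC_API_KEY": "sk-ant-example"}))
```

Because the substitution runs on the raw text, it applies to any string value in the file, not only API keys.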

agents

A list of agents. Each agent is a named recipe:

  • name — the identifier. Clients ask for the agent by this name via the model field in their request.
  • provider — which configured provider to route to.
  • model — the upstream model identifier to call (e.g. claude-sonnet-4-5-20250929, gpt-4o).
  • preamble — optional system prompt prepended to every conversation.

You can define as many agents as you want — see Multi-agent routing for what that unlocks.

Adding more

Want a code reviewer, a pirate, and a tool-using agent? Just add more entries:

providers:
  anthropic:
    api_key: ${ANTHROPIC_API_KEY}
  openai:
    api_key: ${OPENAI_API_KEY}

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are a helpful assistant.

  - name: code-reviewer
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: |
      You are a thorough code reviewer. Focus on correctness,
      clarity, and security.

  - name: gpt-assistant
    provider: openai
    model: gpt-4o
    preamble: You are a helpful assistant.

Restart the server — all three agents are now selectable by model name.

Next: make a request.

Making a request

Coulisse exposes an OpenAI-compatible API, so any OpenAI SDK works. Point the client at http://localhost:8421/v1 and set the model field to an agent name from your config.

curl

curl http://localhost:8421/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "assistant",
    "safety_identifier": "user-123",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8421/v1",
    api_key="not-needed",  # Coulisse doesn't check this
)

response = client.chat.completions.create(
    model="assistant",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"safety_identifier": "user-123"},
)

print(response.choices[0].message.content)

TypeScript / JavaScript

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8421/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "assistant",
  messages: [{ role: "user", content: "Hello!" }],
  // @ts-expect-error — extra field passed through
  safety_identifier: "user-123",
});

console.log(response.choices[0].message.content);

The safety_identifier field

Coulisse identifies users through the safety_identifier field (or the deprecated user field, which works too). The identifier is what keeps each user's conversation history isolated.

You can turn this off — see User identification — but by default every request needs one.
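
As an illustration of the rule above, the sketch below resolves a user id from a request body. The name resolve_user_id is hypothetical, and preferring safety_identifier when both fields are present is an assumption; the source only says both fields work.

```python
from typing import Optional


def resolve_user_id(body: dict, required: bool = True) -> Optional[str]:
    """Pick the user id from a chat-completions request body."""
    # Assumption: safety_identifier wins over the deprecated user field.
    user_id = body.get("safety_identifier") or body.get("user")
    if user_id:
        return str(user_id)
    if required:
        raise ValueError(
            "request needs a safety_identifier (or the deprecated user field)"
        )
    return None  # identification turned off: no per-user isolation key


print(resolve_user_id({"model": "assistant", "safety_identifier": "user-123"}))
```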

Listing available agents

curl http://localhost:8421/v1/models

Returns every agent you've defined, in OpenAI's model-list format.


That's the whole loop. Next, dig into how to configure providers.

Providers

Providers are where your model calls actually go. Configure each provider once with its credentials; reference it by name from any number of agents.

Supported providers

Kind       Config key
Anthropic  anthropic
Cohere     cohere
Deepseek   deepseek
Gemini     gemini
Groq       groq
OpenAI     openai

Shape

providers:
  anthropic:
    api_key: sk-ant-...
  openai:
    api_key: sk-...
  gemini:
    api_key: ...

Each provider takes a single field: api_key. You only need to list the providers you plan to use — unused ones can be omitted entirely.

Validation

When Coulisse loads your config, it checks that every agent's provider field matches a key under providers. Misspell a provider and startup fails with a clear error:

agent 'assistant' references provider 'antropic' which is not configured
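
If you generate configs programmatically, you can run the same kind of check before handing the file to Coulisse. A minimal sketch, assuming the config has already been parsed into a dict; check_providers is a hypothetical helper, and only the error wording is taken from the message above.

```python
def check_providers(config: dict) -> list[str]:
    """Return one error per agent whose provider is not configured."""
    configured = set(config.get("providers") or {})
    errors = []
    for agent in config.get("agents") or []:
        provider = agent.get("provider")
        if provider not in configured:
            errors.append(
                f"agent '{agent.get('name')}' references provider "
                f"'{provider}' which is not configured"
            )
    return errors


config = {
    "providers": {"anthropic": {"api_key": "sk-ant-example"}},
    "agents": [{"name": "assistant", "provider": "antropic"}],
}
for error in check_providers(config):
    print(error)
```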

Switching providers

Because providers are referenced by name, switching an agent from one backend to another is a small edit to two lines, provider and model:

agents:
  - name: assistant
    provider: anthropic            # ← change this …
    model: claude-sonnet-4-5-20250929   # ← … and this
    preamble: You are helpful.

No client code changes, no redeployment of downstream apps. See Multi-backend support for more on mixing providers.

Agents

Agents are the named personas clients can talk to. Each agent pins down:

  • Which provider to call
  • Which upstream model to ask for
  • What system prompt to prepend
  • Which tools (if any) to expose

Shape

agents:
  - name: code-reviewer
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: |
      You are a thorough code reviewer. Focus on correctness,
      clarity, and security. Point out subtle bugs and suggest
      concrete improvements.
    mcp_tools:
      - server: hello
        only:
          - say_hello

Fields

name (required)

The agent identifier. Clients select this agent by passing name as the model field in their request. Names must be unique across the config.

provider (required)

Must match a key under the top-level providers map. Tells Coulisse which backend to route through.

model (required)

The upstream model identifier. This is provider-specific — e.g. claude-sonnet-4-5-20250929 for Anthropic, gpt-4o for OpenAI, gemini-2.0-flash for Gemini.

preamble (optional)

A system prompt prepended to every conversation this agent handles. Use it to define tone, expertise, constraints, output format — anything you'd normally put in a system message.

Defaults to empty. YAML block scalars (|) are handy for multi-line preambles.

judges (optional)

A list of judge names (from the top-level judges: block) that evaluate this agent's replies in the background. Empty or omitted = no evaluation. See LLM-as-judge evaluation for the full story.

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    judges: [quality, deep_audit]

mcp_tools (optional)

A list of MCP servers and tools this agent is allowed to use. See MCP tools for the full story.

mcp_tools:
  - server: hello           # all tools from "hello"
  - server: calculator      # all tools from "calculator"
    only:                   # …but only these specific ones
      - add
      - multiply

skills (optional)

Names of skills from the top-level skills: directory this agent can use. Each listed skill becomes a tool: its description is advertised to the model, and calling it returns the skill's full instructions. Names not present in the catalog are silently ignored.

skills: [resume-review, salary-negotiation]

See Skills for the full walkthrough.

max_turns (optional)

Maximum number of tool-calling rounds per turn before Coulisse returns the last response. Defaults to 8. Raise it for agents that chain many tool calls in one go (e.g. a coder reading files, editing, handing off to QA).

max_turns: 16

subagents (optional)

A list of other agent names exposed to this agent as callable tools. When the agent's model decides to invoke one, Coulisse starts a fresh conversation against that agent and returns its final message as the tool result.

subagents: [onboarder, resume_critic]

Each name must refer to another entry under agents. Self-reference and duplicates are rejected at startup. Nested invocations are capped at depth 4 to prevent runaway loops. See Multi-agent routing for the full walkthrough.
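
The startup rules above (names must exist, no self-reference, no duplicates) can be sketched as a small validator. This is an illustration only: check_subagents and its error wording are hypothetical, and the depth cap is a runtime concern that this check does not cover.

```python
def check_subagents(agents: list[dict]) -> list[str]:
    """Validate every agent's subagents list against the three rules."""
    names = {agent["name"] for agent in agents}
    errors = []
    for agent in agents:
        seen = set()
        for sub in agent.get("subagents", []):
            if sub == agent["name"]:
                errors.append(f"{agent['name']}: cannot list itself as a subagent")
            elif sub in seen:
                errors.append(f"{agent['name']}: duplicate subagent '{sub}'")
            elif sub not in names:
                errors.append(f"{agent['name']}: unknown subagent '{sub}'")
            seen.add(sub)
    return errors


agents = [
    {"name": "onboarder"},
    {"name": "career_coach", "subagents": ["onboarder", "onboarder", "ghost"]},
]
print(check_subagents(agents))
```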

purpose (optional)

A short tool description shown to other agents when this one is listed under their subagents. Keep it concrete: it is how a calling agent's model decides when to invoke this specialist. You can omit it for agents that are only used directly by clients (never as subagents). The fallback is "Invoke the '<name>' subagent.", but a hand-written purpose is what makes multi-agent orchestration reliable.

purpose: Critique and rewrite a resume for a target role.

Runtime overrides

Agents can also be created, edited, and disabled at runtime through the admin UI or HTTP without touching coulisse.yaml. These runtime entries live in the SQLite database alongside conversation memory and judge scores; the YAML file is never modified by the server.

The resolution rule is simple: when a name is requested, the database is checked first. If a row exists there, it wins. Otherwise the YAML entry (if any) is used. A row can also be a tombstone — a marker that disables a YAML-declared name without removing it from the file.

Each agent carries a label visible in the admin UI:

  • yaml — the agent comes from coulisse.yaml, no database row exists.
  • dynamic — created via the admin UI or HTTP; no YAML entry of this name.
  • override — both YAML and the database define this name; the database version is what runs.
  • tombstoned — a database row disables this name; the agent is hidden from clients even if YAML still declares it.

A "Reset to YAML" action on an override deletes the database row, letting the YAML version reassert. The same action on a tombstoned row re-enables the agent. Database edits never modify the YAML file: if you want a change to survive a database wipe, edit the YAML.
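
The resolution rule and the four labels can be sketched in a few lines of Python. This is a sketch under stated assumptions: the two dicts stand in for the parsed YAML and the database table, and TOMBSTONE, resolve, and label are hypothetical names, not Coulisse internals.

```python
from typing import Optional

TOMBSTONE = object()  # hypothetical marker for a row that disables a name


def resolve(name: str, yaml_agents: dict, db_rows: dict) -> Optional[dict]:
    """Database first; fall back to YAML; a tombstone hides the agent."""
    if name in db_rows:
        row = db_rows[name]
        return None if row is TOMBSTONE else row
    return yaml_agents.get(name)


def label(name: str, yaml_agents: dict, db_rows: dict) -> str:
    """Return the admin-UI label for a known agent name."""
    if name in db_rows:
        if db_rows[name] is TOMBSTONE:
            return "tombstoned"
        return "override" if name in yaml_agents else "dynamic"
    if name in yaml_agents:
        return "yaml"
    raise KeyError(name)


yaml_agents = {"assistant": {"model": "gpt-4o"}, "pirate": {"model": "gpt-4o"}}
db_rows = {"assistant": {"model": "gpt-4o-mini"}, "pirate": TOMBSTONE}
print(label("assistant", yaml_agents, db_rows), resolve("assistant", yaml_agents, db_rows))
```

In this model, "Reset to YAML" is simply deleting the key from db_rows, after which resolve falls through to the YAML entry again.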

Several agents, one config

Define as many agents as you want. A common pattern is having variants of the same model with different preambles:

agents:
  - name: friendly
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are warm and encouraging.

  - name: terse
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: Reply in one sentence. No preamble, no filler.

  - name: pirate
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: Respond exclusively as a pirate, arrr.

Clients switch between them by changing the model field — no server redeploy, no code change.

Memory

Coulisse remembers two things automatically:

  1. Conversation history — every message in every turn, per user. Always on.
  2. User state — durable facts and preferences extracted from those conversations and recalled into future prompts. Off by default; one line of YAML turns it on.

Quick start

The simplest possible memory config is no config at all — omit the memory: block and you get:

  • Conversation history kept in SQLite at .coulisse/coulisse-memory.db.
  • Long-term user state off.

Turning on long-term user state takes a single setting:

memory:
  user_state: true

Now Coulisse will, after each turn:

  • Ask a small "haiku-tier" model what's worth remembering about the user.
  • Embed those facts and store them.
  • On future requests, recall the most relevant ones and inject them into the prompt as a Known about the user: block.

You don't pick the embedder or the extraction model — Coulisse derives both automatically from your providers: block. (See auto-derivation below for the rules.)

What gets injected into the prompt

When user state is on, every request to an agent gets a system message like:

Known about the user:
- [fact] lives in Paris
- [preference] prefers WhatsApp-style short answers

…inserted after your agent's preamble and before the conversation history.
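
To visualise the ordering, here is a minimal sketch of how the message list could be assembled: preamble first, then the recalled facts, then the history. The function name build_messages and the (kind, text) tuple shape are assumptions made for the example.

```python
def build_messages(
    preamble: str,
    recalled: list[tuple[str, str]],
    history: list[dict],
) -> list[dict]:
    """Order: agent preamble, 'Known about the user' block, conversation."""
    messages = []
    if preamble:
        messages.append({"role": "system", "content": preamble})
    if recalled:
        lines = ["Known about the user:"]
        lines += [f"- [{kind}] {text}" for kind, text in recalled]
        messages.append({"role": "system", "content": "\n".join(lines)})
    messages.extend(history)
    return messages


messages = build_messages(
    "You are a helpful assistant.",
    [("fact", "lives in Paris"), ("preference", "prefers WhatsApp-style short answers")],
    [{"role": "user", "content": "Any lunch ideas?"}],
)
for message in messages:
    print(message["role"], "|", message["content"])
```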

Where data lives

There is nothing to configure. The database is always .coulisse/coulisse-memory.db — the project state directory next to your coulisse.yaml, alongside the log, PID, MCP secrets, and uploaded files. Created on first boot if missing.

For Docker, mount the .coulisse/ directory on a volume so it survives container restarts.


Advanced

You usually don't need any of this. Skip unless you have a specific reason — defaults are picked to "just work" for the common case.

Picking the extraction model explicitly

By default Coulisse picks a cheap model from the first matching provider in your providers: block (see Auto-derivation). To pin one:

memory:
  user_state:
    learn_from:
      provider: anthropic
      model: claude-haiku-4-5-20251001

Picking the embedder explicitly

memory:
  user_state:
    embed_with:
      provider: voyage
      model: voyage-3.5
      api_key: pa-...               # required for Voyage

Voyage is the only embedder that needs an explicit API key here — openai reuses the key from your top-level providers.openai entry.

Recall and dedup tuning

memory:
  user_state:
    recall_k: 5             # how many facts to recall per request
    dedup_threshold: 0.9    # cosine similarity above which a "new" fact is dropped
    max_facts_per_turn: 5   # cap on facts written per exchange
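
For intuition about what these three knobs control, here is a rough sketch using plain cosine similarity over small vectors. It is not Coulisse's implementation: accept_facts and recall are hypothetical names, and comparing new facts against each other within one turn is an assumption.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def accept_facts(candidates, stored, dedup_threshold=0.9, max_facts_per_turn=5):
    """candidates: [(text, vector)]; stored: [vector]. Returns kept texts."""
    accepted, known = [], list(stored)
    for text, vector in candidates:
        if len(accepted) >= max_facts_per_turn:
            break  # cap on facts written per exchange
        if any(cosine(vector, other) > dedup_threshold for other in known):
            continue  # too similar to something already known: drop it
        accepted.append(text)
        known.append(vector)
    return accepted


def recall(query, stored, recall_k=5):
    """stored: [(text, vector)]. Returns the recall_k most similar texts."""
    ranked = sorted(stored, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:recall_k]]


print(accept_facts([("dup", [1.0, 0.01]), ("new", [0.0, 1.0])], [[1.0, 0.0]]))
```

Raising dedup_threshold toward 1.0 keeps more near-duplicates; lowering it makes the store stricter about what counts as new.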

Auto-derivation

When user_state: true (or when fields under user_state: are omitted):

  • Embedder. If openai is in your providers:, Coulisse uses text-embedding-3-small and reuses the OpenAI key. Otherwise it falls back to the offline hash embedder (deterministic, no semantic understanding — fine for tests, never for production).
  • Extraction model. Coulisse picks the first configured provider in this priority order: anthropic → openai → gemini → groq → deepseek → cohere. It then uses that provider's known cheap model (e.g. claude-haiku-4-5-20251001, gpt-4o-mini).

If user_state: true but you have no providers configured, Coulisse refuses to start with a clear error.
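
The derivation rules above can be written down as a short function. Treat this as a sketch: derive_user_state is a hypothetical name, and the cheap-model table only contains the two models this guide names, so other providers resolve to None here rather than a guessed model.

```python
EXTRACTION_PRIORITY = ["anthropic", "openai", "gemini", "groq", "deepseek", "cohere"]

# Only the cheap models named in this guide; the rest are left unknown on purpose.
CHEAP_MODELS = {
    "anthropic": "claude-haiku-4-5-20251001",
    "openai": "gpt-4o-mini",
}


def derive_user_state(providers: dict) -> dict:
    """Pick the embedder and extraction model from the providers map."""
    chosen = next((kind for kind in EXTRACTION_PRIORITY if kind in providers), None)
    if chosen is None:
        raise ValueError("user_state: true needs at least one configured provider")
    if "openai" in providers:
        embedder = {"provider": "openai", "model": "text-embedding-3-small"}
    else:
        embedder = {"provider": "hash", "dims": 32}  # offline fallback
    return {
        "embed_with": embedder,
        "learn_from": {"provider": chosen, "model": CHEAP_MODELS.get(chosen)},
    }


print(derive_user_state({"anthropic": {"api_key": "sk-ant-example"}}))
```

Note that the two choices are independent: with both anthropic and openai configured, extraction goes to Anthropic while embeddings use OpenAI.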

Supported embedder models

  • openai: text-embedding-3-small (1536 dims, default), text-embedding-3-large (3072 dims), text-embedding-ada-002 (1536 dims).
  • voyage: voyage-3.5 (1024, default), voyage-3-large (1024), voyage-3.5-lite (1024), voyage-code-3 (1024), voyage-finance-2 (1024), voyage-law-2 (1024), voyage-code-2 (1536).
  • hash: any positive dims (default 32). Offline only.

Unknown model names fail at startup with a clear error.

Disabling user state

Either omit the user_state: field entirely or set it to false:

memory:
  user_state: false

When disabled, Coulisse keeps conversation history but performs no extraction and no recall.

Example configs

Anthropic only — auto-everything

providers:
  anthropic:
    api_key: sk-ant-...

memory:
  user_state: true

Auto-resolution: extraction uses claude-haiku-4-5-20251001, embeddings fall back to the offline hash embedder (because Voyage needs an explicit api_key).

OpenAI end-to-end

providers:
  openai:
    api_key: sk-...

memory:
  user_state: true

Auto-resolution: extraction uses gpt-4o-mini, embeddings use text-embedding-3-small with the OpenAI key.

Anthropic completions + Voyage embeddings

providers:
  anthropic:
    api_key: sk-ant-...

memory:
  user_state:
    embed_with:
      provider: voyage
      model: voyage-3.5
      api_key: pa-...

Offline dev — no external calls

Omit the memory: block entirely (or set user_state: false): conversation history is kept on disk under .coulisse/, with no extraction or embedding API calls. Delete the database any time with coulisse reset.

MCP tools

Coulisse can borrow tools from Model Context Protocol servers and hand them to your agents. The config has one rule: declare what the server is, not what protocol it speaks. Coulisse infers the transport from the shape of the entry.

Declaring MCP servers

mcp:
  # Remote MCP — just paste the URL. OAuth is auto-enabled.
  todoist:
    url: https://ai.todoist.net/mcp

  # Local stdio MCP — give it a command.
  hello:
    command: uvx
    args:
      - --from
      - git+https://github.com/macsymwang/hello-mcp-server.git
      - hello-mcp-server

  # Plain HTTP MCP without auth — explicit opt-out.
  calculator:
    url: http://localhost:8080
    oauth: false

The Todoist entry above is zero config, the same UX as ChatGPT. Paste the URL, and on first use Coulisse runs RFC 8414 discovery and RFC 7591 Dynamic Client Registration, mints a per-user connect link, and stores the token in the vault.

You never write oauth: for the common case — a url: server gets discover-mode OAuth on its own. Reach for the oauth: map only to override scopes or use static credentials:

mcp:
  # Override discovered scopes
  custom:
    url: https://example.com/mcp
    oauth:
      scopes: [read:items, write:items]   # mode: discover is implied

  # Pre-registered (static) OAuth credentials — for providers that
  # don't support Dynamic Client Registration
  legacy:
    url: https://internal.example.com/mcp
    oauth:
      mode: static
      authorization_url: https://auth.example.com/authorize
      token_url: https://auth.example.com/token
      client_id: my-client
      client_secret: my-secret
      redirect_uri: http://localhost:8423/mcp/legacy/oauth/callback

That's it. No transport: field, no oauth: block for the common case, no shim wrappers. Coulisse figures out:

  • url: present → HTTP/SSE transport (SSE if the URL path contains /sse, otherwise streamable HTTP).
  • command: present → stdio transport, with optional args: / env: for the child process.
  • oauth: absent on a url: server → discover-mode OAuth. Set oauth: false to opt out, or supply an oauth: map to override scopes or use static credentials.

Auto-detected transport

The path heuristic: if the URL has an /sse path segment (https://mcp.atlassian.com/v1/sse), Coulisse uses the older MCP-over-SSE protocol. Everything else uses streamable HTTP. URLs without /sse that turn out to be SSE-only will fail with a Missing sessionId parameter 404 on first call; switch to the explicit form below.
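
Here is a small sketch of the inference rules described in this section, written against an already-parsed entry. It is illustrative only: infer_transport and infer_oauth are hypothetical names, matching on a whole "sse" path segment is one reading of the heuristic, and checking command: before url: is an arbitrary choice for entries that set both.

```python
from typing import Optional
from urllib.parse import urlparse


def infer_transport(entry: dict) -> str:
    """Return 'stdio', 'sse', or 'http' for one mcp: entry."""
    if "transport" in entry:
        return entry["transport"]  # explicit override wins
    if "command" in entry:
        return "stdio"
    if "url" in entry:
        segments = urlparse(entry["url"]).path.split("/")
        return "sse" if "sse" in segments else "http"
    raise ValueError("MCP entry needs either url: or command:")


def infer_oauth(entry: dict) -> Optional[str]:
    """Return the OAuth mode, or None when OAuth does not apply."""
    if "url" not in entry or entry.get("oauth") is False:
        return None  # stdio server, or explicit opt-out
    oauth = entry.get("oauth") or {}
    return oauth.get("mode", "discover")  # url servers default to discover


print(infer_transport({"url": "https://mcp.atlassian.com/v1/sse"}))
print(infer_oauth({"url": "http://localhost:8080", "oauth": False}))
```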

stdio config fields

  • command (required): the executable to spawn (uvx, python, node, …)
  • args (optional): arguments
  • env (optional): environment variables

mcp:
  my-tool:
    command: python
    args: [-m, my_mcp_server]
    env:
      DEBUG: "1"
      API_KEY: abc123

Explicit transport: (legacy / override)

The verbose form still works if you need to override the auto-detection:

mcp:
  legacy:
    transport: sse           # one of: http, sse, stdio
    url: https://example.com/v2/endpoint    # despite no /sse segment

Existing YAMLs that use transport: continue to parse unchanged. New configs should prefer the URL-only / command-only form above.

Per-user OAuth (optional)

MCP servers that require user-delegated credentials (Todoist, Atlassian, GitHub, Google Drive, etc.) can be configured with an oauth: block. Coulisse handles the authorization flow per-user and injects each user's token automatically at call time — Alice's token is never reachable by Bob.

Two modes:

mode: discover (default, for spec-compliant servers)

Spec-compliant MCP servers (Todoist, Atlassian, Linear, …) advertise their OAuth endpoints via /.well-known/oauth-authorization-server and accept Dynamic Client Registration. Coulisse discovers + registers itself lazily, on the first user to authorise. No credentials in YAML — and no oauth: block at all, since a URL-based server defaults to discover:

mcp:
  todoist:
    url: https://ai.todoist.net/mcp
    # discover OAuth is automatic; add an oauth: map only to pin scopes:
    # oauth:
    #   scopes: [data:read_write]

A handful of servers only honour tokens issued to mcp-remote's grandfathered client id and reject fresh DCR registrations (Todoist's hosted MCP is the current example). For those, run mcp-remote yourself as a stdio child — there is no special flag:

mcp:
  todoist:
    command: npx
    args: [-y, mcp-remote, https://ai.todoist.net/mcp]

mode: static (for non-DCR providers)

For OAuth providers that require a pre-registered app (GitHub OAuth apps, classic Atlassian Connect, etc.):

mcp:
  github:
    transport: http
    url: https://api.githubcopilot.com/mcp
    oauth:
      mode: static
      authorization_url: https://github.com/login/oauth/authorize
      client_id: "${GH_CLIENT_ID}"
      client_secret: "${GH_CLIENT_SECRET}"
      redirect_uri: https://coulisse.example.com/mcp/github/oauth/callback
      scopes: [repo, read:user]
      token_url: https://github.com/login/oauth/access_token

static requires: authorization_url, client_id, client_secret, redirect_uri, token_url. Missing any of these at startup is a fatal error.
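
A pre-flight check for the required fields might look like the sketch below. The field list comes from the sentence above; missing_static_fields is a hypothetical helper, and treating empty strings as missing is an assumption.

```python
STATIC_REQUIRED = (
    "authorization_url",
    "client_id",
    "client_secret",
    "redirect_uri",
    "token_url",
)


def missing_static_fields(oauth: dict) -> list[str]:
    """List the required fields absent from a mode: static oauth map."""
    if oauth.get("mode") != "static":
        return []  # discover mode needs no credentials in YAML
    return [field for field in STATIC_REQUIRED if not oauth.get(field)]


oauth = {
    "mode": "static",
    "authorization_url": "https://github.com/login/oauth/authorize",
    "client_id": "my-client",
    "scopes": ["repo", "read:user"],
}
print(missing_static_fields(oauth))
```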

Both modes share the same infrastructure secrets (vault encryption + HMAC). Coulisse auto-generates them on first boot and persists them in .coulisse/secrets.env — no manual setup needed for local use. Override with COULISSE_VAULT_KEY / COULISSE_HMAC_KEY env vars for hosted deployments. auth.mcp_consumer_secret is optional (only gates the admin POST /connect-link endpoint). Set public_base_url: at the top level when Coulisse runs on a public hostname; defaults to http://localhost:{port} for local use.

See Per-user OAuth for MCP for the full flow, endpoints, secrets resolution, and the security trust-model warning.

Granting tool access to agents

An agent only sees tools you explicitly give it. Reference the server name under mcp_tools:

agents:
  - name: helper
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    mcp_tools:
      - server: hello           # all tools from "hello"

Restrict to a subset with only:

    mcp_tools:
      - server: hello
        only:
          - say_hello           # only this tool, nothing else
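
The grant semantics can be sketched as a filter over what each server advertises. This is only an illustration; allowed_tools is a hypothetical name, and the advertised map stands in for the tool lists Coulisse discovers from the servers.

```python
def allowed_tools(
    grants: list[dict],
    advertised: dict[str, list[str]],
) -> list[tuple[str, str]]:
    """Return (server, tool) pairs an agent may see, honouring only:."""
    result = []
    for grant in grants:
        server = grant["server"]
        only = grant.get("only")  # None means every tool from this server
        for tool in advertised.get(server, []):
            if only is None or tool in only:
                result.append((server, tool))
    return result


advertised = {
    "hello": ["say_hello", "say_goodbye"],
    "calculator": ["add", "multiply", "divide"],
}
grants = [{"server": "hello"}, {"server": "calculator", "only": ["add", "multiply"]}]
print(allowed_tools(grants, advertised))
```

Servers that are not listed under mcp_tools contribute nothing, which matches the rule that an agent only sees tools you explicitly give it.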

Discovering tool names

On startup Coulisse connects to each non-OAuth MCP server and logs the tools it discovered. OAuth-enabled servers connect per-user on first use. Tool names in your only list must match what the server advertises — check the startup output or the server's own docs.

How tool calls work

When a request arrives for an agent with tools:

  1. Coulisse collects the agent's allowed tools from the MCP servers.
  2. It forwards them to the model as tool definitions.
  3. If the model calls a tool, Coulisse dispatches to the MCP server and feeds the result back.
  4. This loops until the model produces a final answer (up to 8 turns by default, configurable via the agent's max_turns field).

Your client doesn't see any of this — the tool loop is invisible, and only the final assistant message is returned.
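
The loop in the numbered steps above can be sketched with two injected callables, so the example runs without any network access. Everything here is hypothetical scaffolding: run_tool_loop, the reply shape, and the message shapes are assumptions chosen for the sketch, not Coulisse's internal types.

```python
def run_tool_loop(call_model, call_tool, messages, max_turns=8):
    """Call the model, run requested tools, repeat until a final answer.

    call_model(messages) -> {"content": str, "tool_calls": [{"name", "arguments"}]}
    call_tool(name, arguments) -> str
    """
    messages = list(messages)  # do not mutate the caller's list
    reply = {"content": "", "tool_calls": []}
    for _ in range(max_turns):
        reply = call_model(messages)
        if not reply.get("tool_calls"):
            break  # final answer: stop looping
        messages.append({
            "role": "assistant",
            "content": reply.get("content", ""),
            "tool_calls": reply["tool_calls"],
        })
        for call in reply["tool_calls"]:
            result = call_tool(call["name"], call["arguments"])
            messages.append({"role": "tool", "name": call["name"], "content": result})
    # If the cap is hit, the last response is returned as-is.
    return reply.get("content", "")
```

Only the returned string would reach the client; the intermediate assistant and tool messages stay inside the loop.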

See MCP tool integration for a full walkthrough.

Multi-agent routing

Coulisse lets you define multiple agents and route between them with nothing more than the model field of a request. No extra endpoints, no custom headers, no proxy tricks.

Why it matters

Most apps end up needing more than one model configuration:

  • A fast, cheap agent for classification and quick replies.
  • A heavier agent for hard reasoning.
  • A specialized agent (code reviewer, translator, summarizer) with a tuned preamble.
  • A tool-using agent that can reach into an MCP server.

Without something like Coulisse, that means either multiple deployments or a growing pile of if (mode === ...) switches inside your app.

The pattern

Declare each variant as a separate agent:

agents:
  - name: triage
    provider: anthropic
    model: claude-haiku-4-5-20251001
    preamble: Classify the user's intent. Reply with a single word.

  - name: reasoner
    provider: anthropic
    model: claude-opus-4-7
    preamble: You are a careful reasoner. Think step by step.

  - name: translator
    provider: openai
    model: gpt-4o
    preamble: Translate the user's message into French.

Your application picks which agent to call by setting the model field:

fast  = client.chat.completions.create(model="triage", ...)
smart = client.chat.completions.create(model="reasoner", ...)
fr    = client.chat.completions.create(model="translator", ...)

What each agent brings to the request

When a request arrives, Coulisse:

  1. Looks up the named agent.
  2. Prepends the agent's preamble as a system message.
  3. Resolves the agent's allowed MCP tools (if any).
  4. Forwards the call to the agent's configured provider and model.
  5. Records the exchange in the caller's per-user memory.

Changing agents is free — you don't need to redeploy anything on the client side.

Discovering agents at runtime

GET /v1/models returns every agent in the config in OpenAI's standard model-list format. Useful for UIs that want to populate a model picker from the server:

curl http://localhost:8421/v1/models
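
For a model picker, the part of the response you need is the id of each entry. The sketch below builds a payload in OpenAI's model-list shape and extracts the ids; model_list and agent_ids are hypothetical helpers, and the created and owned_by values are placeholders, since this guide does not say what Coulisse puts there.

```python
def model_list(agent_names: list[str], created: int = 0, owned_by: str = "coulisse") -> dict:
    """Build a payload in OpenAI's model-list format (placeholder metadata)."""
    return {
        "object": "list",
        "data": [
            {"id": name, "object": "model", "created": created, "owned_by": owned_by}
            for name in agent_names
        ],
    }


def agent_ids(payload: dict) -> list[str]:
    """Extract the agent names a UI would show in its picker."""
    return [entry["id"] for entry in payload["data"]]


payload = model_list(["triage", "reasoner", "translator"])
print(agent_ids(payload))
```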

Subagents: agents as tools

Routing by model lets the client pick an agent per request. Sometimes you want one agent to call another from within a turn, so the conversation stays with the top-level agent while specialists handle focused sub-tasks. Coulisse exposes this via the subagents field.

agents:
  - name: onboarder
    provider: anthropic
    model: claude-haiku-4-5-20251001
    purpose: Collect the user's profile — first name, last name, phone, goals.
    preamble: |
      Ask the user for any missing profile field. Keep questions short.

  - name: resume_critic
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    purpose: Critique and rewrite a resume for a target role.
    preamble: |
      Given a resume and a target role, return a revised resume and
      a bullet list of the biggest gaps to address.

  - name: career_coach
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    subagents: [onboarder, resume_critic]
    preamble: |
      Guide the user. Delegate to `onboarder` if the profile is
      incomplete, and `resume_critic` when they want resume work.

When career_coach runs, the onboarder and resume_critic agents appear in its tool list alongside any MCP tools. If the model calls onboarder, Coulisse starts a fresh conversation against that agent with just the message it was given — the onboarder sees its own preamble and its own MCP tools, nothing inherited from the parent. The onboarder's final assistant message is returned to the coach as the tool result.

The purpose field

purpose is the tool description shown to the calling agent. It's how the coach's LLM decides whether this subagent is the right choice for the current turn. Keep it short and concrete — "Critique and rewrite a resume for a target role" is good; "Helpful assistant" is useless.

If purpose is absent, Coulisse falls back to "Invoke the '<name>' subagent." — functional, but a clear purpose is what makes orchestration reliable.

Bounded recursion

Calling a subagent is itself a tool call — the subagent can have its own subagents, which can have their own, and so on. To prevent a pathological A → B → A → … loop from burning tokens, Coulisse caps nested invocations at depth 4. Going over returns a clear error that the parent agent sees and can react to.

Fresh context

Every subagent invocation starts with a new conversation. The subagent does not see the parent's message history, the user's original request, or any other sibling subagent's output. It gets only the message the parent passed when calling it, plus its own preamble.

This isolation is deliberate. It keeps subagents focused, prevents context bloat, and makes each subagent's behavior reproducible in isolation. If you want data to flow between agents, store it in an MCP server and have both agents read it — Coulisse owns no cross-agent state.

Why subagents and MCPs live side by side

mcp_tools and subagents both appear in an agent's tool list, but they model different things:

  • An MCP tool is a stateless function call against an external server: fixed schema, data in and data out.
  • A subagent is another LLM conversation that happens to be kicked off by a tool call. It has its own preamble, its own tool loop, and can itself delegate further.

Reach for mcp_tools when the work is a concrete operation (save a record, search a database, send an email). Reach for subagents when the work needs its own LLM reasoning under a different preamble.

MCP tool integration

Coulisse is a client for Model Context Protocol servers. Any MCP-compliant tool — a calculator, a filesystem browser, a REST API wrapper, your in-house data fetcher — becomes usable by any agent with a one-line config change.

End-to-end example

Imagine a small MCP server that exposes a say_hello tool. Register it and hand it to an agent:

providers:
  anthropic:
    api_key: sk-ant-...

mcp:
  hello:
    transport: stdio
    command: uvx
    args:
      - --from
      - git+https://github.com/macsymwang/hello-mcp-server.git
      - hello-mcp-server

agents:
  - name: greeter
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You greet people warmly.
    mcp_tools:
      - server: hello

Start the server. On boot you'll see Coulisse discover the server's tools and note them in the log.

Now the greeter agent can call say_hello whenever the model decides it's useful. Your client makes a normal chat completion request:

curl http://localhost:8421/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "greeter",
    "safety_identifier": "user-1",
    "messages": [
      {"role": "user", "content": "Please greet Alice."}
    ]
  }'

The model may call the tool one or more times; Coulisse runs the tool loop internally and returns only the final assistant message.

Under the hood, every invocation — tool name, arguments, result (or error) — is recorded against the assistant message that produced it, so you can replay the turn in the studio UI and see which tools fired and what came back. This is tool-call capture for debugging, not an extension of the OpenAI surface: the wire response your SDK receives is unchanged.

Transports

  • stdio — good for local MCP servers you spawn yourself (Python scripts, Node programs, CLI tools). Coulisse manages the child process.
  • http — good for long-running MCP services, especially ones shared across multiple Coulisse instances.

Both are configured the same way conceptually; see MCP tools for fields.

Scoping tools per agent

Different agents can see different subsets of tools, even from the same server:

agents:
  - name: power-user
    mcp_tools:
      - server: filesystem      # every tool the filesystem server offers

  - name: read-only
    mcp_tools:
      - server: filesystem
        only:
          - read_file
          - list_files          # write / delete tools aren't exposed

This is Coulisse-side filtering — the model never sees the excluded tools, so it can't call them.
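
The effect of `only` can be sketched as a plain allow-list filter. This is an illustrative Python sketch, not Coulisse's code: `visible_tools` is a hypothetical helper, and `write_file` / `delete_file` are made-up tool names standing in for whatever the filesystem server really offers.

```python
from typing import List, Optional


def visible_tools(server_tools: List[str], only: Optional[List[str]] = None) -> List[str]:
    """Return the tools an agent may see from one MCP server.

    With no `only` list, every tool the server offers is exposed.
    Otherwise only the named tools survive, in the server's order.
    """
    if only is None:
        return list(server_tools)
    allowed = set(only)
    return [tool for tool in server_tools if tool in allowed]


# Hypothetical tool inventory for the filesystem server.
filesystem = ["read_file", "list_files", "write_file", "delete_file"]

power_user = visible_tools(filesystem)
read_only = visible_tools(filesystem, only=["read_file", "list_files"])
```

Because the filter runs before the tool list reaches the model, the excluded names never appear in the prompt at all.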

Tool loop limits

Coulisse caps a single request at 8 tool-call turns by default. If the model hasn't produced a final answer by then, the request ends. This keeps runaway loops from billing you forever. You can raise or lower the limit per agent with the max_turns field (see the agents YAML reference).
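
A turn cap of this kind amounts to a bounded loop. The Python sketch below is a simplified illustration under stated assumptions: `run_tool_loop` and `step` are hypothetical names, `step` stands in for one model round trip, and returning `None` at the cap is a placeholder for however Coulisse actually ends the request.

```python
from typing import Callable, Optional

DEFAULT_MAX_TURNS = 8  # the default named in the text


def run_tool_loop(step: Callable[[int], Optional[str]],
                  max_turns: int = DEFAULT_MAX_TURNS) -> Optional[str]:
    """Call `step` once per turn until it yields a final answer.

    `step(turn)` returns the final text, or None when the model asked
    for another tool call instead. Returns None if the cap is reached.
    """
    for turn in range(1, max_turns + 1):
        answer = step(turn)
        if answer is not None:
            return answer
    return None


# A model that answers on its third turn finishes normally.
finished = run_tool_loop(lambda turn: "done" if turn == 3 else None)

# A model that never stops calling tools is cut off at the cap.
calls = []
runaway = run_tool_loop(lambda turn: calls.append(turn))
```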

Capture limitations

Tool-call capture only runs on the streaming path. The OpenAI SDKs send non-streaming chat completions unless you ask for a stream, so set "stream": true on requests whose tool calls you want in the studio trail. Non-streaming requests ("stream": false) still execute tools correctly; their invocations just aren't captured for the studio trail, because rig's non-streaming API doesn't expose intermediate events.

If a client disconnects mid-stream after a tool call has fired but before the result lands, the call is persisted with result: null so the studio UI still shows that the attempt happened.

Per-user OAuth

For MCP servers that require each user to authorize access with their own account (Jira, GitHub, Google, etc.), see Per-user OAuth for MCP.

Multi-backend support

Coulisse speaks to six providers out of the box:

  • Anthropic
  • OpenAI
  • Gemini
  • Cohere
  • DeepSeek
  • Groq

You can mix them freely in a single config.

Why mix backends?

  • Cost tiering. Run quick tasks on a cheap model (Groq, Haiku, gpt-4o-mini), hard tasks on a flagship.
  • Capability routing. Some tasks benefit from a specific provider's strengths — long-context summarization on Gemini, coding on Sonnet, reasoning on Opus.
  • Redundancy. If one provider has an outage, flip a single provider field to route through another.
  • Evaluation. A/B the same preamble on two different models without changing any client code.

One config, many backends

providers:
  anthropic:
    api_key: sk-ant-...
  openai:
    api_key: sk-...
  gemini:
    api_key: ...
  groq:
    api_key: ...

agents:
  - name: quick
    provider: groq
    model: llama-3.3-70b-versatile
    preamble: Answer briefly.

  - name: smart
    provider: anthropic
    model: claude-opus-4-7
    preamble: Think carefully.

  - name: long-context
    provider: gemini
    model: gemini-2.0-flash
    preamble: You excel at synthesizing long documents.

Your client picks one by name — everything else stays the same.

The client side is unchanged

Because Coulisse exposes an OpenAI-compatible API no matter which provider is behind an agent, your client code never has to know. You don't install the Anthropic SDK, Gemini SDK, and OpenAI SDK side by side — you just use the OpenAI SDK and change the model field.

Streaming responses

Coulisse implements OpenAI's Server-Sent Events (SSE) format for chat completions. Set stream: true in the request and the server emits incremental chat.completion.chunk frames over the wire — drop-in compatible with the OpenAI Python and JavaScript SDKs and any client that already speaks the OpenAI streaming protocol.

Asking for a stream

Add stream: true to a normal /v1/chat/completions request:

{
  "model": "assistant",
  "safety_identifier": "user-123",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": true
}

The response is text/event-stream instead of JSON. Each frame is one chat.completion.chunk.

Wire format

The first frame announces the assistant role:

data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"role":"assistant"}}]}

Then one frame per text delta:

data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"content":" there"}}]}

A terminal frame sets finish_reason:

data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Including token usage

Set stream_options.include_usage: true to receive a usage field on the terminal chunk:

{
  "model": "assistant",
  "messages": [{"role": "user", "content": "Hi"}],
  "stream": true,
  "stream_options": {"include_usage": true}
}

The terminal frame then carries usage:

data: {"...":"...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"completion_tokens":3,"prompt_tokens":7,"total_tokens":10}}

When include_usage is missing or false, the field is omitted — matching OpenAI's contract.
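
If you are not using an OpenAI SDK, the frames above can be folded by hand. This is a minimal Python sketch, assuming each frame arrives as one `data: ` line; `collect_stream` is a hypothetical helper, and the sample frames are abbreviated (the `id`, `object`, `created`, and `model` fields are left out).

```python
import json


def collect_stream(lines):
    """Fold OpenAI-style SSE lines into (text, finish_reason, usage)."""
    parts, finish_reason, usage = [], None, None
    for line in lines:
        if not line.startswith("data: "):
            continue  # blank separators, comments, other SSE fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage") is not None:
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            # The role frame and the terminal frame carry no content.
            parts.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason, usage


sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" there"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],'
    '"usage":{"completion_tokens":3,"prompt_tokens":7,"total_tokens":10}}',
    "",
    "data: [DONE]",
]
text, finish_reason, usage = collect_stream(sample)
```

With `include_usage` off, the same function returns `None` for `usage`, since the field is simply absent.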

Memory and rate limiting

Streaming responses use the same per-user memory bucket and rate-limit accounting as non-streaming requests:

  • The user's message and the assistant's reply are appended to memory after the stream ends.
  • Token usage is recorded against the rate-limit window when the stream ends.
  • If the client disconnects mid-stream, Coulisse persists the partial assistant reply (everything received before the disconnect). This matches what the user actually saw — the next turn won't claim the model said something the user never received.

Tool-using agents

Agents with MCP tools attached stream the same way. Tool-call internals run inside the rig multi-turn loop and are not surfaced to the client; you'll see a pause while a tool runs, then the model's text continues. Apart from the opening role frame, delta.content is the only delta variant Coulisse currently emits.

Subagent handoff events

When an agent delegates to a subagent, the stream doesn't go silent — Coulisse signals the handoff so your UI can show meaningful feedback instead of a frozen spinner.

handoff_started

Emitted immediately before the subagent is invoked:

event: handoff_started
data: {"agent":"resume_critic"}

Use this to update your UI: "Passing to resume_critic…" is better than a silent spinner.

Heartbeat

While a subagent is running, Coulisse emits a keep-alive comment every 20 seconds:

: heartbeat

This is a standard SSE comment (lines starting with :). Most SSE clients ignore it automatically — it exists to prevent proxies and load balancers from closing the connection during long subagent turns.

If your SSE stream goes silent for more than 20 seconds during a subagent turn, that's a bug — open an issue.

Sequence during a handoff

# Parent agent starts responding
data: {"choices":[{"delta":{"content":"Let me get the resume critic on this."}}]}

# Handoff announced
event: handoff_started
data: {"agent":"resume_critic"}

# Heartbeats while subagent works
: heartbeat
: heartbeat

# Subagent result flows back as normal content
data: {"choices":[{"delta":{"content":"Here's the revised resume…"}}]}

data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]

The subagent's internal turns are not surfaced — you only see the final result as delta.content from the parent.
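
A client that wants to react to handoffs has to look at the `event:` field and at comment lines, not just `data:`. The following Python sketch shows one way to do that, assuming events are separated by blank lines as in standard SSE; `parse_sse` is a hypothetical helper and the sample payload is abbreviated.

```python
import json


def parse_sse(raw):
    """Yield (event_name, data) pairs from raw SSE text.

    Comment lines (starting with ':') are yielded as ("comment", text).
    Events without an explicit name use the SSE default, "message".
    """
    for block in raw.strip().split("\n\n"):
        event, data = "message", []
        for line in block.splitlines():
            if line.startswith(":"):
                yield ("comment", line[1:].strip())
            elif line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data.append(line[len("data:"):].strip())
        if data:
            yield (event, "\n".join(data))


raw = (
    'data: {"choices":[{"delta":{"content":"Let me get the resume critic on this."}}]}\n\n'
    "event: handoff_started\n"
    'data: {"agent":"resume_critic"}\n\n'
    ": heartbeat\n\n"
    'data: {"choices":[{"delta":{"content":"Here is the revised resume."}}]}\n\n'
    "data: [DONE]\n\n"
)

status = []
for event, data in parse_sse(raw):
    if event == "handoff_started":
        status.append("Passing to " + json.loads(data)["agent"] + "…")
    elif event == "comment":
        status.append("still working")
    elif data != "[DONE]":
        status.append(json.loads(data)["choices"][0]["delta"].get("content", ""))
```

A real UI would replace the `status` list with spinner or label updates.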

Errors mid-stream

If the upstream provider fails after the stream has started, Coulisse emits one terminal frame containing an error field with the failure reason, then [DONE]. The HTTP status is already 200 by then — clients should check for the error field on the final chunk.
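
Since the status code can no longer signal failure, the check has to happen per chunk. This Python sketch assumes the failure reason sits under a top-level `error` key of the chunk; the exact shape of that field is not specified here, and `StreamError` and `content_or_raise` are hypothetical names.

```python
import json


class StreamError(Exception):
    """Raised when a streamed chunk reports an upstream failure."""


def content_or_raise(payload):
    """Return the chunk's text delta, or raise if it carries an error."""
    chunk = json.loads(payload)
    if "error" in chunk:
        # Assumed shape: the reason is whatever sits under "error".
        raise StreamError(str(chunk["error"]))
    choices = chunk.get("choices") or [{}]
    return choices[0].get("delta", {}).get("content") or ""


ok = content_or_raise('{"choices":[{"delta":{"content":"hi"}}]}')

try:
    content_or_raise('{"error":"upstream provider failed"}')
    failure = None
except StreamError as exc:
    failure = str(exc)
```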

Studio UI

Coulisse ships a studio UI for browsing the conversations and memories the server has seen, and for editing the live YAML config. It's served by the same binary, under /admin/.

Point a browser at http://localhost:8421/admin/ while the server is running, or run coulisse studio (alias coulisse admin) to open it for you.

What you can do

  • List every user the server has seen, most recent activity first, with message and memory counts.
  • Open a user to see their full conversation (user, assistant, and system messages) with per-message token counts and relative timestamps.
  • See every tool invocation that happened during each assistant turn — rendered inline in the conversation as a collapsed block above the assistant bubble. Expand to see the args, the result (or error body), and a badge marking MCP vs subagent calls. This is the debug view for figuring out what the agent tried and what came back.
  • Open the per-turn Telemetry block under any assistant message to see the full causal tree that produced it: every tool call (MCP or subagent) at every depth, with args, result, error, and duration. Unlike the inline top-level tool calls, the telemetry tree also surfaces tool calls made inside subagents — so when a subagent's MCP call fails, the real error is right there instead of being paraphrased into the assistant's text.
  • See the long-term memories recalled for that user, tagged as fact or preference.
  • See the LLM-as-judge scores for that user, including mean score per (judge, criterion) and the most recent individual scores with reasoning.
  • Browse configured experiments at /admin/experiments — strategy, sticky-by-user flag, per-variant weights, and bandit-strategy mean scores live-loaded from judges.
  • Run smoke tests at /admin/smoke — a synthetic-user persona drives a real conversation against any agent or experiment, scores fan out through the same judge pipeline, and the run viewer shows the full transcript with persona/assistant turns side by side. Useful for iterating on agent prompts without writing test scaffolding.
  • Mint, monitor, and revoke API tokens at /admin/tokens — issue sk-coulisse-… keys for the /v1/* proxy, each bound to a principal and a spend budget (unlimited, lifetime, or per-month). The list shows current-period and lifetime spend per token; the create form reveals the secret once. See API tokens.
  • Edit, add, or disable agents, judges, experiments, and smoke tests at /admin/agents, /admin/judges, /admin/experiments, and /admin/smoke. Each form is a YAML textarea over the same config shape used in coulisse.yaml. Edits and creations write to the database, never to coulisse.yaml; runtime resolution checks the database first, then falls back to YAML. List views label each row as yaml, dynamic (database-only), override (database shadows YAML), or tombstoned (disabled). Override rows expose a "Reset to YAML" action that drops the database row so the YAML version reasserts. See Agents → Runtime overrides for the full semantics — judges, experiments, and smoke tests follow the same model.
  • Configure infrastructure from the Settings hub at /admin/settings. Each card — providers, MCP servers, memory, telemetry, auth, storage — links to its own editor (/admin/providers, /admin/mcp, /admin/memory, /admin/telemetry, /admin/auth, /admin/storage). Unlike agents/judges/experiments/smoke, these sections write straight to coulisse.yaml (there is no database shadow) and apply after restart. The whole file is validated before anything touches disk, so an invalid edit is rejected and the running config keeps serving.
  • Edit the raw coulisse.yaml at /admin/config/edit — a full-file YAML textarea backed by PUT /admin/config. The power-user escape hatch when you want to change several sections at once or touch a field that has no dedicated card.

Editing config: admin UI = API

Every admin route is content-negotiated. The same URL serves an HTML page in a browser, an HTML fragment to htmx, and JSON to a script — whichever the client's Accept/HX-Request headers ask for. The UI is a thin representation of the API; nothing the UI can do is unavailable to a curl call.

# List agents as JSON (effective merged view: database overrides + YAML)
curl -H 'Accept: application/json' http://localhost:8421/admin/agents

# Update an agent (writes to the database, not to coulisse.yaml)
curl -X PUT http://localhost:8421/admin/agents/bob \
     -H 'Content-Type: application/yaml' \
     --data-binary $'name: bob\nprovider: openai\nmodel: gpt-4o\n'

# Reset an override or tombstone — drops the database row, YAML reasserts
curl -X POST http://localhost:8421/admin/agents/bob/reset

# Read one infrastructure section as JSON
curl -H 'Accept: application/json' http://localhost:8421/admin/telemetry

# Update one section in place (writes that slice back to coulisse.yaml)
curl -X PUT http://localhost:8421/admin/telemetry \
     -H 'Content-Type: application/yaml' \
     --data-binary $'fmt:\n  enabled: true\nsqlite:\n  enabled: true\n'

# Replace the whole config file in one shot (this writes to coulisse.yaml)
curl -X PUT http://localhost:8421/admin/config \
     -H 'Content-Type: application/yaml' \
     --data-binary @coulisse.yaml

The single-section endpoints — /admin/auth, /admin/memory, /admin/storage, /admin/telemetry (plus the collection endpoints /admin/providers and /admin/mcp) — splice just their slice into the file and leave every other key untouched, so a partial write can't clobber an unrelated section.

Writes through /admin/agents, /admin/judges, /admin/experiments, and /admin/smoke go to the database, never to coulisse.yaml. The other sections (/admin/config, providers, MCP, auth, memory, telemetry, storage) write to YAML. The two write paths are independent: editing an agent in the database has no effect on the file you committed to git.

Secrets render in cleartext. The section editors round-trip the raw YAML slice, so provider API keys, basic-auth passwords, OIDC client secrets, and OTLP headers appear in plaintext in the textarea. The admin surface is authenticated (see below) and the values already live in coulisse.yaml, but be aware the studio is not a secrets vault — don't share your screen on the auth editor.

File watcher: hand-edits hot-reload

Coulisse watches coulisse.yaml while it runs. Edit it in your editor, save, and the live config updates without a restart. The validator runs before any reload — a broken edit is logged and the previous in-memory config keeps serving traffic until you fix the file.

What hot-reloads today: the agents list (runtime + admin display), the judges and experiments lists (admin display only — the routing tables that consume them are still rebuilt on restart). What still requires restart: providers, MCP servers, memory backend, telemetry pipeline, auth.

YAML formatting

Admin saves go through serde_yaml round-trip serialization, so comments, blank lines, and key ordering are not preserved. If you want commented config, hand-edit the file — the watcher picks the change up the same way an admin save would. Comment-preserving writes are tracked as a follow-up.

Authentication

The admin surface is gated by the auth.admin scope in coulisse.yaml. Two mutually exclusive modes: HTTP Basic auth (good for local dev) or OIDC single sign-on (appropriate for shared deployments). Exactly one belongs under auth.admin.

The /v1/chat/completions and /v1/models endpoints use the separate auth.proxy scope — they are never gated by admin auth. SDK clients stay cookie-free even when the studio runs behind OIDC.

Basic auth

auth:
  admin:
    basic:
      password: choose-something-strong
      username: admin   # optional, defaults to "admin"

Every /admin/* request must carry Authorization: Basic <base64(user:pass)>. Browsers prompt via the native login dialog and cache credentials per origin.
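
For scripts that call the admin API without a browser, the header can be built by hand. A small Python sketch using the example credentials from the config above; `basic_auth_header` is a hypothetical helper name.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    raw = (username + ":" + password).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Credentials taken from the example auth.admin.basic block.
header = basic_auth_header("admin", "choose-something-strong")
```

With curl, `-u admin:choose-something-strong` produces the same header.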

OIDC (single sign-on)

Works with any OIDC-compliant IdP: Authentik, Keycloak, Auth0, Google, Microsoft, Okta.

auth:
  admin:
    oidc:
      issuer_url:    https://authentik.example.com/application/o/coulisse/
      client_id:     coulisse-admin
      client_secret: <confidential-client-secret>   # omit for public PKCE clients
      redirect_url:  http://localhost:8421/admin/
      scopes:        [email, profile]               # optional; openid is always added

On first request, the user is redirected to the IdP to log in; afterwards an encrypted session cookie keeps them authenticated on /admin/* until it expires (8 hours of inactivity).

Access control (who may log in) is delegated to the IdP. Coulisse treats "successfully authenticated by your IdP" as "authorized admin" — configure the allow-list in the IdP's application policy, not here.

Authentik setup: create a new OAuth2/OpenID Provider and Application, set the redirect URI to the redirect_url above (Authentik allows every subpath of it by default), and point Coulisse at the issuer URL of the provider. Add the application to the groups that should have access via Authentik bindings.

Sessions are in-memory: they evaporate on restart — users re-authenticate silently if their IdP session is still valid, otherwise they see the login page again.

Leaving it open

Omit the auth.admin block to leave the admin surface unauthenticated. That's fine on a loopback-only dev box, but never expose an unauthenticated admin surface to the network. If you'd rather terminate auth at your infrastructure layer, put Coulisse behind a reverse proxy (oauth2-proxy, Cloudflare Access, Caddy's forward_auth), a VPN, or an SSH tunnel.

How it's built

The studio is composed in the cli binary. Each feature crate (memory, telemetry, judges, experiments) owns its own admin module: its routes, its askama templates, and its view models. The cli crate wires them together: a single base.html shell, the auth wrapping, and a tower middleware that wraps non-htmx responses in the layout so bookmarked deep URLs render with full navigation.

Cross-feature views (e.g. tool-call panels inside a conversation page) are filled in via htmx fragments — the conversation page, owned by memory, embeds hx-get requests against telemetry and judges. No feature crate depends on another for its admin surface; the browser orchestrates the composition. Tailwind (loaded via CDN) provides styling, and a small embedded app.js (served at /admin/static/app.js) highlights the active nav item and raises a toast on every save. Everything ships in the single Coulisse binary; there is no separate frontend build step.

Editing the infrastructure sections (auth, memory, storage, telemetry, plus providers and mcp) lives in the cli crate rather than in the feature crates. Those edits only need the shared ConfigPersister trait and the section's own serde shape — not the feature crate's database — so they belong at the config layer that owns coulisse.yaml, not with the runtime/data admin pages the feature crates own.

User identification

Coulisse keeps separate memory per user. To do that, it needs to know who is making each request.

How users are identified

Requests identify the user via one of these fields, in order:

  1. safety_identifier (preferred — matches OpenAI's recent schema)
  2. user (deprecated, but still accepted)

{
  "model": "assistant",
  "safety_identifier": "alice@example.com",
  "messages": [...]
}

The identifier can be anything — an email, an internal user ID, a UUID, an opaque token. Coulisse derives a stable internal UUID from it:

  • If you pass a valid UUID, that's what's used.
  • Otherwise, a deterministic v5 UUID is derived from the string, so the same identifier always maps to the same user.
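
The two rules above map directly onto Python's `uuid` module. This is a sketch only: the namespace Coulisse uses for the v5 derivation is not stated here, so `uuid.NAMESPACE_URL` is a stand-in and the resulting values will not match the server's. `internal_user_id` is a hypothetical name.

```python
import uuid

# Assumption: stand-in namespace so the sketch runs. The namespace
# Coulisse really uses is not documented in this section.
NAMESPACE = uuid.NAMESPACE_URL


def internal_user_id(identifier):
    """Map any identifier string to a stable UUID."""
    try:
        # A valid UUID is used as-is.
        return uuid.UUID(identifier)
    except ValueError:
        # Anything else gets a deterministic v5 UUID.
        return uuid.uuid5(NAMESPACE, identifier)


alice = internal_user_id("alice@example.com")
alice_again = internal_user_id("alice@example.com")
passthrough = internal_user_id("123e4567-e89b-12d3-a456-426614174000")
```

The property that matters is determinism: the same string always lands in the same memory bucket, with no lookup table to maintain.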

Requiring identification

By default, Coulisse requires every request to carry an identifier. Unidentified requests are rejected with an error. This is the safe default: memory only works if you know who you're talking to.

default_user_id: a single-user fallback

For local development or single-user deployments, you can declare a default_user_id in coulisse.yaml. When a request arrives without safety_identifier or user, Coulisse acts as if that default had been passed.

default_user_id: main        # everyone's anonymous requests bucket here

providers:
  anthropic:
    api_key: sk-ant-...

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929

With a default_user_id set:

  • Requests that omit both safety_identifier and user fall back to the default. They get memory like any other user — just scoped to that shared bucket.
  • Requests that do include an identifier still get their own scope.
  • All anonymous requests share one memory bucket and one rate-limit counter, because they all map to the same id.
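
Put together, the lookup order reads as a short fallback chain. A Python sketch under assumptions: `resolve_identifier` and `MissingIdentifier` are hypothetical names, and treating an empty string like a missing field is a guess.

```python
from typing import Optional


class MissingIdentifier(Exception):
    """No identifier on the request and no default configured."""


def resolve_identifier(body: dict, default_user_id: Optional[str] = None) -> str:
    """Pick the user identifier for a request body."""
    for field in ("safety_identifier", "user"):  # preferred field first
        value = body.get(field)
        if value:
            return value
    if default_user_id is not None:
        return default_user_id  # shared bucket for anonymous requests
    raise MissingIdentifier("request carries no safety_identifier or user")


explicit = resolve_identifier({"safety_identifier": "alice@example.com", "user": "legacy"})
legacy = resolve_identifier({"user": "legacy"})
anonymous = resolve_identifier({}, default_user_id="main")

try:
    resolve_identifier({})
    rejected = False
except MissingIdentifier:
    rejected = True
```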

When to set it

Good reasons:

  • Local / single-user setups where you don't want to bother sending an identifier.
  • Small deployments behind an auth layer that handles identity upstream but doesn't want to plumb it through.

Don't set default_user_id in multi-tenant deployments — every user would share one bucket, which defeats isolation. Leave it unset so missing identifiers are rejected.

Trust model

Everything keyed by user — conversation history, long-term memory, semantic recall, per-user MCP OAuth sessions, and rate-limit counters — is partitioned by the identifier on the request. Those partitions are airtight: a query never crosses users, and one user's handle can't reach another user's data.

But understand where the identifier comes from. By default it is asserted by the client in the request body (safety_identifier). In that mode the auth layer gates access to the proxy but does not bind the authenticated principal to the identifier, so any caller who can reach /v1/chat/completions can claim any identifier:

{ "model": "assistant", "safety_identifier": "someone-else", "messages": [...] }

This is the right default for two common shapes, and unsafe for a third:

  • Single-user / local. One identity, nothing to spoof.
  • Trusted first-party backend. A backend that authenticates its own users and sets safety_identifier honestly on their behalf gets full isolation. The identifier-setting boundary lives on a server you control.
  • Untrusted clients calling directly. If end users hold credentials and call Coulisse themselves — each able to send arbitrary JSON — any of them can read or write another user's memory and drive any MCP server that user has authorized, simply by claiming their identifier. Body-asserted identity does not isolate these clients.

Binding identity to the credential

For the third shape, set auth.proxy.identity: from_credential. Coulisse then derives the user from the authenticated principal (the Basic username or the OIDC sub claim) instead of trusting the body's safety_identifier. A request that claims a different identifier is rejected with 403.

auth:
  proxy:
    oidc:
      issuer_url:    https://authentik.example.com/application/o/coulisse/
      client_id:     coulisse-proxy
      client_secret: <secret>
      redirect_url:  http://localhost:8421/v1/
    identity: from_credential

Two rules the server enforces at startup:

  • from_credential requires auth.proxy to configure basic or oidc — you can't bind to a credential that isn't checked.
  • It is mutually exclusive with default_user_id. A shared default bucket would be a silent bypass, so the combination is rejected rather than letting one quietly win.

With Basic, the username is the identity, so each distinct user needs distinct credentials — a single shared username collapses everyone into one bucket. OIDC is the natural fit for many users: each gets a distinct sub automatically. See the auth.proxy.identity reference for the field details.
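
The binding rule is small enough to state as code. This Python sketch is an illustration, not the server's logic: `bound_identity` and `IdentityMismatch` are hypothetical names, and comparing the raw strings (rather than derived UUIDs) is an assumption.

```python
from typing import Optional


class IdentityMismatch(Exception):
    """Corresponds to the 403 for a body that claims another identity."""


def bound_identity(principal: str, claimed: Optional[str]) -> str:
    """Return the identity a request runs as under from_credential.

    `principal` comes from the credential (Basic username or OIDC sub).
    `claimed` is the body's identifier, or None if the body omits it.
    """
    if claimed is not None and claimed != principal:
        raise IdentityMismatch(
            "body claims %r but credential is %r" % (claimed, principal)
        )
    return principal


omitted = bound_identity("alice", None)
matching = bound_identity("alice", "alice")

try:
    bound_identity("alice", "someone-else")
    spoof_rejected = False
except IdentityMismatch:
    spoof_rejected = True
```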

API tokens

Coulisse can issue its own API keys — the same model as the OpenAI dashboard. It mints sk-coulisse-… bearer tokens, stores only their hash, gates the /v1/* proxy on them, tracks how much each token spends, and lets you cap that spend or revoke the token at any time.

This is the recommended way to expose Coulisse beyond loopback: hand each client (a teammate, a script, a deployed app) its own token instead of a shared password, and you get per-token attribution and control for free.

Enabling

Turn the scheme on under the proxy auth scope:

auth:
  proxy:
    tokens: {}   # the empty map is the switch

With this set, every /v1/* request must carry a valid token:

Authorization: Bearer sk-coulisse-…

A missing, unknown, or revoked token gets 401. Point any OpenAI SDK at Coulisse and pass the token as the API key; nothing else changes.

Until auth.proxy.tokens is set, the proxy stays open and any tokens you mint are inert (the studio notes this). The Tokens studio page and the coulisse token CLI are always available, so you can pre-mint before flipping the switch.

Identity binding

Every token binds to a principal — the user id that partitions memory, recall, and rate limits. Token auth therefore always implies credential-bound identity: the request runs as the token's principal, and a request body claiming a different safety_identifier is rejected with 403. Because the identity comes from the token, default_user_id is meaningless here and combining the two is rejected at startup.

Issue multiple tokens with the same principal to give one user several keys (laptop, CI, phone) that share one memory bucket. Issue distinct principals to keep clients fully isolated.

Budgets

Each token carries a spend budget, checked before every call — a request that would exceed it is rejected with 429 insufficient_quota (matching OpenAI's quota response) and no provider call is made:

Budget     Behaviour
unlimited  Never blocks. Spend is still tracked for monitoring.
total      Lifetime cap. Blocks once cumulative spend reaches the limit.
monthly    Per-calendar-month cap (UTC). Resets on the first of each month.

Spend is computed from the same pricing table the cost tracker uses, summed per token in USD. Both streaming and non-streaming turns are charged.
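
The three budget kinds differ only in which charges count toward the cap. A Python sketch under assumptions: `period_spend` and `is_blocked` are hypothetical names, charges are modelled as `(timestamp, micro_usd)` pairs, and "blocked once spend reaches the limit" is taken literally as `>=`.

```python
from datetime import datetime, timezone


def period_spend(charges, budget, now):
    """Sum the micro-USD charges that count against the current period."""
    if budget == "monthly":
        return sum(c for ts, c in charges if (ts.year, ts.month) == (now.year, now.month))
    return sum(c for _, c in charges)  # lifetime total


def is_blocked(charges, budget, limit_micro_usd, now):
    """True if the token may not make another call."""
    if budget == "unlimited" or limit_micro_usd is None:
        return False
    return period_spend(charges, budget, now) >= limit_micro_usd


utc = timezone.utc
charges = [
    (datetime(2025, 1, 30, tzinfo=utc), 15_000_000),  # $15.00 in January
    (datetime(2025, 2, 2, tzinfo=utc), 6_000_000),    # $6.00 in February
]
now = datetime(2025, 2, 10, tzinfo=utc)
limit = 20_000_000  # $20.00

monthly_blocked = is_blocked(charges, "monthly", limit, now)  # only February counts
total_blocked = is_blocked(charges, "total", limit, now)      # $21.00 lifetime
```

Working in integer micro-USD avoids floating-point drift when many small charges are summed.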

Managing tokens

Studio

The Tokens page (under Configure in the studio nav) lists every token with its principal, budget, current-period spend, and lifetime spend. Use the form to mint a new one — the secret is shown once, immediately after creation, and never again. Each active token has a Revoke button.

CLI

# Mint an unlimited token
$ coulisse token create laptop --principal alice
created token 4f3c… for alice (unlimited)
sk-coulisse-9bQ…                       # the secret, on stdout only

# Mint with a $20/month cap
$ coulisse token create ci --principal alice --budget monthly --limit 20

# List tokens with spend
$ coulisse token list
4f3c…  active   laptop                unlimited           spent $1.27  [alice]
a91d…  active   ci                    $20.00 / month      spent $0.04  [alice]

# Revoke
$ coulisse token revoke 4f3c…
revoked 4f3c…

The secret prints to stdout and the context to stderr, so coulisse token create … > key.txt captures only the key. The CLI talks to the same SQLite database the server uses (WAL mode), so tokens minted while the server is running are live immediately.

How it's stored

The auth crate owns two tables in the shared database: api_tokens (the hashed secret, label, principal, budget, and timestamps) and token_usage (one row per charged turn, in integer micro-USD). The plaintext secret exists only in the response to the mint call — Coulisse keeps a SHA-256 digest and nothing more, so a database leak never exposes a usable key.
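
The store-only-the-digest pattern looks like this in outline. A Python sketch, not Coulisse's code: `mint_token` and `matches` are hypothetical names, and everything about the secret beyond the `sk-coulisse-` prefix (length, alphabet) is an assumption.

```python
import hashlib
import hmac
import secrets


def mint_token():
    """Return (plaintext_secret, sha256_hex_digest).

    Only the digest would be written to the database.
    """
    secret = "sk-coulisse-" + secrets.token_urlsafe(32)
    digest = hashlib.sha256(secret.encode("utf-8")).hexdigest()
    return secret, digest


def matches(presented, stored_digest):
    """Check a presented bearer token against the stored digest."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)


secret, digest = mint_token()
accepted = matches(secret, digest)
tampered = matches(secret + "x", digest)
```

A plain SHA-256 is reasonable here because the secret is long and random; password-style slow hashing is meant for low-entropy inputs.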

Rate limiting

Coulisse enforces per-user token limits across three fixed windows: hour, day, and month. Limits are set by the client, per request, not in the YAML, so callers can plug Coulisse into existing quota schemes without redeploying the server.

How it works

  1. Each request carries optional limit hints in its metadata field: tokens_per_hour, tokens_per_day, tokens_per_month.
  2. Before the model is called, Coulisse looks up the user's current usage in each window. If any counter is already at its cap, the request is rejected with 429 Too Many Requests.
  3. If the request passes, Coulisse runs it. On success, the total tokens consumed (request + response) are added to the user's counters.
  4. Counters reset on fixed boundaries: every hour, every 24 hours, every 30 days (aligned to UTC windows from the Unix epoch).
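
Epoch-aligned fixed windows reduce to integer division. The Python sketch below illustrates the arithmetic behind step 4 and the Retry-After value; `WINDOWS`, `window_bounds`, and `retry_after` are hypothetical names, and the window sizes follow the hour / 24 hours / 30 days stated above.

```python
# Window sizes in seconds, as described in step 4.
WINDOWS = {"hour": 3600, "day": 86_400, "month": 30 * 86_400}


def window_bounds(now, window):
    """Return (start, end) of the fixed window containing `now`.

    `now` is a Unix timestamp in whole seconds.
    """
    size = WINDOWS[window]
    start = (now // size) * size
    return start, start + size


def retry_after(now, window):
    """Seconds until the window containing `now` resets."""
    return window_bounds(now, window)[1] - now


now = 1_700_000_000
hour_start, hour_end = window_bounds(now, "hour")  # (1699999200, 1700002800)
wait = retry_after(now, "hour")                    # 2800
```

Because boundaries are fixed rather than sliding, a user who exhausts a cap late in a window regains full quota as soon as the next one starts.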

Sending limits

Put the caps in the metadata object. Values are strings (OpenAI's metadata contract), parsed as non-negative integers:

{
  "model": "assistant",
  "safety_identifier": "alice@example.com",
  "metadata": {
    "tokens_per_hour": "50000",
    "tokens_per_day": "500000",
    "tokens_per_month": "5000000"
  },
  "messages": [
    {"role": "user", "content": "Hi!"}
  ]
}

All three keys are independent and all are optional — send only the windows you care about. Omit the whole metadata object and the request is unlimited.
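
Server-side, reading those hints comes down to validating three optional string fields. A Python sketch for illustration: `parse_limits` and `InvalidRequest` are hypothetical names, and the error text mirrors the 400 response documented under Invalid metadata.

```python
LIMIT_KEYS = ("tokens_per_hour", "tokens_per_day", "tokens_per_month")


class InvalidRequest(ValueError):
    """Maps to a 400 Bad Request."""


def parse_limits(metadata):
    """Extract the optional token caps from a request's metadata.

    Missing keys mean "no cap for that window".
    """
    limits = {}
    for key in LIMIT_KEYS:
        raw = (metadata or {}).get(key)
        if raw is None:
            continue
        if not (raw.isascii() and raw.isdigit()):
            raise InvalidRequest(
                "metadata key '%s' must be a non-negative integer, got '%s'" % (key, raw)
            )
        limits[key] = int(raw)
    return limits


partial = parse_limits({"tokens_per_hour": "50000"})
unlimited = parse_limits(None)

try:
    parse_limits({"tokens_per_hour": "abc"})
    message = None
except InvalidRequest as exc:
    message = str(exc)
```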

When a limit is hit

The server responds with:

  • Status: 429 Too Many Requests
  • Header: Retry-After: <seconds> — time until the offending window resets
  • Body:

{
  "error": {
    "type": "rate_limited",
    "message": "daily token limit exceeded: used 512000/500000, retry after 40213s"
  }
}

The message names which window tripped (hourly, daily, monthly), how many tokens were used, the cap, and the seconds to wait.

Invalid metadata

If a metadata value isn't a valid non-negative integer, the server returns 400 Bad Request:

{
  "error": {
    "type": "invalid_request",
    "message": "metadata key 'tokens_per_hour' must be a non-negative integer, got 'abc'"
  }
}
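A sketch of the validation rule, for callers who want to pre-check values before sending. The function name is hypothetical, and the exact strings the server accepts (for example a leading `+` or surrounding spaces) are not specified here, so this version assumes plain ASCII digits only.

```python
LIMIT_KEYS = ("tokens_per_hour", "tokens_per_day", "tokens_per_month")


def parse_limits(metadata: dict) -> dict:
    """Parse the optional limit hints; raise ValueError on a malformed value."""
    limits = {}
    for key in LIMIT_KEYS:
        if key not in metadata:
            continue  # each window is optional
        raw = metadata[key]
        # Assumption: only plain ASCII digits count as a non-negative integer.
        if not (isinstance(raw, str) and raw.isascii() and raw.isdigit()):
            raise ValueError(
                f"metadata key '{key}' must be a non-negative integer, got '{raw}'"
            )
        limits[key] = int(raw)
    return limits
```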

Scope and isolation

  • Per user. Each user (keyed by safety_identifier or the fallback user field) has isolated counters.
  • Anonymous requests can't be rate-limited. Coulisse needs an identifier. In setups with a default_user_id (see User identification), all anonymous requests share that user's counter.
  • Per process. Counters live in memory. If you run multiple Coulisse instances behind a load balancer, each has its own view — for shared quotas, limit upstream (in a gateway) instead.
  • Lost on restart. Counters are not persisted. This is deliberate for now; durable accounting is on the roadmap.

Why per-request limits instead of YAML?

Quotas usually live in your user/billing system, not your model-routing config. Putting limits in the request lets the caller decide — e.g. your app looks up the user's plan, fills in the numbers, and forwards the request. Coulisse just honors what you send.
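A minimal sketch of that pattern on the caller's side. The plan table and the `build_request` helper are hypothetical stand-ins for whatever your billing system provides; only the request shape comes from this page.

```python
# Hypothetical plan table; in practice this comes from your billing system.
PLAN_LIMITS = {
    "free": {"tokens_per_hour": 5_000, "tokens_per_day": 20_000},
    "pro": {
        "tokens_per_hour": 50_000,
        "tokens_per_day": 500_000,
        "tokens_per_month": 5_000_000,
    },
}


def build_request(user_id: str, plan: str, prompt: str) -> dict:
    """Build a chat request body carrying the caps for the user's plan."""
    limits = PLAN_LIMITS.get(plan, {})
    return {
        "model": "assistant",
        "safety_identifier": user_id,
        # Metadata values must be strings, so the integers are stringified.
        "metadata": {key: str(value) for key, value in limits.items()},
        "messages": [{"role": "user", "content": prompt}],
    }
```

Note that an unknown plan produces empty metadata in this sketch, which means no caps at all. A production caller may prefer to fail closed instead.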

Per-user OAuth for MCP servers

Coulisse can authenticate each end-user independently with third-party MCP servers (Todoist, Atlassian, GitHub, Google, and others) using OAuth 2.0/2.1. When an agent calls a tool on an OAuth-enabled MCP server, Coulisse automatically uses the credentials that the requesting user has authorized.

⚠️ Trust boundary: Coulisse trusts the user_id passed in the chat request's safety_identifier field the same way Stripe trusts a customer_id — it assumes the caller is your authenticated backend, not an end-user directly. If you expose Coulisse's /v1/ endpoint directly to untrusted clients without an auth proxy, any client can claim any user_id and access another user's connected accounts. Always place an auth proxy (your own backend, a gateway, or Coulisse's auth.proxy OIDC scope) between Coulisse and untrusted callers before deploying with OAuth-enabled MCP servers.

Just point at the URL

For a spec-compliant MCP server, you write nothing about OAuth at all. A remote MCP is just a url: entry:

mcp:
  todoist:
    url: https://ai.todoist.net/mcp

URL-based servers get per-user OAuth discovery + Dynamic Client Registration automatically. Tokens land in Coulisse's per-user vault, keyed by (server, user_id) — no Node process, no shared on-disk cache, no browser-callback port. The transport is inferred from the path (/sse in the path → SSE, otherwise streamable HTTP); force it with an explicit transport: http|sse when the path doesn't make it obvious.
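The transport inference can be pictured like this. It is a sketch under one assumption: the match is a simple substring test on the URL path, which may differ from the exact rule Coulisse applies.

```python
from typing import Optional
from urllib.parse import urlparse


def infer_transport(url: str, explicit: Optional[str] = None) -> str:
    """Pick the MCP transport: an explicit setting wins, else look at the path."""
    if explicit is not None:
        if explicit not in ("http", "sse"):
            raise ValueError(f"transport must be 'http' or 'sse', got '{explicit}'")
        return explicit
    # Assumption: any "/sse" occurrence in the path selects SSE.
    return "sse" if "/sse" in urlparse(url).path else "http"
```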

If you need to tune the flow, the oauth: block has three uses:

  • Disable auth on a public, no-auth HTTP MCP: oauth: false.
  • Set scopes while keeping automatic discovery: oauth: { scopes: [a, b] } (mode defaults to discover).
  • Static credentials for a provider without Dynamic Client Registration: oauth: { mode: static, ... } (see below).

Servers that only honour mcp-remote's client id

A few providers (Todoist today) haven't opened registration and only accept the grandfathered client id baked into the mcp-remote CLI. For those, declare the stdio command form yourself and let mcp-remote carry the token:

mcp:
  todoist:
    command: npx
    args: [-y, mcp-remote, https://ai.todoist.net/mcp]

This runs mcp-remote as a stdio child the normal way; Coulisse doesn't rewrite or inspect it. Use the plain url: form for any server that supports Dynamic Client Registration.

Two flavours

oauth: blocks come in two modes, picked with the mode: discriminator (which defaults to discover, so you only write mode: to select static):

  • mode: discover (default) — MCP-spec OAuth 2.1 with discovery + Dynamic Client Registration. Coulisse reads the provider's authorization-server metadata from <mcp_origin>/.well-known/oauth-authorization-server and registers itself as a client on first use. No credentials in YAML. This is the right choice for modern MCP servers — Todoist, Atlassian (mcp.atlassian.com), Linear, and so on, and is what a bare url: uses.
  • mode: static — classic OAuth 2.0 with pre-registered app credentials. You register Coulisse as a client at the provider's developer console and paste the resulting client_id / client_secret here. Use this for providers that don't support Dynamic Client Registration.

Both modes drive the same per-user token flow: tokens are stored in the vault keyed by (server_name, user_id), never shared across users.

How it works

  1. Tool call hits NotConnected: The user makes a chat request, the agent calls a tool on the MCP server, Coulisse looks up (server, user_id) in the vault, finds no token, and returns a NotConnectedTool placeholder whose tool result contains a per-user, single-use connect URL built from the HMAC key. The LLM reads that result and relays the URL to the user.

    For agents that haven't pinned an only: list (the common case — "give the agent every tool the server exposes"), Coulisse can't know the real tool schemas until someone has authorised at least once. Until then it surfaces a single sentinel tool named connect_<server> whose description tells the LLM to call it when the user asks to use that server. Calling it returns the same per-user connect URL. Once the user authorises, the sentinel goes away and the real tool list takes its place transparently.

  2. User clicks the link: lands on GET /mcp/{server}/connect?token=… on Coulisse. Coulisse validates the HMAC, then for discover mode only, lazily runs discovery + Dynamic Client Registration if it hasn't yet (cached in mcp_oauth_clients afterwards). Discovery is a two-step walk: first <mcp_origin>/.well-known/oauth-protected-resource (RFC 9728) to find which issuer hosts the authorization server (Todoist's MCP lives on ai.todoist.net, its auth server lives on todoist.com), then <issuer>/.well-known/oauth-authorization-server (RFC 8414) for the actual endpoints. Coulisse then 302s to the provider's authorization_endpoint.

  3. User authorizes: signs into their own account at the provider, sees a consent screen, and the provider redirects back to Coulisse's callback.

  4. Token stored: Coulisse exchanges the code for tokens and stores them encrypted in mcp_oauth_tokens under the user's id.

  5. Subsequent tool calls succeed: the next chat turn on the same user_id spawns a real per-user MCP session backed by the stored token.

Every user authorizes independently. Alice's token is never usable by Bob — they have separate vault rows, separate MCP sessions, and separate consent flows.
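The two discovery URLs from step 2 can be derived like this. The helper names are illustrative, and the sketch follows the simplified `<issuer>/.well-known/...` form used on this page; issuers that carry a path component may need the well-known segment placed differently.

```python
from urllib.parse import urlsplit


def origin(url: str) -> str:
    """Scheme and host of a URL, without path or query."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def protected_resource_url(mcp_url: str) -> str:
    """Step one: ask the MCP origin which issuer hosts its authorization server."""
    return origin(mcp_url) + "/.well-known/oauth-protected-resource"


def authorization_server_url(issuer: str) -> str:
    """Step two: fetch the issuer's metadata for the actual endpoints."""
    return issuer.rstrip("/") + "/.well-known/oauth-authorization-server"
```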

YAML configuration

public_base_url: http://localhost:8421   # see "Public base URL" below

mcp:
  todoist:
    url: https://ai.todoist.net/mcp
    # oauth is implied; add a block only to override scopes:
    # oauth: { scopes: [data:read_write] }

auth:
  mcp_consumer_secret: "${COULISSE_MCP_SECRET}"

Nothing else to fill in — a bare url: already implies discover-mode OAuth. Coulisse handles discovery and DCR on first use.

Static mode (for non-DCR providers)

mcp:
  jira:
    url: https://mcp.atlassian.example.com
    oauth:
      mode: static
      authorization_url: https://auth.atlassian.com/authorize
      client_id: "${JIRA_CLIENT_ID}"
      client_secret: "${JIRA_CLIENT_SECRET}"
      redirect_uri: https://coulisse.example.com/mcp/jira/oauth/callback
      scopes:
        - read:jira-work
        - write:jira-work
      token_url: https://auth.atlassian.com/oauth/token

auth:
  mcp_consumer_secret: "${COULISSE_MCP_SECRET}"

oauth: block fields

Field | Mode | Description
--- | --- | ---
mode | both | discover or static
scopes | both | OAuth scopes to request (optional; discover falls back to scopes_supported)
authorization_url | static | Provider's OAuth authorize endpoint
client_id | static | OAuth application client ID
client_secret | static | OAuth application client secret; ${ENV} expansion supported
redirect_uri | static | Must match what you registered with the provider
token_url | static | Provider's token exchange endpoint

For discover mode, the redirect_uri is computed automatically from public_base_url as {public_base_url}/mcp/{server}/oauth/callback. The authorization, token, and registration endpoints all come from discovery.

Public base URL

Coulisse needs to know its own externally reachable URL to build OAuth redirect URIs and the per-user connect links surfaced to LLMs:

public_base_url: https://coulisse.example.com   # no trailing slash

If omitted, defaults to http://localhost:{port}, which is right for personal and local-dev setups. Set it explicitly when Coulisse runs behind a tunnel, reverse proxy, or on a public hostname — the same value must match whatever the OAuth provider sees as the redirect URI host.

Secrets (zero config by default)

Coulisse needs two long-lived 32-byte secrets when an OAuth-enabled MCP server is configured:

  • vault key — encrypts stored tokens (and any cached DCR client_secret) at rest with AES-256-GCM
  • HMAC key — signs the per-user connect links Coulisse mints for the LLM, plus the OAuth state token

You don't have to manage these for local use. On first boot Coulisse generates both and writes them to .coulisse/secrets.env (mode 0600, already .gitignored), then reuses the file on every subsequent start. Back this file up. Losing it invalidates every token in mcp_oauth_tokens — users have to re-authorize each connected MCP server.

For deployments that source secrets from a vault/k8s/CI, set them as environment variables and Coulisse will use those instead of touching the on-disk file:

Variable | Purpose
--- | ---
COULISSE_VAULT_KEY | 32 bytes, base64-encoded. Overrides the on-disk vault key.
COULISSE_HMAC_KEY | 32 bytes, base64-encoded. Overrides the on-disk HMAC key.

Both are optional. Resolution order: env vars > .coulisse/secrets.env > generated on the fly.
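The resolution order can be sketched as below, along with a helper for generating a suitable value. This is illustrative only: the function names are hypothetical, standard (not URL-safe) base64 is assumed, and the sketch assumes the secrets file uses the same variable names as the environment.

```python
import base64
import secrets
from typing import Mapping


def new_key_b64() -> str:
    """Generate a fresh 32-byte key, base64-encoded for use as an env var."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")


def resolve_key(name: str, env: Mapping[str, str], file_values: Mapping[str, str]) -> bytes:
    """Resolve a key: env var first, then the secrets file, then generate one."""
    encoded = env.get(name) or file_values.get(name)
    if encoded is None:
        return secrets.token_bytes(32)
    key = base64.b64decode(encoded, validate=True)
    if len(key) != 32:
        raise ValueError(f"{name} must decode to 32 bytes, got {len(key)}")
    return key
```

In a real deployment `env` would be `os.environ` and `file_values` the parsed contents of .coulisse/secrets.env.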

One additional optional secret gates the admin endpoint only:

Variable | Purpose
--- | ---
COULISSE_MCP_SECRET (via auth.mcp_consumer_secret) | Arbitrary string. When set, gates POST /mcp/{server}/connect-link. When unset, that endpoint returns 503 and the per-user GET /connect flow keeps working.

Endpoints

Coulisse exposes three OAuth-related HTTP routes:

GET /mcp/{server}/connect

The user-facing route. The URL Coulisse mints inside NotConnectedTool looks like this and is what the LLM hands the user:

{public_base_url}/mcp/{server}/connect?token={hmac_signed_token}

The token is HMAC-signed with COULISSE_HMAC_KEY and embeds the user_id plus a 10-minute expiry. The handler:

  1. Validates the HMAC and expiry.
  2. For discover mode: ensures the server is registered (lazily runs discovery + DCR on the first hit; reuses the cached mcp_oauth_clients row on subsequent hits).
  3. 302-redirects to the provider's authorization endpoint with a fresh state token carrying the same user_id.
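To make the token check concrete, here is a sketch of an HMAC-signed token carrying a user id and an expiry. The token layout (base64 payload, a dot, a hex MAC) is an assumption for illustration; Coulisse's actual wire format is not documented here, and single-use enforcement is not modelled.

```python
import base64
import hashlib
import hmac
import json


def sign(key: bytes, user_id: str, now: int, ttl: int = 600) -> str:
    """Mint a token that embeds user_id and expires `ttl` seconds after `now`."""
    payload = json.dumps({"user_id": user_id, "exp": now + ttl}).encode("utf-8")
    body = base64.urlsafe_b64encode(payload).decode("ascii")
    mac = hmac.new(key, body.encode("ascii"), hashlib.sha256).hexdigest()
    return f"{body}.{mac}"


def verify(key: bytes, token: str, now: int) -> str:
    """Return the embedded user_id, or raise ValueError if tampered or expired."""
    body, _, mac = token.rpartition(".")
    expected = hmac.new(key, body.encode("ascii"), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(mac, expected):
        raise ValueError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(body))
    if now >= claims["exp"]:
        raise ValueError("expired")
    return claims["user_id"]
```

The signature is checked before the payload is parsed, so an attacker-controlled payload never reaches the JSON decoder.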

POST /mcp/{server}/connect-link

Admin-facing alternative. Bearer-authed with COULISSE_MCP_SECRET. Useful when your backend wants to email a user a connect link without going through the LLM's tool result:

POST /mcp/{server}/connect-link?user_id=<user_id>
Authorization: Bearer <mcp_consumer_secret>

Response 200:

{ "url": "https://...provider.../authorize?client_id=...&state=<signed_token>" }

Hand this URL to your end-user. Valid for 10 minutes.

Error codes:

Code | Reason
--- | ---
401 | Wrong or missing consumer secret
404 | Server name not found in config
422 | user_id query parameter missing, or server exists but has no oauth: block
502 | Discovery or DCR failed (discover mode only; check Coulisse logs)
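From a backend, the call might be prepared as follows, using only the Python standard library. The helper name is hypothetical; the route, query parameter, and bearer header come from the description above.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def connect_link_request(base_url: str, server: str, user_id: str, consumer_secret: str) -> Request:
    """Build (but do not send) the POST that asks Coulisse for a connect link."""
    query = urlencode({"user_id": user_id})
    url = f"{base_url.rstrip('/')}/mcp/{quote(server, safe='')}/connect-link?{query}"
    return Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {consumer_secret}"},
    )
```

Pass the result to `urllib.request.urlopen` to send it; a 200 response carries the JSON body with the `url` field, and the error codes in the table above arrive as `urllib.error.HTTPError`.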

GET /mcp/{server}/oauth/callback

The provider's redirect target. Coulisse validates the state HMAC, exchanges the authorization code for tokens, stores them encrypted in SQLite, and shows an HTML success page to the user.

A tampered or expired state returns HTTP 400.

Token + client storage

Two tables under the shared SQLite database, both maintained by the mcp crate's schema migrator:

  • mcp_oauth_tokens — encrypted per-user tokens keyed by (server_name, user_id). AES-256-GCM with the nonce prepended. Connecting again overwrites the previous token.
  • mcp_oauth_clients — cached Dynamic Client Registration for discover mode servers. One row per server. The client_secret is encrypted when present; the metadata_json document is stored plaintext (the provider's authorization-server metadata isn't a secret). Coulisse-wide, not per-user — the client_id identifies the Coulisse instance, not the end user.

Per-user session lifecycle

stdio transport: Each (user_id, server_name) gets its own spawned process on first use, held in an LRU cache (cap: 256 by default, idle timeout: 30 minutes). The access token is passed as the MCP_OAUTH_TOKEN environment variable. (Most spec-compliant MCP servers use HTTP transport — the stdio path is for servers you launch via an explicit command:, such as a self-declared npx mcp-remote <url> shim.)

HTTP transport: A per-user connection is established with Authorization: Bearer <token> as a default header. Same LRU cache applies.
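The cache behaviour (capacity cap plus idle timeout) can be sketched with an ordered dictionary. This is an illustration of the policy, not Coulisse's code: the class name and the explicit `now` parameter are assumptions, and a real cache would also shut down the evicted process or connection.

```python
from collections import OrderedDict


class SessionCache:
    """LRU cache of per-user sessions with an idle timeout."""

    def __init__(self, capacity: int = 256, idle_timeout: int = 30 * 60):
        self.capacity = capacity
        self.idle_timeout = idle_timeout
        self._entries = OrderedDict()  # key -> (session, last_used), oldest first

    def get_or_create(self, key, factory, now: float):
        # Entries are ordered by last use, so stale ones sit at the front.
        while self._entries:
            oldest_key, (_, last_used) = next(iter(self._entries.items()))
            if now - last_used < self.idle_timeout:
                break
            del self._entries[oldest_key]
        if key in self._entries:
            session, _ = self._entries.pop(key)
        else:
            session = factory()
            if len(self._entries) >= self.capacity:
                self._entries.popitem(last=False)  # evict least recently used
        self._entries[key] = (session, now)  # re-insert as most recently used
        return session
```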

When a user hasn't connected yet

If an agent calls a tool on an OAuth-enabled MCP server and the calling user has no stored token (or the token is expired), the tool returns a placeholder result containing the connect URL. The LLM reads it and relays it to the user:

Not connected: the user has not authorized access to the 'todoist' MCP server. Show them this link and ask them to open it to link their account — the link is single-use and tied to their identity, do not share it with anyone else: http://localhost:8421/mcp/todoist/connect?token=…

This is a tool result, not a 500 error. The user clicks the link, authorizes, and the next chat turn just works. No backend intervention required for the common case.

Per-user memory

Every request gets an isolated, persistent memory scope based on its user identity. In users: per-request mode, that identity comes from safety_identifier (or the deprecated user field) on each request; in the default users: shared mode, every request shares one hardcoded identity (and one memory bucket). See User identification. Coulisse tracks two kinds of memory:

  • Conversation history — the running transcript of messages the user has exchanged. Always on.
  • Long-term user state — durable facts and preferences, embedded for semantic recall. Off by default; opt in with user_state: true.

You don't manage either of these by hand — both are wired into every request automatically. When user_state is on, Coulisse also decides what is worth remembering after each turn.

What happens on each request

  1. Coulisse identifies the user — from safety_identifier / user in per-request mode, or from the shared identity in shared mode.
  2. It pulls the user's recent messages, fitting as many as possible into the context window.
  3. If long-term user state is on, it runs a semantic recall against the user's stored facts and picks the top matches.
  4. It builds the final prompt: agent preamble → recalled facts (if any) → recent history → new message.
  5. The model's reply is sent back and saved to the user's transcript.
  6. If user_state is on, a background task asks a cheap model "any durable facts to remember from this exchange?" and stores novel ones.

Step 6 does not block the HTTP response — the user gets their answer first; long-term memory grows in the background.
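Steps 2 to 4 amount to a prompt-assembly routine along these lines. It is a sketch under stated assumptions: the function name, the `history_budget` parameter, and the character-count stand-in for token counting are all illustrative.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def build_prompt(
    preamble: str,
    facts: List[str],
    history: List[Message],
    new_message: str,
    history_budget: int,
    count_tokens: Callable[[str], int] = len,  # stand-in for a real tokenizer
) -> List[Message]:
    """Assemble: agent preamble, recalled facts (if any), recent history, new message."""
    prompt = [{"role": "system", "content": preamble}]
    if facts:
        block = "Known about the user:\n" + "\n".join(f"- {fact}" for fact in facts)
        prompt.append({"role": "system", "content": block})
    kept = []
    remaining = history_budget
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message["content"])
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    prompt.extend(reversed(kept))  # restore chronological order
    prompt.append({"role": "user", "content": new_message})
    return prompt
```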

Isolation guarantees

User isolation is enforced by the API: Store::for_user(id) returns a handle scoped to a single user, and every SQL query bound through it filters on that user id. There is no code path that mixes data across users.

How long-term recall works

When user_state: true, Coulisse embeds each stored fact as a vector at write time. On every request, it embeds the incoming user message and retrieves the top-k most similar facts by cosine similarity. That's how context from a conversation two weeks ago can surface when it becomes relevant again.

The recalled facts are formatted as a system block titled Known about the user: and injected into the prompt before the conversation history.
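As an illustration of the recall step, here is top-k retrieval by cosine similarity over a small in-memory list. The data layout and the function names are assumptions; the real store keeps vectors in SQLite.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(query: Sequence[float], memories: List[Tuple[str, Sequence[float]]], k: int = 3) -> List[str]:
    """Return the text of the k stored facts most similar to the query vector."""
    ranked = sorted(memories, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```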

Auto-extraction ("remember what matters")

When user_state: true, every completed exchange fires a background task that:

  1. Sends the last user-turn + assistant-turn to a cheap model with a focused prompt: "list any durable facts or preferences about the user; return [] if nothing worth keeping."
  2. Parses the JSON response.
  3. For each extracted fact, calls remember_if_novel — which embeds the fact and skips it if cosine similarity against an existing memory exceeds dedup_threshold (default 0.9).

Failures (bad JSON, timeout, provider error) are logged at warn and swallowed — the user already got their response. Extraction is best-effort.
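The novelty check in step 3 can be sketched as follows. Only the name remember_if_novel and the default threshold come from the text; the signature and the list-backed store are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def remember_if_novel(
    fact: str,
    vector: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    dedup_threshold: float = 0.9,
) -> bool:
    """Store the fact unless it is too similar to an existing memory."""
    for _, existing in store:
        if cosine(vector, existing) > dedup_threshold:
            return False  # near-duplicate: skip
    store.append((fact, vector))
    return True
```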

To disable, omit the user_state: field or set it to false. Conversation history is unaffected either way.

Embedders

Provider | Supported models | Notes
--- | --- | ---
openai | text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002 | Default pairing for OpenAI-first setups.
voyage | voyage-3.5, voyage-3-large, voyage-3.5-lite, voyage-code-3, voyage-finance-2, voyage-law-2, voyage-code-2 | Anthropic officially recommends Voyage for embeddings. Requires an explicit api_key.
hash | n/a | Deterministic bag-of-words, offline only. No semantic understanding; use only for tests and air-gapped development.

When user_state: true and you don't pin an embedder explicitly, Coulisse picks one for you (see auto-derivation). Startup logs the chosen embedder.

What gets stored where

Data | Scope | Lifetime
--- | --- | ---
Conversation messages | Per user | SQLite (messages table)
Long-term memories + vectors | Per user | SQLite (memories table, BLOB embeddings)
Tool invocations | Per user | SQLite (tool_calls table, linked to messages.id)
Judge scores | Per user | SQLite (scores table, linked to messages.id)
User identifier → internal ID | Shared | Derived deterministically; no storage needed

Each memory row carries the id of the embedder that produced it. If you swap the embedder, old vectors become ineligible for recall (they'd be scored in the wrong space). They stay in the database but are silently ignored until you re-embed them.

Storage location

The database lives at .coulisse/coulisse-memory.db — the project state directory next to your coulisse.yaml, shared with the log, PID, MCP secrets, and uploaded files. The path is not configurable; everything Coulisse persists stays under .coulisse/.

Docker

Mount the .coulisse/ directory so the database (and the rest of Coulisse's state) survives container restarts:

docker run \
  -v $(pwd)/.coulisse:/app/.coulisse \
  -v $(pwd)/coulisse.yaml:/app/coulisse.yaml:ro \
  -p 8421:8421 \
  coulisse

See memory configuration for the full YAML schema.

File attachments (OpenAI-compatible storage)

Coulisse exposes a /v1/files API that matches the OpenAI Files API shape exactly. Any OpenAI-compatible SDK works without modification.

What this lets you do

  • Upload a file once, reference it by file_id in any subsequent chat request.
  • Pass multimodal content (images, PDFs, text) to an LLM backend that supports it — Coulisse stores the file and forwards it transparently.
  • Set a storage quota so the disk never fills up (oldest files evicted first).

Endpoints

Method | Path | Description
--- | --- | ---
POST | /v1/files | Upload a file (multipart/form-data)
GET | /v1/files | List all uploaded files
GET | /v1/files/:id | Get metadata for one file
GET | /v1/files/:id/content | Download file content
DELETE | /v1/files/:id | Delete a file (idempotent)

Upload example

curl -X POST http://localhost:8421/v1/files \
  -F "file=@cv.pdf;type=application/pdf" \
  -F "purpose=assistants"

Response:

{
  "id": "file-01j9abc...",
  "object": "file",
  "bytes": 42381,
  "created_at": 1722000000,
  "filename": "cv.pdf",
  "purpose": "assistants",
  "content_type": "application/pdf"
}

Then reference the file in a chat request:

{
  "model": "gpt-4o",
  "messages": [{
    "role": "user",
    "content": [
      { "type": "text", "text": "Summarise this CV in three bullet points." },
      { "type": "input_file", "file_id": "file-01j9abc..." }
    ]
  }]
}

Configuration

Add a storage: block to coulisse.yaml. Everything has a default — if you omit the block, the filesystem backend is used with no quota. Filesystem blobs always live under .coulisse/files, next to your config; the path is not configurable.

storage:
  backend: fs           # "fs" (default) or "s3"
  max_file_bytes: 52428800   # 50 MB per file — omit for no limit
  max_total_bytes: 524288000 # 500 MB total — omit for no limit

Docker: mount the .coulisse/ directory to persist uploaded files (and the rest of Coulisse's state) across container restarts.

S3-compatible backend

Swap backend: s3 to store blobs in AWS S3, Cloudflare R2, or MinIO:

storage:
  backend: s3
  s3:
    bucket: my-coulisse-files
    region: eu-west-3
    # endpoint_url: http://localhost:9000  # for MinIO / local S3
  max_file_bytes: 52428800

Credentials are read from the standard AWS credential chain (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars, IAM role, ~/.aws/credentials, etc.).

Note: Set endpoint_url when using MinIO or another self-hosted S3-compatible service — path-style addressing is enabled automatically in that case.

Allowed file types

Coulisse validates file content via magic bytes (not just the declared Content-Type) and rejects anything outside this list:

  • text/*
  • image/*
  • application/pdf
  • application/json
  • application/octet-stream

Attempting to upload an executable or other unsupported type returns 415 Unsupported Media Type.

Storage limits and eviction

Setting | Default | Effect
--- | --- | ---
max_file_bytes | no limit | 413 Payload Too Large if exceeded
max_total_bytes | no limit | Oldest file is deleted to make room

Eviction is FIFO: when a new upload would push the total over max_total_bytes, the oldest file (by created_at) is deleted first, then the next oldest, until there is room.
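A sketch of that eviction rule, assuming files are tracked as (file_id, created_at, size) tuples. The tuple layout and function name are illustrative.

```python
from typing import List, Tuple


def files_to_evict(
    files: List[Tuple[str, int, int]],  # (file_id, created_at, size in bytes)
    incoming_bytes: int,
    max_total_bytes: int,
) -> List[str]:
    """Return the ids to delete, oldest first, so the new upload fits."""
    total = sum(size for _, _, size in files)
    evicted = []
    for file_id, _, size in sorted(files, key=lambda f: f[1]):
        if total + incoming_bytes <= max_total_bytes:
            break
        evicted.append(file_id)
        total -= size
    return evicted
```

If the incoming file is larger than max_total_bytes on its own, this sketch would evict everything and still not fit; a real implementation would reject such an upload before evicting anything.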

S3 caveat: quota accounting is best-effort under concurrent load — two simultaneous uploads might both pass the check and briefly exceed the limit. The next upload will evict back within bounds.

Deduplication

Coulisse computes a SHA-256 of each uploaded file. If you upload the same bytes twice, the second call returns the same file_id: no additional storage is consumed and the blob is not written a second time.
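Content-addressed deduplication of this kind can be sketched in a few lines. The class and the id scheme below are hypothetical; only the SHA-256 keying reflects the text.

```python
import hashlib


class BlobStore:
    """In-memory store that returns the same id for identical bytes."""

    def __init__(self):
        self._id_by_digest = {}  # sha256 hex digest -> file_id
        self._blobs = {}         # file_id -> bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._id_by_digest:
            return self._id_by_digest[digest]  # duplicate: nothing is written
        file_id = f"file-{len(self._id_by_digest) + 1}"  # hypothetical id scheme
        self._id_by_digest[digest] = file_id
        self._blobs[file_id] = data
        return file_id

    def blob_count(self) -> int:
        return len(self._blobs)
```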

v1 limitation — deduplication is global, not per-user. If two different callers upload identical bytes, they receive the same file_id and share the underlying blob. A DELETE by either caller removes the file for both. This is safe when Coulisse runs as a single-tenant tool (one team, one trusted process). Do not expose Coulisse to mutually untrusted users until per-user deduplication is implemented (tracked in #61).

What Coulisse does NOT do

Coulisse does not parse, extract, or summarise file content. It stores the bytes and forwards them to the LLM backend. If the model supports the file type (e.g. GPT-4o reads PDFs natively), it will process it. If it does not, the request fails at the LLM level — Coulisse surfaces the error as-is.

If you want structured extraction (e.g. parse a CV into memory facts), that is a pattern you implement with a Coulisse agent that calls memory.put — see the per-user memory chapter.

Structured outputs

Coulisse lets the caller pin the shape of the reply, not just its language. Send a JSON Schema and you get back a JSON value that conforms to it — validated server-side before it ever reaches you.

This is the same response_format field OpenAI's API exposes, so existing SDK calls work unchanged. The difference: Coulisse enforces it for every provider, including models that have no native structured-output mode. The schema is taught to the model through the system preamble and the reply is validated (and repaired) on the way out, so anthropic, gemini, groq, cohere, and deepseek behave the same as openai.

How to send it

Add a response_format object to the request. Two shapes are supported.

Any JSON object

{
  "model": "assistant",
  "safety_identifier": "user-123",
  "messages": [{"role": "user", "content": "Give me a config skeleton"}],
  "response_format": {"type": "json_object"}
}

The reply is guaranteed to be a single valid JSON value — no markdown fences, no prose.

A specific JSON Schema

{
  "model": "assistant",
  "safety_identifier": "user-123",
  "messages": [{"role": "user", "content": "Extract the person from: Ada Lovelace, 36"}],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "person",
      "description": "a single person record",
      "schema": {
        "type": "object",
        "properties": {
          "age": {"type": "integer"},
          "name": {"type": "string"}
        },
        "required": ["age", "name"],
        "additionalProperties": false
      }
    }
  }
}

The json_schema object mirrors OpenAI's: name (required), schema (required, a standard JSON Schema), and optional description and strict. The reply is validated against schema before it's returned.

Omit response_format entirely (or send {"type": "text"}) for a normal free-form reply.

How it reaches the model

Coulisse appends a short instruction to the system preamble before calling the provider — for a json_schema request it embeds the schema, its name, and (if given) its description, and tells the model to emit only the raw JSON value. Your own system messages and the agent's coulisse.yaml preamble are preserved.

After the model replies, Coulisse:

  1. Extracts the JSON, tolerating a stray markdown code fence if the model added one.
  2. Validates it — parses it, and for json_schema checks it against the schema.
  3. Returns the cleaned JSON as the reply content (re-serialized, so any surrounding prose or fences are stripped).
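Steps 1 and 3 might look like the following sketch. It handles a single wrapping code fence and re-serializes the parsed value; stripping arbitrary surrounding prose and checking against a schema are left out, and the function name is an assumption.

```python
import json
import re

TICKS = "`" * 3  # a markdown code-fence marker
FENCED = re.compile(
    r"^\s*" + TICKS + r"[\w-]*[ \t]*\n(.*?)\n?[ \t]*" + TICKS + r"\s*$",
    re.DOTALL,
)


def extract_json(reply: str) -> str:
    """Strip one optional code fence, parse the JSON, and re-serialize it."""
    match = FENCED.match(reply)
    body = match.group(1) if match else reply
    value = json.loads(body)  # raises ValueError on invalid JSON
    return json.dumps(value, separators=(",", ":"))
```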

Repair on failure (non-streaming)

If validation fails, Coulisse re-prompts the same model with its own invalid reply plus the exact validation error, up to two times. Each retry is targeted ("you were missing the required field age"), not a blind re-roll. Token usage across every attempt is summed into the response's usage so billing and rate limits stay accurate.

If the reply still doesn't validate after the retries, the request fails with 502 Bad Gateway — the model couldn't comply.
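The repair loop can be sketched as below. Everything here is illustrative: `call_model` is a stand-in that returns a reply and its token count, `validate_person` is a toy validator for one schema (a real one would use a JSON Schema library), and the wording of the repair prompt is an assumption.

```python
import json
from typing import Callable, List, Optional, Tuple


def validate_person(reply: str) -> Optional[str]:
    """Return None if valid, else a targeted error message."""
    try:
        value = json.loads(reply)
    except ValueError as exc:
        return f"not valid JSON: {exc}"
    if not isinstance(value, dict):
        return "expected a JSON object"
    for key in ("age", "name"):
        if key not in value:
            return f'"{key}" is a required property'
    return None


def complete_with_repair(
    call_model: Callable[[List[dict]], Tuple[str, int]],
    messages: List[dict],
    validate: Callable[[str], Optional[str]],
    max_repairs: int = 2,
) -> Tuple[str, int]:
    """Call the model, re-prompting with the validation error up to max_repairs times."""
    conversation = list(messages)
    total_tokens = 0
    for _ in range(max_repairs + 1):
        reply, tokens = call_model(conversation)
        total_tokens += tokens  # usage is summed across every attempt
        error = validate(reply)
        if error is None:
            return reply, total_tokens
        conversation += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": f"Invalid reply: {error}. Send only the corrected JSON."},
        ]
    raise RuntimeError("response did not match the schema after repair retries")
```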

Streaming

With stream: true, the instruction is injected the same way and tokens stream to you as usual. Coulisse validates the full accumulated reply once the stream ends. Because already-streamed tokens can't be retracted, a validation failure surfaces as an SSE error event rather than a repair retry — so for guaranteed-valid-or-error semantics, prefer non-streaming requests with structured output.

Errors

Status | When
--- | ---
400 | The supplied JSON Schema itself is malformed (rejected before any model call).
502 | The model's reply never validated, even after repair retries.

Example 502 body:

{
  "error": {
    "type": "upstream_error",
    "message": "response did not match the schema: \"age\" is a required property"
  }
}

Response language

Coulisse lets the caller pin the language the model replies in. Without it, the model infers language from the user's message — which can drift when the user switches languages mid-conversation or types short, ambiguous prompts. With it, every response comes back in the language you asked for.

Language is set per request, via the metadata object. The caller decides — Coulisse doesn't maintain a user-level language preference.

How to send it

Add a language key to metadata. The value is a BCP 47 tag (RFC 5646):

{
  "model": "assistant",
  "safety_identifier": "user-123",
  "metadata": {
    "language": "fr-FR"
  },
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}

Any valid BCP 47 tag works: en, fr, fr-FR, es-MX, zh-Hant, ja-JP. The tag is validated — malformed values come back as 400 Bad Request. Omit the key entirely to let the model pick.

How it reaches the model

Coulisse appends a short instruction to the system preamble before calling the provider, something like: "Always reply in French, even when the user writes in a different language. Do not include translations in any other language." The instruction is phrased as a hard constraint so the model doesn't mirror the user's language or append a parenthetical translation. For tags in the built-in language-name table (common ISO 639-1 subtags: en, fr, es, de, it, pt, ja, zh, ko, ar, nl, pl, ru, sv, tr, hi), the instruction uses the English name. For anything else, the raw tag is passed through; frontier models understand BCP 47 directly, so cy (Welsh) works fine.

The instruction is added once per request, as the first system message. Your own system messages in the messages array still apply, and agent preambles from coulisse.yaml are preserved.
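A sketch of how such an instruction could be derived from a tag. The lookup by primary subtag (so fr-FR maps to French) is an assumption, the table simply spells out the subtags listed above, and BCP 47 validation is omitted.

```python
# English names for the subtags listed above.
LANGUAGE_NAMES = {
    "en": "English", "fr": "French", "es": "Spanish", "de": "German",
    "it": "Italian", "pt": "Portuguese", "ja": "Japanese", "zh": "Chinese",
    "ko": "Korean", "ar": "Arabic", "nl": "Dutch", "pl": "Polish",
    "ru": "Russian", "sv": "Swedish", "tr": "Turkish", "hi": "Hindi",
}


def language_instruction(tag: str) -> str:
    """Build the system instruction for a language tag (validation not shown)."""
    primary = tag.split("-", 1)[0].lower()
    # Assumption: look up the primary subtag; unknown tags pass through as-is.
    name = LANGUAGE_NAMES.get(primary, tag)
    return (
        f"Always reply in {name}, even when the user writes in a different "
        "language. Do not include translations in any other language."
    )
```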

Real-world example: country code to language

A common pattern is to derive the language from the caller's locale on your side — phone country code, IP-based geolocation, browser Accept-Language, a user profile setting — and forward the resulting tag:

{
  "model": "assistant",
  "safety_identifier": "+33612345678",
  "metadata": {
    "language": "fr-FR"
  },
  "messages": [
    {"role": "user", "content": "What's the weather?"}
  ]
}

Coulisse doesn't do the mapping itself. It takes the tag you send and asks the model to respond in that language. That keeps the metadata format stable and the country-code-to-language table (which changes slowly but does change) out of server code.

Errors

A malformed tag returns 400 Bad Request:

{
  "error": {
    "type": "invalid_request",
    "message": "invalid `metadata.language`: invalid language tag: ..."
  }
}

Empty-string and whitespace-only values are rejected the same way.

Token cost tracking

Coulisse converts each chat completion's token usage into a USD cost using a vendored snapshot of LiteLLM's model pricing table. The cost lands in the per-turn llm_call event alongside the raw token counts, so the studio UI shows it next to every model call.

There's nothing to enable. As long as a turn produces token usage and the model is in the table, you'll see a $0.0042-style badge on the corresponding llm_call row in the per-turn event tree.

How it's computed

For each completion Coulisse looks up the configured (provider, model) pair in the vendored table and multiplies:

  • input_tokens × input_cost_per_token
  • output_tokens × output_cost_per_token
  • cache_creation_input_tokens × cache_creation_input_token_cost (Anthropic prompt-cache writes)
  • cached_input_tokens × cache_read_input_token_cost (Anthropic prompt-cache reads)

Missing fields in the upstream table are treated as zero — fine for providers like Groq that don't price cache tokens. Models that don't appear in the table at all yield a null cost: the request still succeeds, the llm_call event still records the token usage, and the studio simply omits the cost badge.
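The computation can be sketched as follows. The price values and the dictionary layout are hypothetical placeholders, not entries from the vendored snapshot; the field names are the ones listed above.

```python
from typing import Optional

# Hypothetical excerpt; real values come from the vendored pricing snapshot.
PRICES = {
    ("anthropic", "example-model"): {
        "input_cost_per_token": 3e-6,
        "output_cost_per_token": 15e-6,
        "cache_read_input_token_cost": 3e-7,
    },
}

# (usage field, price field) pairs, as listed above.
FIELDS = [
    ("input_tokens", "input_cost_per_token"),
    ("output_tokens", "output_cost_per_token"),
    ("cache_creation_input_tokens", "cache_creation_input_token_cost"),
    ("cached_input_tokens", "cache_read_input_token_cost"),
]


def cost_usd(provider: str, model: str, usage: dict) -> Optional[float]:
    """USD cost of one completion, or None when the model has no price entry."""
    price = PRICES.get((provider, model))
    if price is None:
        return None
    # Missing price or usage fields count as zero.
    return sum(usage.get(u, 0) * price.get(p, 0.0) for u, p in FIELDS)
```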

Refreshing the pricing table

The snapshot lives at crates/providers/data/model_prices.json and is checked into git. New models are added upstream regularly; refresh the snapshot with:

just refresh-prices

This downloads the latest version from LiteLLM's main branch and overwrites the local file. The diff lands in git like any other change so you can review what moved before committing.

There's no live fetching at runtime: cost lookup only ever reads from the vendored snapshot. That keeps the request path free of network dependencies and makes pricing updates an explicit, reviewable action.

What's not (yet) covered

  • EUR or other currencies. Cost is stored and displayed in USD only. If there's demand for a configurable display currency (telemetry.display_currency: { code: EUR, usd_rate: 0.92 }-style), it can be added without changing the on-disk format.
  • Cost-based rate limiting. Rate limits currently work on token counts. Cost is recorded but not yet enforced; a future usd_per_day: knob would consume the same data.
  • Per-tool / per-MCP cost. Tool calls have their own tool_call events but don't carry a cost themselves. Costs are charged to the parent llm_call event, which is the only place tokens are spent.
  • Custom or unlisted models. Self-hosted models or models that LiteLLM hasn't added yet won't have a price. There's no YAML override path today; if you need one, open an issue describing the use case.

Skills

A skill is a reusable bundle of instructions you can hand an agent on demand — the same idea as skills in Claude Code or Codex. You write a folder with a SKILL.md describing how to do something ("review a resume", "negotiate a salary", "triage a bug report"), and any agent that opts in can pull those instructions in exactly when they're relevant.

Not to be confused with the coulisse skill CLI command, which installs Coulisse itself as a skill into your coding assistant. This page is about the skills: config section — a primitive alongside mcp, tools, and subagents that your own agents use.

The point is progressive disclosure. An agent's preamble is always in context and costs tokens on every turn. A skill is different: only its one-line description sits in the model's tool list, cheaply advertising that the skill exists. The full body is delivered only when the model decides to use it. You can ship a dozen detailed playbooks without bloating every request.

Writing a skill

A skill is a directory containing a SKILL.md. The file is optional YAML frontmatter followed by a markdown body:

skills/
  resume-review/
    SKILL.md
    rubric.md

---
name: resume-review
description: Review a candidate resume against a role and produce structured feedback.
---

Score the resume on clarity, relevance, and impact. For the scoring rubric and
weights, read the `rubric.md` file bundled with this skill.

Return: a one-line verdict, three strengths, three gaps, and a hire/no-hire lean.

Frontmatter fields:

  • name — how the skill is addressed in YAML and exposed to the model as a tool. Optional; defaults to the directory name. Use a tool-safe name (letters, digits, _, -).
  • description — the one-line summary the model sees in its tool list. This is what it uses to decide whether to reach for the skill, so write it for the caller, not for yourself.

No frontmatter is fine too — a bare SKILL.md becomes a skill named after its directory with an empty description.
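As a rough sketch of the defaults described above, the following parses a SKILL.md by hand. It is an assumption-laden illustration: `parse_skill` is a hypothetical helper, and it only understands flat `key: value` frontmatter lines, whereas a real loader would presumably use a full YAML parser.

```python
from pathlib import Path


def parse_skill(skill_dir: Path) -> dict:
    """Split SKILL.md into frontmatter fields and body, applying the documented defaults."""
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    lines = text.splitlines()
    meta = {}
    body = text
    if lines and lines[0].strip() == "---":
        # Frontmatter runs from the first '---' line to the next one.
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                for raw in lines[1:i]:
                    key, sep, value = raw.partition(":")
                    if sep:
                        meta[key.strip()] = value.strip()
                body = "\n".join(lines[i + 1:])
                break
    return {
        "name": meta.get("name") or skill_dir.name,   # defaults to the directory name
        "description": meta.get("description", ""),   # empty when absent
        "body": body.strip(),
    }
```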

Bundled resource files

Anything else in the skill's directory is a bundled resource the skill body can point at — a rubric, a template, a checklist, a reference doc. The model fetches them on demand through a built-in skill_file tool (one extra level of progressive disclosure: the body loads on use, a referenced file loads only when the model follows the pointer).

Resource access is sandboxed: only files discovered under the skill's own directory at load time are reachable. A skill cannot read outside its folder.
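One way to picture this sandbox is an allow-list built once at load time. The sketch below is only an illustration of that idea; `discover_resources` and `read_resource` are hypothetical names, and the real check inside Coulisse may work differently.

```python
from pathlib import Path


def discover_resources(skill_dir: Path) -> set:
    """Record, at load time, every bundled file under the skill's own directory."""
    return {
        p.relative_to(skill_dir).as_posix()
        for p in skill_dir.rglob("*")
        if p.is_file() and p != skill_dir / "SKILL.md"
    }


def read_resource(skill_dir: Path, allowed: set, rel_path: str) -> str:
    """Serve a file only if it was discovered at load time; refuse anything else."""
    if rel_path not in allowed:
        raise PermissionError(f"{rel_path!r} is not a bundled resource of this skill")
    return (skill_dir / rel_path).read_text(encoding="utf-8")
```

Because the request is matched against a fixed set of known relative paths, a path such as `../secret.txt` is rejected without any path normalisation having to be trusted.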

Enabling skills

By default Coulisse scans ./skills — dropping a folder there is all it takes. Point elsewhere with the top-level block:

skills:
  dir: ./playbooks

A missing directory is not an error; it simply yields no skills.

Agents opt in by name, the same way they opt into MCP tools and subagents:

agents:
  - name: recruiter
    provider: anthropic
    model: claude-sonnet-4-6
    skills: [resume-review, salary-negotiation]

Names that don't match a loaded skill are ignored. An agent with no skills: array gets none.

What the model sees

When an agent has at least one usable skill, its tool list gains:

  • one tool per listed skill — named after the skill, described by its description. Calling it returns the skill's full SKILL.md body.
  • skill_file — reads a bundled resource by skill name and path (relative to that skill's directory).

A typical flow: the model reads a skill's description, decides it's relevant, calls the skill tool to load the instructions, then follows any pointers to bundled files via skill_file.

Skills vs. MCP tools

Skills carry instructions; MCP servers carry capabilities and side effects. A skill tells an agent how to do something; it does not run code, touch the network, or mutate state. If a skill's procedure needs to execute something — score a document with a script, hit an API, write a file — that step belongs in an MCP tool the skill's body tells the model to call. Keeping the boundary here is deliberate: skills stay pure, inspectable text, and anything with effects goes through MCP where it's configured and observed.

Async tasks

Coulisse's primary surface is the OpenAI-compatible /v1/chat/completions endpoint — synchronous, request/response. That's the right shape for chat-driven workflows where a user is waiting on a reply.

It's the wrong shape for everything else: research that takes minutes, scheduled checks, agents that should keep running after the user closes the tab, narration emitted as work progresses. For those, Coulisse has an async lane built on top of the same agent runtime.

How it works

A tasks table stores work the system has accepted but hasn't completed:

queued → running → done | errored

When something fires off a task (the dispatch_task tool from inside an agent run, or a cron, boot, or webhook trigger, with MCP-event triggers planned), a row lands in the table. A background worker pool inside the same Coulisse process drains the queue: each worker pulls the oldest queued task, transitions it to running, calls the same Agents::complete path the sync HTTP endpoint uses, and writes the final reply (or the error) back to the row.

Workers don't know how their task got enqueued. They just see "run agent X with prompt Y for user Z." That's deliberate: every trigger type produces the same shape of work, so adding a new trigger type (MCP event subscriptions, for example) doesn't touch the worker code.
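The worker step can be sketched as a small in-memory state machine. This is an illustration under assumptions: the real queue is a SQLite table, and `Task`, `run_next`, and the `complete` callable (standing in for the agent runtime) are hypothetical names.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, Optional

_ids = itertools.count(1)


@dataclass
class Task:
    agent: str
    prompt: str
    user: str
    id: int = field(default_factory=lambda: next(_ids))
    state: str = "queued"            # queued -> running -> done | errored
    result: Optional[str] = None     # final reply, or the error text


def run_next(tasks: list, complete: Callable[[str, str, str], str]) -> Optional[Task]:
    """Claim the oldest queued task, run it, and record the reply or the error."""
    task = next((t for t in tasks if t.state == "queued"), None)  # list is in enqueue order
    if task is None:
        return None
    task.state = "running"
    try:
        task.result = complete(task.agent, task.prompt, task.user)
        task.state = "done"
    except Exception as exc:
        task.result = str(exc)
        task.state = "errored"
    return task
```

Note that `run_next` only sees agent, prompt, and user, which is the point made above: nothing about the trigger that enqueued the task reaches the worker.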

Dispatching from an agent

Any agent with a configured task queue gets a built-in dispatch_task tool:

{
  "name": "dispatch_task",
  "description": "Enqueue a fire-and-forget background task...",
  "parameters": {
    "type": "object",
    "properties": {
      "agent":  { "type": "string" },
      "prompt": { "type": "string" }
    },
    "required": ["agent", "prompt"]
  }
}

The agent calls it with the target agent name and an initial prompt; the tool returns a task_id immediately and the worker pool runs it in the background. The dispatching agent gets back only the id — not the result. This is the difference from the synchronous subagent dispatch (subagents: [...] in YAML), which blocks until the target replies.

When to use which:

  • Subagent dispatch (sync) — you need the answer before you can continue. "Ask user-tester for friction analysis, then summarize."
  • dispatch_task (async) — the work is genuinely fire-and-forget, or it's too long to make the caller wait. "Start a research task on X. I'll narrate progress as it runs."

Inspecting from an agent

Agents that get the read side of the queue also see a tasks_status tool:

{
  "name": "tasks_status",
  "description": "Report recent background tasks across every agent...",
  "parameters": {
    "type": "object",
    "properties": {
      "limit": { "type": "integer", "minimum": 1, "maximum": 100 },
      "state": { "type": "string", "enum": ["queued", "running", "done", "errored"] }
    },
    "required": []
  }
}

The tool returns a JSON object of the form {"tasks": [...]}, with the array ordered newest first. Each entry carries the agent name, state, a truncated prompt, and the timestamps: enough for an orchestrator to answer "what's going on right now?" from chat, without you having to open /admin/live.

Boot-time reaping

When Coulisse stops mid-task, the worker dies and the row stays at running forever — there's no one to mark it done or errored. On the next coulisse start, before any worker spawns, the queue is swept: every task still in running becomes errored with the reason process restarted before task completed. This pairs naturally with a boot trigger: the wake-up agent sees the reaped rows via tasks_status (filtered by state=errored) and can decide whether to re-dispatch them, escalate, or move on.
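The sweep amounts to a single UPDATE run before any worker starts. The sketch below assumes a hypothetical schema (`state` and `error` columns on a `tasks` table); the actual column names in Coulisse are not documented here.

```python
import sqlite3

REAP_REASON = "process restarted before task completed"


def reap_orphans(db: sqlite3.Connection) -> int:
    """Mark every task still in 'running' as 'errored'; return how many rows were reaped."""
    cur = db.execute(
        "UPDATE tasks SET state = 'errored', error = ? WHERE state = 'running'",
        (REAP_REASON,),
    )
    db.commit()
    return cur.rowcount
```

Running the sweep before workers spawn matters: once a worker is up, a row in running may be legitimately in flight.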

Configuration

There's no tasks: YAML section yet — the queue is always on, with four workers by default. A future tasks: block will let you tune worker count and disable the queue entirely if you don't want async work running in your deployment.

Architecture notes

  • Lives in crates/tasks/. Owns the tasks SQLite table; no other crate touches it.
  • The TaskQueue and TaskStatus traits live in coulisse-core so agents can build the dispatch_task and tasks_status tools without depending on tasks directly. Mirrors the existing ScoreLookup / OneShotPrompt / AgentResolver pattern.
  • Workers run in cli/src/workers.rs, spawned alongside the HTTP server. They share the same Agents runtime, so a background task can call MCP tools and dispatch subagents exactly like a sync request.
  • No special shutdown handling yet. Workers die with the process. A graceful drain that lets in-flight tasks finish before exit is on the roadmap.

Triggers

A trigger is a way to start an agent without anyone making an HTTP request. Cron fires on a schedule; webhooks fire on an inbound POST; boot triggers fire once when Coulisse starts. All three convert to the same shape — a task enqueued via the queue — so the agent runtime doesn't know or care how it was summoned.

This is the primitive that makes Coulisse feel like an office instead of a request handler: agents wake up because something happened, not because someone is waiting.

Why this is platform-agnostic

There's no chat-platform-specific code in Coulisse. The webhook trigger accepts JSON POSTs from anything that can speak HTTP. Connecting Slack means pointing Slack's built-in outgoing webhooks at Coulisse. Connecting GitHub means setting up a webhook on the repo. Anything else that can POST JSON can summon an agent the same way. Coulisse doesn't know the source; it sees an HTTP request.

The cron trigger is purely internal — zero external dependencies.

Cron triggers

Configure under the top-level triggers: list in coulisse.yaml:

triggers:
  - name: daily-standup
    type: cron
    schedule: "0 9 * * *"      # every day at 09:00
    agent: pm
    prompt: "Morning standup: summarize yesterday's activity in 5 bullet points."

  - name: hourly-watch
    type: cron
    schedule: "0 * * * *"       # every hour at :00
    agent: user-tester
    prompt: "One sentence on how things feel right now."

Fields:

  • name — stable identifier used in logs and admin views. Must be unique within the file.
  • type: cron — the discriminator. The other types (boot, webhook) are described below.
  • schedule — cron expression. Either the standard 5-field form (min hour day-of-month month day-of-week) or a 6-field form with leading seconds (sec min …). The 5-field form is normalised to 6-field with a leading 0 seconds. Schedules are validated at startup; bad expressions refuse to boot.
  • agent — name of the agent (or experiment) to invoke. Must exist in agents: / experiments:.
  • prompt — static user message passed to the agent on each fire. Templating from the trigger payload applies to webhook triggers (see below).
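The 5-field to 6-field normalisation mentioned for schedule can be sketched as follows. `normalise_schedule` is a hypothetical helper that only checks the field count; real startup validation would also have to check each field's contents.

```python
def normalise_schedule(expr: str) -> str:
    """Return a 6-field cron expression, prepending '0' seconds to the 5-field form."""
    fields = expr.split()
    if len(fields) == 5:
        return " ".join(["0"] + fields)
    if len(fields) == 6:
        return " ".join(fields)
    raise ValueError(f"expected 5 or 6 cron fields, got {len(fields)}: {expr!r}")
```

For example, `normalise_schedule("0 9 * * *")` yields `"0 0 9 * * *"`, i.e. second 0 of 09:00 every day.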

When the trigger fires, Coulisse enqueues a task and a worker runs the agent through the same handler the sync /v1/chat/completions endpoint uses. The agent gets its full preamble, MCP tools, subagent dispatch, and narration — nothing about background runs is different. Watch them in /admin/live.

User identity

Cron-triggered tasks run as default_user_id (from the top of coulisse.yaml). If unset, they run as a synthetic cron user. Memory partitions are honoured: if daily-standup calls pm with default_user_id: main, it sees the same memory bucket as a human who sends a chat request as main.

Watching cron fire

Tail the log; you'll see one line per arm and one per fire:

INFO cron trigger armed   trigger=daily-standup agent=pm
INFO cron trigger fired   trigger=daily-standup agent=pm task_id=…

Or open /admin/live — tasks created by triggers appear in the Tasks panel the same way dispatch_task tasks do, with the trigger's prompt as the initial message and the agent name as written in YAML.

Boot triggers

A type: boot trigger fires exactly once when Coulisse starts. Use it for "wake up and decide what to do" prompts that should run on every coulisse start — e.g. asking an orchestrator agent to read the queue's leftovers and decide whether a standup is warranted, without forcing a ritual on every restart.

triggers:
  - name: wakeup
    type: boot
    agent: pm
    prompt: |
      You just came back online. Check `tasks_status` for what was running
      before the stop, look at recent commits, and decide whether to post
      a standup. Silence is fine when nothing demands attention.

Fields:

  • type: boot — discriminator.
  • agent, prompt — same as cron: which agent runs, with what initial message.

The task is enqueued during coulisse start, after the worker pool is up. Combined with the boot-time reaper that marks orphaned running tasks as errored, this gives the wake-up agent everything it needs to assess state and resume work — see Async tasks for the queue semantics.

Webhook triggers

A type: webhook trigger declares an HTTP path; Coulisse exposes POST <path> and fires the trigger on each request. This is the universal connector for outside systems — anything that can POST JSON can summon an agent. No chat-platform code in Coulisse.

triggers:
  - name: chat-mention
    type: webhook
    path: /hooks/chat-mention        # must start with /hooks/
    agent: pm
    prompt: "Message from {{sender}} in {{room_name}}: {{body}}"

Fields beyond the cron shape:

  • type: webhook — discriminator.
  • path — HTTP path Coulisse exposes. Must start with /hooks/ to stay clear of the proxy (/v1/*), studio (/admin/*), and OAuth callbacks (/mcp/*). Must be unique across all webhook triggers.
  • agent — name of the agent (or experiment) to invoke. Accepts the same {{a.b.c}} templating as prompt, so one webhook can route to different agents based on the inbound payload (see Templated agent below).
  • prompt — template. {{a.b.c}} placeholders pull values from the JSON payload by dot-path. Missing paths render as the literal {{ a.b.c }} so debugging is obvious. Static prompts (no placeholders) work too — they pass through unchanged.
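The prompt templating rules above can be approximated with a regular expression and a dot-path lookup. This is a sketch, not Coulisse's implementation: `render` and `lookup` are hypothetical names, and the set of characters allowed in a path (letters, digits, `_`, `.`) is an assumption.

```python
import re

_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z0-9_.]+)\s*\}\}")
_MISSING = object()


def lookup(payload, dot_path: str):
    """Walk the JSON payload along a dot-path; return _MISSING if any step is absent."""
    node = payload
    for key in dot_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def render(template: str, payload: dict) -> str:
    """Substitute {{a.b.c}} placeholders; missing paths stay visible as literal text."""
    def substitute(match):
        value = lookup(payload, match.group(1))
        if value is _MISSING:
            return "{{ " + match.group(1) + " }}"
        return str(value)
    return _PLACEHOLDER.sub(substitute, template)
```

A template without placeholders never matches the pattern, so static prompts pass through unchanged.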

Fire it with curl:

curl -X POST http://localhost:8421/hooks/chat-mention \
  -H 'Content-Type: application/json' \
  -d '{"sender":"alice","room_name":"engineering","body":"@coulisse what is the state of the build?"}'

Response:

{ "ok": true, "task_id": "cb9b91c4-54db-4b8c-a564-08282e643c25" }

The task appears in /admin/live like any other.

Templated agent

The agent field accepts the same {{a.b.c}} templating as prompt. This lets one webhook fan out to different agents based on whatever the inbound payload carries — useful when a bridge POSTs one event per mentioned agent:

triggers:
  - name: chat-mention
    type: webhook
    path: /hooks/chat-mention
    agent: "{{agent}}"
    prompt: "@{{sender}} in #{{room}}: {{body}}"

The bridge does the iteration on its side and calls the same webhook N times, once per mentioned agent:

curl -X POST http://localhost:8421/hooks/chat-mention \
  -H 'Content-Type: application/json' \
  -d '{"agent":"pm","sender":"almaju","room":"standup","body":"any release blockers?"}'

curl -X POST http://localhost:8421/hooks/chat-mention \
  -H 'Content-Type: application/json' \
  -d '{"agent":"coder","sender":"almaju","room":"standup","body":"any release blockers?"}'

Two tasks land on the queue, one per agent.

A templated agent field is not cross-validated at config load — the value isn't known until a request arrives. If the resolved name doesn't match any agent, the worker errors the task with an "unknown agent" message; you'll see it in /admin/live. If the placeholder fails to resolve at all (the path is missing from the payload), the webhook returns 400 Bad Request and nothing is enqueued.

What's not here yet

  • Per-trigger user_id. Today every trigger fires as the same default_user_id. A future field will let triggers run as different synthetic users, useful for partitioning memory between scheduled jobs.
  • Skip-on-overlap. If a cron fires while the previous run is still going, both queue up. A skip_if_running: true field would let users opt into "only one at a time."
  • Signature verification on webhooks. Anyone who can reach /hooks/<path> can fire the trigger. For Internet-facing deployments you'd want a shared secret or HMAC check, configurable per trigger. Today the assumption is loopback or trusted network.

Sidecars

A sidecar is a long-lived external process Coulisse spawns alongside itself: a Slack listener, a custom metrics exporter, a bridge to whatever chat platform you use — anything you'd otherwise launch in a separate terminal.

The point is not to add new agent capabilities — agents already get the world via MCP. The point is to keep "one YAML, one start command" honest. If running Coulisse needs you to remember to also run a bridge script, that property has quietly broken.

Coulisse stays platform-agnostic. The sidecars mechanism only knows how to spawn a command, capture its output, and restart it on crash.

Declaring sidecars

sidecars:
  - name: chat-bridge
    command: chat-bridge/.venv/bin/python
    args: [chat-bridge/bridge.py]
    env:
      BOT_PASSWORD: coulisse-dev
    restart: on-failure

  - name: heartbeat
    command: /bin/sh
    args: ["-c", "while true; do echo alive; sleep 60; done"]
    restart: always

Fields:

  • name — stable identifier; appears in every log line emitted by or about the sidecar. Must be unique.
  • command — the executable. Absolute path or anything on PATH. No shell expansion — quote inside YAML if you need spaces.
  • args — argv entries, one per list item.
  • env — environment variables merged on top of Coulisse's own env. ${VAR} placeholders expand the same way the rest of coulisse.yaml expands them, so secrets don't have to be inlined.
  • cwd — working directory. Defaults to wherever you ran coulisse start.
  • restart — always / on-failure (default) / never. on-failure skips a clean exit (status code 0); the other two are self-explanatory.

What happens when a sidecar runs

  1. Coulisse spawns the command in a tokio task at startup.
  2. The sidecar's stdout and stderr are routed line-by-line into Coulisse's own tracing log, tagged with sidecar=<name> and stream=stdout|stderr. You'll see them next to MCP messages and request logs.
  3. When the process exits, Coulisse evaluates the restart policy and either backs off for two seconds and respawns, or stops watching the sidecar.
  4. There's no health check beyond "is the process still alive." If your sidecar hangs without exiting, Coulisse won't notice.
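The restart decision in step 3 reduces to a small function of the policy and the exit status. The sketch below uses hypothetical names (`should_restart`, `BACKOFF_SECONDS`) and leaves out the actual spawning and sleeping.

```python
BACKOFF_SECONDS = 2  # fixed delay before a respawn, per the description above


def should_restart(policy: str, exit_code: int) -> bool:
    """Decide whether an exited sidecar should be respawned."""
    if policy == "always":
        return True
    if policy == "never":
        return False
    if policy == "on-failure":
        return exit_code != 0  # a clean exit (status 0) is not restarted
    raise ValueError(f"unknown restart policy: {policy!r}")
```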

When not to use a sidecar

  • If the work is part of the agent flow, expose it as an MCP server instead — that's the abstraction agents actually use.
  • If the work is short-lived (a one-shot script), schedule it as a cron trigger that runs a small agent prompt instead.
  • If the work needs to outlive Coulisse (database, message broker, homeserver), don't manage it as a sidecar — run it under your real init system (systemd, docker, supervisord). Sidecars die with Coulisse.

Known limitations

  • Orphan processes on abrupt shutdown. Tokio's kill_on_drop sends SIGKILL to children when their Child handle drops, but if Coulisse itself is killed before the runtime can run those destructors, children get reparented to PID 1 and keep running. coulisse stop is a clean SIGTERM; in practice you may need pkill -f <command> to clean orphans up. A graceful-shutdown pass that explicitly SIGTERMs sidecars first is on the roadmap.
  • No retries-with-backoff. Crash-loop policy is fixed at two seconds. A sidecar that's permanently broken (typo in command, missing dependency) will respawn every two seconds forever.
  • No health checks. A hung sidecar that doesn't exit looks alive forever.
  • No admin surface. Sidecar state lives only in the log. A future /admin/sidecars page would show running / restart-count / last-output.

LLM-as-judge evaluation

Coulisse can score every agent reply with a separate LLM — a judge — and persist the result so you can track quality over time. You describe what to evaluate in the YAML rubric; Coulisse handles scoring shape, format, sampling, and storage.

This is useful for watching agent drift, comparing model/preamble changes, and catching regressions without standing up a separate evaluation pipeline.

How it works

  1. A client sends a chat request. The agent replies as usual — the judge never blocks the response.
  2. After the reply is persisted, Coulisse runs each judge the agent opted in to, in a background task.
  3. Each judge samples according to its sampling_rate (skip entirely if the draw misses), then asks its backing model to score the assistant's reply against every rubric at once.
  4. The response is parsed into one score row per rubric — persisted under the same user id as the conversation.
  5. Failures (bad JSON, provider error, timeout) are logged at warn and swallowed — the user already got their answer.

Scores are stored in the same SQLite database as messages and memories, in a scores table keyed by message_id. Averages are computed at read time, not aggregated on write.

YAML

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are a helpful assistant.
    judges: [quality]              # opt in by name

  - name: translator
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: Translate into French.
    judges: [fluency]

judges:
  # Cheap, broad check — 100% of turns, small model.
  - name: quality
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 1.0
    rubrics:
      accuracy:     Factual accuracy. Flag hallucinations.
      helpfulness:  Whether the assistant answered the user's question.
      tone:         Politeness and tone.

  # Targeted check for the translator — only 20% of turns.
  - name: fluency
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 0.2
    rubrics:
      grammar:      Grammatical correctness of the French output.
      naturalness:  How native the phrasing sounds.

The wiring is visible from the agent: when you read an agent block you see which judges score it, rather than having to hunt through the judge list to figure out coverage.

Rubrics

A rubric is a map from criterion name to a short description of what to assess.

rubrics:
  accuracy:    Factual accuracy. Flag hallucinations.
  helpfulness: Whether the assistant answered the user's question.

Keep descriptions terse and assess-able. Don't write scale, format, or JSON instructions into them — Coulisse adds those internally. The description should tell the judge what matters, not how to answer.

Each criterion produces one Score row per scored turn, with its own numeric value and short reasoning. All criteria for one judge are evaluated in a single LLM call, so adding criteria to a judge doesn't multiply cost.

Scoring shape

Every score is an integer in 0..=10 with a one-sentence reasoning. Coulisse forces this shape through the preamble and parses the judge's JSON reply — you don't configure it.

If you need a different scale (e.g. boolean pass/fail, categorical), that will arrive as a future scale: field; the default stays numeric 0-10.
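To make the parsing step concrete, here is a sketch of validating a judge reply into score rows. The JSON shape used here (`{"criterion": {"score": ..., "reasoning": ...}}`) is purely an assumption for illustration; the format Coulisse actually requests from the judge is internal and not documented on this page.

```python
import json


def parse_scores(reply: str, criteria: list) -> list:
    """Turn a judge's JSON reply into (criterion, score, reasoning) rows, or raise ValueError."""
    data = json.loads(reply)  # bad JSON raises json.JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    rows = []
    for criterion in criteria:
        entry = data.get(criterion)
        if not isinstance(entry, dict):
            raise ValueError(f"missing criterion: {criterion}")
        score = entry.get("score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
            raise ValueError(f"score for {criterion} must be an integer in 0..=10")
        rows.append((criterion, score, str(entry.get("reasoning", ""))))
    return rows
```

Any `ValueError` here corresponds to the "logged at warn and swallowed" failure path: no rows are written for that turn.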

Sampling

sampling_rate controls what fraction of turns are scored.

Value            Meaning
1.0 (default)    Score every turn.
0.1              Roughly 10% of turns.
0.0              Never score (useful to park a judge without deleting it).

The draw is independent per turn, per judge. Over many turns the scored fraction converges on the configured rate. Lower rates save tokens for expensive judges; broad cheap judges can run at 1.0.
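An independent per-turn draw can be as simple as comparing a uniform random number against the rate. This is a sketch with a hypothetical `should_score` helper; how Coulisse seeds or sources its randomness is not specified here.

```python
import random


def should_score(sampling_rate: float, rng: random.Random) -> bool:
    """One independent draw: True means this judge scores this turn."""
    # rng.random() is in [0.0, 1.0), so 1.0 always scores and 0.0 never does.
    return rng.random() < sampling_rate
```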

Choosing a judge model

Pick a model that's different from the agent being scored whenever you can. A judge scoring its own output is biased — a cheap cross-provider judge (e.g. gpt-4o-mini judging a Claude agent, or vice versa) is usually closer to neutral.

Strong, slow models make sense for low-volume deep checks (sampling_rate: 0.1). Cheap, fast models make sense for high-volume broad checks (sampling_rate: 1.0).

Multiple judges per agent

Stack judges to get different dimensions at different cost points:

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    judges: [broad_check, deep_audit]

judges:
  - name: broad_check
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 1.0
    rubrics:
      helpfulness: Whether the user's question was answered.
      tone:        Politeness and tone.

  - name: deep_audit
    provider: anthropic
    model: claude-opus-4
    sampling_rate: 0.05             # 5% of turns, expensive
    rubrics:
      accuracy:    Factual accuracy, including references and claims.
      safety:      Harmful, biased, or unsafe content.

Each judge is independent — its own model, rate, and rubric set. A turn can end up with zero, one, or both of these judges scoring it, depending on the sampling draw.

Viewing scores

The studio UI at /admin/ shows a Scores panel per user. It surfaces two things:

  • Averages — mean score per (judge, criterion) across every turn the user has had, with sample count.
  • Recent — the most recent individual scores with reasoning.
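Since averages are computed at read time, the Averages view boils down to a grouped query over the scores table. The sketch below assumes hypothetical column names (`user_id`, `judge`, `criterion`, `score`); only the table name and the message_id key come from this page.

```python
import sqlite3


def averages(db: sqlite3.Connection, user_id: str) -> list:
    """Mean score and sample count per (judge, criterion) for one user, computed on read."""
    return db.execute(
        "SELECT judge, criterion, AVG(score), COUNT(*) "
        "FROM scores WHERE user_id = ? "
        "GROUP BY judge, criterion ORDER BY judge, criterion",
        (user_id,),
    ).fetchall()
```

Computing on read keeps the write path to a plain insert per score, at the price of scanning the user's rows each time the panel loads.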

Validation at startup

Coulisse fails fast on:

  • A judge referencing a provider that's not declared under providers:.
  • A judge with no rubrics.
  • A sampling_rate outside [0.0, 1.0].
  • An agent referencing a judge name that doesn't exist.

Any violation aborts startup with a message naming the offending judge or agent.

Cost control

Two knobs matter:

  1. sampling_rate — the easy one. Halve it, halve the judge bill.
  2. Judge model — the big one. A gpt-4o-mini judge at 100% sampling often costs less than a gpt-4o judge at 10%. Pick the cheapest model that gives you a stable signal.

A useful pattern is to run a cheap judge at 100% and a strong judge at a small fraction — the cheap one catches the broad signal, the strong one spot-checks the hardest cases.

Experiments (A/B testing)

Run multiple agent configurations under a single addressable name and let Coulisse pick which one serves each request. Useful for comparing models, preambles, or tool sets without changing client code.

How it works

  1. Define each candidate as a normal agent under agents:.
  2. Declare an experiment whose name is what clients send as model.
  3. List the candidate agents as variants and choose a strategy.

When a request arrives, the router resolves the experiment name to one variant (and optionally fires off shadow runs in the background). The variant choice is sticky-by-user by default, so the same user always lands on the same variant for a given experiment — conversation memory and persona stay consistent across turns.

Strategies

Three strategies are wired today: split, shadow, and bandit.

split

Weighted random sampling. Sticky by user when sticky_by_user: true (the default) — the variant is a deterministic hash of (user_id, experiment_name) modulo the cumulative weights, with no database writes. Adding or removing a variant reshuffles users.

agents:
  - name: assistant-sonnet
    provider: anthropic
    model: claude-sonnet-4-5-20250929
  - name: assistant-gpt
    provider: openai
    model: gpt-4o

experiments:
  - name: assistant            # what clients send as model
    strategy: split
    variants:
      - agent: assistant-sonnet
        weight: 0.5
      - agent: assistant-gpt
        weight: 0.5
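A stateless sticky split can be sketched by hashing the user and experiment into a point on the cumulative weight line. The choice of SHA-256 and the way the hash is scaled onto the weights are assumptions made for this illustration; `pick_variant` is a hypothetical name.

```python
import hashlib


def pick_variant(user_id: str, experiment: str, variants: list) -> str:
    """variants is a list of (agent_name, weight) pairs with positive weights."""
    digest = hashlib.sha256(f"{user_id}:{experiment}".encode("utf-8")).digest()
    total = sum(weight for _, weight in variants)
    # Map the first 8 bytes of the hash to a point in [0, total).
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for agent, weight in variants:
        cumulative += weight
        if point < cumulative:
            return agent
    return variants[-1][0]  # guard against floating-point rounding at the upper edge
```

The same (user_id, experiment) pair always hashes to the same point, which is why no database write is needed. It is also why changing the variant list moves the boundaries and reshuffles users.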

shadow

Designate one variant as primary; it serves the user normally. The other variants run in the background against the same prepared context, are scored by their judges, and never write to the user's message history. The user never waits on shadow variants.

sampling_rate (default 1.0) controls how often shadow runs fire — set it lower to cap cost.

experiments:
  - name: assistant
    strategy: shadow
    primary: assistant-sonnet
    sampling_rate: 0.25       # 25% of turns also run the shadows
    variants:
      - agent: assistant-sonnet
      - agent: assistant-gpt

Use shadow to collect comparison data before flipping a split rollout — the primary serves all real traffic while you build up scoring evidence on the challenger.

bandit

Epsilon-greedy multi-armed bandit. Reads recent mean scores per variant from the existing scores table, picks the leader most of the time (1 - epsilon), and explores a random arm otherwise. Arms with fewer than min_samples recent scores are forced — the bandit only exploits once every arm has enough evidence.

agents:
  - name: assistant-sonnet
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    judges: [quality]
  - name: assistant-gpt
    provider: openai
    model: gpt-4o
    judges: [quality]

judges:
  - name: quality
    provider: openai
    model: gpt-4o-mini
    rubrics:
      helpfulness: Whether the assistant answered the user's question.

experiments:
  - name: assistant
    strategy: bandit
    metric: quality.helpfulness     # judge.criterion
    epsilon: 0.1
    min_samples: 30
    bandit_window_seconds: 604800   # 7 days
    variants:
      - agent: assistant-sonnet
      - agent: assistant-gpt

Every variant agent must opt into the metric's judge (quality), and that judge must declare the criterion (helpfulness); otherwise the bandit starves on that arm. Validation enforces this at startup.
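The epsilon-greedy rule described for this strategy can be sketched as below. It is a simplified illustration: `choose_arm` and the shape of `stats` are hypothetical, picking randomly among under-sampled arms is an assumption, and the sticky-by-user hashing is left out.

```python
import random


def choose_arm(stats: dict, epsilon: float, min_samples: int, rng: random.Random) -> str:
    """stats maps agent name -> (recent_sample_count, recent_mean_score)."""
    starved = [arm for arm, (count, _) in stats.items() if count < min_samples]
    if starved:
        return rng.choice(starved)       # forced: gather evidence before exploiting
    if rng.random() < epsilon:
        return rng.choice(list(stats))   # explore a random arm
    return max(stats, key=lambda arm: stats[arm][1])  # exploit the current leader
```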

A note on stickiness: with sticky_by_user: true (the default), the bandit decision is computed at request time via a deterministic hash of (user_id, experiment_name), so a given user typically lands on the same arm. Mean scores update as new data arrives, so a user can shift if a different arm overtakes the leader — that is the trade-off for keeping the assignment stateless.

Namespace and migration

Experiment names share a namespace with agent names. To A/B-test an existing agent without breaking clients:

  1. Rename the agent (assistant → assistant-v1).
  2. Add a sibling agent (assistant-v2).
  3. Add an experiment named assistant with both as variants.

Clients keep sending model: assistant and it resolves transparently.

Variants stay individually addressable as agents under their own names (assistant-v1, assistant-v2) — useful for isolating one variant in tests or debugging.

Subagents

A subagent reference can name an agent or an experiment. If orchestrator lists subagents: [assistant] and assistant is an experiment, every subagent call resolves to a variant for the calling user, the same way a top-level request would. Sticky-by-user keeps the variant consistent across the whole conversation.

Give the experiment a purpose: field if it's exposed as a subagent. Its value becomes the tool description the calling agent's LLM sees:

experiments:
  - name: assistant
    purpose: A general-purpose chat assistant.
    strategy: split
    variants:
      - agent: assistant-sonnet
      - agent: assistant-gpt

Bandit subagents read mean scores at call time, so the same exploit/explore behaviour applies inside subagent dispatch.

Telemetry

Each turn's TurnStart event includes agent (the resolved variant), and when an experiment was hit, experiment (the experiment name) and variant (same as agent). Judge scores are tagged with the variant's agent name in the database, so per-variant aggregation flows through the same table without a join — used by the bandit's mean-score query and the studio's per-variant view.

Studio

The studio shows configured experiments at /admin/experiments: strategy, sticky-by-user flag, and per-variant weight + share. For bandit experiments, the page additionally shows the configured metric, epsilon, and min-samples threshold, plus per-variant sample counts and mean scores (loaded inline via htmx from the judges admin endpoints). Shadow experiments call out the primary variant.

Validation

Coulisse rejects the following at startup:

  • Experiment name colliding with an agent name (rename one).
  • Experiment name colliding with another experiment.
  • Experiment with zero variants.
  • Variant referencing an undefined agent.
  • Variant weight <= 0.
  • Duplicate variant agent within one experiment.
  • Strategy-specific fields used with the wrong strategy (e.g. primary on a split experiment).
  • shadow without a primary, or with a primary that's not one of the variants.
  • shadow sampling_rate outside [0.0, 1.0].
  • bandit without a metric.
  • bandit metric that doesn't match an existing judge.criterion, or a variant that doesn't opt into the metric's judge.
  • bandit epsilon outside [0.0, 1.0].
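
A few of the rules above, sketched as a plain function over already-parsed config. This is a reading aid under stated assumptions, not the validator Coulisse ships: `validate_experiments` is a hypothetical name, only a subset of the rules is covered, and the default weight of 1.0 is assumed.

```python
def validate_experiments(agent_names, experiments):
    """Return error strings for a subset of the startup rules."""
    errors = []
    seen = set()
    for exp in experiments:
        name = exp["name"]
        if name in agent_names:
            errors.append(f"experiment '{name}' collides with an agent name")
        if name in seen:
            errors.append(f"experiment '{name}' is defined twice")
        seen.add(name)

        variants = exp.get("variants", [])
        if not variants:
            errors.append(f"experiment '{name}' has no variants")
        used = set()
        for variant in variants:
            agent = variant["agent"]
            if agent not in agent_names:
                errors.append(f"'{name}': undefined agent '{agent}'")
            if agent in used:
                errors.append(f"'{name}': duplicate variant '{agent}'")
            used.add(agent)
            # Assumption: an omitted weight defaults to 1.0.
            if variant.get("weight", 1.0) <= 0:
                errors.append(f"'{name}': variant '{agent}' has weight <= 0")

        if exp.get("strategy") == "bandit":
            if "metric" not in exp:
                errors.append(f"'{name}': bandit without a metric")
            epsilon = exp.get("epsilon")
            if epsilon is not None and not 0.0 <= epsilon <= 1.0:
                errors.append(f"'{name}': epsilon outside [0.0, 1.0]")
    return errors
```

Collecting every error before reporting, rather than stopping at the first, is a design choice of this sketch.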

Smoke tests

A smoke test is a synthetic-user persona that drives a conversation against one of your agents (or experiments). Coulisse plays the user — you write a preamble describing who they are and what they want — and the assistant replies for real. Every assistant turn flows through the same judge pipeline as production traffic, so you get a transcript and scores back without writing any harness code.

Smoke tests are most useful when you're iterating on a prompt: tweak the preamble, hit "Run now" in the studio, watch the scores. Pair them with experiments and a single click exercises the variants: each repetition runs as a fresh synthetic user, so sticky-by-user routing spreads the variants across repetitions, and the judge scores feed straight into bandit selection.

How it works

  1. You trigger a run from the studio (/admin/smoke/<name>) — no client needed.
  2. Coulisse opens a fresh synthetic user id and starts a loop:
    • The persona model produces a "user" message — given the conversation so far with roles flipped (so the model speaks as the user).
    • The target agent replies as it normally would, with all its real MCP tools, subagents, and preambles.
    • The reply is fanned out to every judge the target agent opts into. Scores land in the same scores table as production runs, keyed by the assistant turn's id.
  3. The loop stops when either side emits the configured stop_marker, or when max_turns is hit.
  4. The full transcript is browsable at /admin/smoke/runs/<run_id> — assistant in slate, persona in amber.

Smoke runs never write to the user's memory or rate-limit windows. Each repetition uses a brand-new synthetic user id, so split/bandit experiments naturally sample variants across reps.
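
The loop described above can be sketched in a few lines. This is a simplified model, not the actual runner: `persona` and `agent` are stand-in callables for the two models, `flip_roles` and `run_smoke` are hypothetical names, and the judge fan-out is left out.

```python
def flip_roles(transcript):
    """Swap user/assistant so the persona model speaks as the user."""
    swap = {"user": "assistant", "assistant": "user"}
    return [{"role": swap[m["role"]], "content": m["content"]} for m in transcript]


def run_smoke(persona, agent, initial_message=None, stop_marker=None, max_turns=10):
    """Drive one run. `persona` and `agent` map a message list to reply text."""
    transcript = []
    for turn in range(max_turns):
        # The persona opens, either with the hard-coded message or on its own.
        if turn == 0 and initial_message is not None:
            user_msg = initial_message
        else:
            user_msg = persona(flip_roles(transcript))
        transcript.append({"role": "user", "content": user_msg})
        if stop_marker and stop_marker in user_msg:
            break
        # The target agent replies; a real run would also score this turn.
        reply = agent(transcript)
        transcript.append({"role": "assistant", "content": reply})
        if stop_marker and stop_marker in reply:
            break
    return transcript
```

Each pass through the loop is one persona-then-agent pair, which is the unit that `max_turns` caps.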

YAML

smoke_tests:
  - name: jobseeker_basic
    target: tremplin                 # agent or experiment name
    persona:
      provider: anthropic
      model: claude-haiku-4-5-20251001
      preamble: |
        You are role-playing a 28-year-old looking for a developer job in Paris.
        Reply like a real human: short questions, follow-ups as the conversation goes.
        When you have a satisfactory answer, finish with "[FIN]".
    initial_message: "Hi, I'm looking for work."
    stop_marker: "[FIN]"
    max_turns: 10
    repetitions: 5

| Field | Required | Default | Notes |
| --- | --- | --- | --- |
| name | yes | | Unique within smoke_tests. Shows up at /admin/smoke/<name>. |
| target | yes | | Agent name or experiment name. Resolved through the experiment router per run. |
| persona | yes | | Provider, model, and preamble for the synthetic user. |
| initial_message | no | | Hard-coded first message from the persona. Skipping this lets the persona open the conversation. |
| stop_marker | no | | Substring that ends the run when emitted by either side. |
| max_turns | no | 10 | Cap on persona-then-agent pairs. |
| repetitions | no | 1 | Independent runs launched per "Run now" click. Each gets a fresh synthetic user id. |

Iterating with experiments

Define two variants of an agent (e.g. assistant-v1, assistant-v2), wrap them in a bandit experiment, and target the experiment name from a smoke test:

experiments:
  - name: assistant
    strategy: bandit
    metric: quality.helpfulness
    variants:
      - agent: assistant-v1
      - agent: assistant-v2

smoke_tests:
  - name: convergence
    target: assistant
    repetitions: 50
    persona: { provider: openai, model: gpt-4o-mini, preamble: "..." }

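Hit "Run now" once and the 50 repetitions are routed across the two variants, with every assistant turn adding its judge scores to that variant's samples. The bandit picks the winner on its own; the experiment page shows the per-variant sample counts and mean scores.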

Limitations (today)

  • Smoke runs bypass the memory pipeline. Fact extraction and semantic recall are not exercised.
  • No scheduled runs — trigger is manual via the studio.
  • No tool-call assertions; assertions about what the agent did during a turn live in the judge rubrics.

Telemetry

The telemetry: block controls observability — what Coulisse logs to stderr, what it persists to SQLite for the studio UI, and whether it ships traces to your own OpenTelemetry backend.

Every field has a sensible default. Omit the block and you get stderr logs at info plus the studio's per-turn event tree, with no external traces.

Shape

telemetry:
  fmt:
    enabled: true        # stderr logs; default on
  sqlite:
    enabled: true        # mirrors spans into the studio's tables; default on
  otlp:                  # absent = disabled (default)
    endpoint: "http://localhost:4317"
    protocol: grpc       # or http_binary
    service_name: coulisse
    headers:
      authorization: "Bearer ${OTEL_API_KEY}"

All three layers compose. Turn sqlite off if you don't need the studio. Add otlp to ship the same traces to Grafana, SigNoz, Jaeger, Honeycomb, or any OTLP-compatible backend.

telemetry.fmt

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| enabled | bool | no | Default true. |

Writes structured logs to stderr. The level is controlled by the RUST_LOG environment variable; without it, the default is info,sqlx=warn (info from Coulisse, warnings only from the SQL driver). To see internal SQL traffic, run with RUST_LOG=debug. To keep only errors, set RUST_LOG=error.

telemetry.sqlite

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| enabled | bool | no | Default true. |

Mirrors turn and tool_call tracing spans into the events and tool_calls tables that the studio UI reads. Without this layer, the studio loses its per-turn event tree and tool-call panel.

The schema is part of the same SQLite file the rest of Coulisse persists into (.coulisse/coulisse-memory.db).

telemetry.otlp

Absent (the default) means Coulisse does not export traces externally. To plug Coulisse into your own observability stack, set the block:

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| endpoint | string | yes | Collector URL. |
| protocol | enum | no | grpc (default) or http_binary. |
| service_name | string | no | OpenTelemetry resource attribute service.name. Default coulisse. |
| headers | map | no | Static HTTP/gRPC headers attached to every export. |

Endpoint defaults

  • gRPC (the default): port 4317, e.g. http://localhost:4317.
  • HTTP/protobuf: port 4318, e.g. http://localhost:4318/v1/traces.

The collector you point at decides the rest. Coulisse ships traces with service.name = coulisse (unless service_name overrides it) and span names turn, tool_call, and llm_call. Span fields carry user_id, turn_id, agent, tool_name, kind, and the rest documented in the features chapter.

Headers

Useful for managed backends:

telemetry:
  otlp:
    endpoint: "https://ingest.us.signoz.cloud:443"
    protocol: grpc
    headers:
      "signoz-access-token": "${SIGNOZ_TOKEN}"

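Coulisse expands ${...} placeholders from the environment before parsing the YAML, so set SIGNOZ_TOKEN in the server's environment rather than hard-coding the token. See Environment variables.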

How the layers compose

The CLI installs a single tracing_subscriber registry with the layers your config asked for, in order:

  1. RUST_LOG env filter
  2. fmt → stderr (when fmt.enabled)
  3. sqlite → events + tool_calls tables (when sqlite.enabled)
  4. otlp → external collector (when otlp is set)

Every span emitted by the running server fans out to all enabled layers. There is no priority or fallback — the SQLite layer keeps full payloads (full prompts, args, results), the OTLP layer ships the same span attributes to your collector. If your backend chokes on multi-megabyte attributes, drop those fields in your collector pipeline rather than at the source.

Telemetry

Coulisse emits its own observability via the tracing crate. Every request opens a turn span; every tool invocation (MCP or subagent) opens a child tool_call span. The configured layers — fmt, SQLite, and optionally OTLP — receive those spans and route them where you've asked for.

The result: the studio UI gives you an offline audit trail, and any OpenTelemetry-compatible backend (Grafana, SigNoz, Jaeger, Honeycomb, ...) gives you live traces. They're driven from the same source — there's no separate path.

Span model

| Span name | Opened when | Fields |
| --- | --- | --- |
| turn | a chat completion request arrives | agent, experiment (when applicable), turn_id, user_id, user_message |
| tool_call | an MCP or subagent tool fires | args, error (on failure), kind (mcp/subagent), result, tool_name |
| llm_call | a chat completion finishes (token usage is known) | cost_usd (when the model is in the pricing table), model, provider, usage |

turn is the root; tool_call and llm_call nest under it via the tracing span tree, so OTLP backends render them as a trace tree out of the box.

Studio integration

When telemetry.sqlite.enabled is true (the default), the studio's per-turn event tree and tool-call panel render directly from the same spans. Nothing extra to wire up — open /admin/ and the tree is there.

OTLP backends

Set telemetry.otlp.endpoint to start exporting. The exporter batches spans, retries on transient failures, and shuts down cleanly on process exit so in-flight spans land before the server stops.

Tested with:

  • Grafana (Tempo / Cloud) — gRPC at 4317.
  • SigNoz (self-hosted or Cloud) — gRPC; for Cloud add a signoz-access-token header.
  • Jaeger — gRPC at 4317 (Jaeger ≥ 1.50 speaks OTLP natively).
  • Honeycomb — HTTP/protobuf at https://api.honeycomb.io/v1/traces with x-honeycomb-team header.

Tuning verbosity

The fmt layer (stderr logs) is controlled by RUST_LOG:

RUST_LOG=info,sqlx=warn coulisse        # default
RUST_LOG=debug coulisse                 # verbose, including SQL driver
RUST_LOG=warn coulisse                  # quiet
RUST_LOG=coulisse=debug,agents=trace coulisse   # per-crate filtering

The SQLite and OTLP layers are not affected by RUST_LOG — they capture every turn / tool_call / llm_call span regardless of log level.

Disabling layers

The fmt and sqlite layers each have their own enabled flag; the otlp layer is on whenever its block is present. Common combinations:

# Production with external observability stack
telemetry:
  sqlite:
    enabled: false      # studio not exposed; no need to keep DB rows
  otlp:
    endpoint: "..."

# Local development, no external backend
telemetry:
  # default fmt + sqlite

# CI / load tests: minimize logging overhead
telemetry:
  fmt:
    enabled: false
  sqlite:
    enabled: false

CLI reference

Coulisse ships as a single binary with a handful of subcommands. Every subcommand accepts -c, --config <PATH> (default coulisse.yaml) and honors the COULISSE_CONFIG env var as a fallback.

State files (coulisse.pid, coulisse.log) live in a .coulisse/ directory next to the config file. This keeps state co-located with the project, so running coulisse stop from the project directory "just works."

coulisse init

Write a starter coulisse.yaml in the current directory.

coulisse init                 # minimal template (one OpenAI agent + sqlite memory)
coulisse init --from-example  # full annotated example (every section, every option)
coulisse init --force         # overwrite an existing coulisse.yaml

coulisse start

Start the server, detached by default. Returns once the server has written its PID file, or fails if that takes longer than 5 seconds.

coulisse start                # detached background server
coulisse start --foreground   # attached: logs stream to the terminal
coulisse start -F             # short form

A bare coulisse invocation is equivalent to coulisse start --foreground — the historical pre-subcommand behavior is preserved.

When detached, stdout/stderr are appended to .coulisse/coulisse.log.

coulisse stop

Send SIGTERM to a running detached server (PID read from .coulisse/coulisse.pid).

coulisse stop          # graceful: SIGTERM, wait up to 10s
coulisse stop --force  # SIGKILL (use if the server is wedged)

Stop is a no-op if the server isn't running — stale PID files left over from crashes are detected and removed.
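
Stale PID detection of this kind is commonly done by probing the recorded PID with signal 0. The sketch below shows that general technique on POSIX systems; it is not taken from the Coulisse source, and `pid_is_alive` and `check_pid_file` are hypothetical names.

```python
import os


def pid_is_alive(pid):
    """Probe a PID. Signal 0 only performs error checking; nothing is delivered."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def check_pid_file(path):
    """Return the live PID, or None. A file naming a dead PID is removed."""
    try:
        with open(path) as handle:
            pid = int(handle.read().strip())
    except (FileNotFoundError, ValueError):
        return None
    if pid_is_alive(pid):
        return pid
    os.remove(path)  # stale file left over from a crash
    return None
```

A caller would treat `None` as "not running" and return without sending any signal.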

coulisse restart

Equivalent to coulisse stop && coulisse start.

coulisse reset

Delete the SQLite database, wiping all stored state — conversation memory, long-term memories, telemetry, judge scores, rate-limit windows, background tasks, and API tokens. Your coulisse.yaml is never touched.

Destructive and irreversible, so it refuses to run while a server holds the database open (stop it first), and prompts for confirmation unless -y is passed. Removes the database file (.coulisse/coulisse-memory.db) plus its -wal/-shm sidecars.

coulisse reset       # warns, lists the files, asks to confirm
coulisse reset -y    # skip the prompt (for scripts / fresh starts)

coulisse status

Report whether the detached server is running and where its files live.

running (pid 31427)
  config: ./coulisse.yaml
  log:    ./.coulisse/coulisse.log

coulisse studio

Open the studio UI (/admin/) in the default web browser. Requires the server to be running — start it first with coulisse start.

coulisse studio   # also: coulisse admin
# opening http://localhost:8421/admin/

The URL honors server.port from coulisse.yaml, so multiple Coulisse instances on different ports each open their own studio.

coulisse token

Mint, list, and revoke the self-issued API tokens that gate /v1/* when auth.proxy.tokens is enabled. Operates on the same database the running server uses, so changes are live immediately.

coulisse token create laptop --principal alice         # unlimited
coulisse token create ci --principal alice \
  --budget monthly --limit 20                          # $20 / month cap
coulisse token list                                    # tokens + spend
coulisse token revoke <id>                             # immediate 401 for clients

create prints the secret (sk-coulisse-…) to stdout — shown only once — and the id/context to stderr, so coulisse token create … > key.txt captures just the key.

coulisse check

Load and validate the YAML without starting the server. Catches schema errors and cross-reference issues (agent → provider, agent → judge, experiment variant → agent, ...) before a real start.

coulisse check
# ok — coulisse.yaml (3 agents, 1 judges, 0 experiments, 2 providers)

coulisse schema

Emit the JSON Schema for coulisse.yaml to stdout. Redirect to a file next to your config and reference it for IDE autocompletion and validation:

coulisse schema > coulisse.schema.json
# yaml-language-server: $schema=./coulisse.schema.json

Picked up by the VS Code YAML extension, Helix, Neovim, Zed, JetBrains — anything that speaks the yaml-language-server directive. The schema is generated from the same Rust types that parse the config, so it never drifts.

coulisse update

Fetch the latest release from GitHub and replace the running binary in place. Detects the host target triple (e.g. aarch64-apple-darwin) and downloads the matching cargo-dist artifact. No-op if you're already on the latest version.

coulisse update
# checking for updates...
# updated to 0.2.0

The binary needs write permission to its own path — if you installed under /usr/local/bin you may need sudo.

State directory layout

your-project/
├── coulisse.yaml
└── .coulisse/
    ├── coulisse.pid          # written by `start`, removed on clean exit
    ├── coulisse.log          # detached stdout/stderr
    ├── secrets.env           # MCP OAuth encryption keys (when configured)
    ├── files/                # uploaded file blobs (fs storage backend)
    └── coulisse-memory.db    # SQLite database

.coulisse/ holds the whole runtime footprint of one project under a single directory: the SQLite database, uploaded files, logs, PID, and secrets all land here, and the paths are not configurable. Mount this one directory to persist Coulisse's state in Docker.

HTTP API

Coulisse listens on 0.0.0.0:8421 by default (configurable under server) and exposes an OpenAI-compatible surface.

POST /v1/chat/completions

The main chat endpoint. Accepts the standard OpenAI chat completion request shape.

Request

{
  "model": "assistant",
  "safety_identifier": "user-123",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}

| Field | Required | Notes |
| --- | --- | --- |
| messages | yes | The usual OpenAI message array. At least one user message is required. |
| metadata | no | Optional map of strings. Used for per-request rate limits; see below. |
| model | yes | Name of an agent from your config. |
| response_format | no | Pin the reply shape: {"type": "json_object"} or {"type": "json_schema", "json_schema": {…}}. Validated and enforced for every provider; see Structured outputs. |
| safety_identifier | yes¹ | Identifies the user. Can be any stable string. |
| stream | no | When true, the response is an SSE stream of chat.completion.chunk frames. See Streaming responses. |
| stream_options | no | Object. include_usage: true adds the usage field to the terminal stream chunk. |
| user | | Deprecated OpenAI field; accepted as a fallback. |

¹ Required unless a default_user_id is set in coulisse.yaml — see User identification.

Recognized metadata keys

metadata is a passthrough map of strings. Coulisse interprets the following keys; any other keys are ignored.

| Key | Type | Meaning |
| --- | --- | --- |
| language | BCP 47 tag | Forces the response language, e.g. fr-FR. See Response language. |
| tokens_per_day | integer (as string) | Max tokens per rolling day. |
| tokens_per_hour | integer (as string) | Max tokens per rolling hour. |
| tokens_per_month | integer (as string) | Max tokens per rolling 30-day window. |

All optional. See Rate limiting for the token-limit behavior.
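
A request with per-request limits can be assembled with nothing but the standard library. The sketch below only builds the request object; `build_chat_request` is a hypothetical helper, and the base URL assumes the default port.

```python
import json
import urllib.request


def build_chat_request(base_url, agent, user_id, text, metadata=None):
    """Build (but do not send) a POST to /v1/chat/completions."""
    body = {
        "model": agent,
        "safety_identifier": user_id,
        "messages": [{"role": "user", "content": text}],
    }
    if metadata:
        # metadata is a map of strings, so numeric limits are stringified.
        body["metadata"] = {key: str(value) for key, value in metadata.items()}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it; any OpenAI-compatible SDK produces the same body.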

Response

Standard OpenAI chat completion response:

{
  "id": "...",
  "object": "chat.completion",
  "created": 1714000000,
  "model": "assistant",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hi!"},
      "finish_reason": "stop"
    }
  ]
}

Streaming

Set stream: true to receive chat.completion.chunk frames over Server-Sent Events instead of one JSON response. The full wire format and disconnect semantics live in Streaming responses.

Errors

Errors come back in OpenAI's error shape:

{
  "error": {
    "type": "invalid_request_error",
    "message": "safety_identifier is required",
    "code": null
  }
}

Common cases:

  • 400 — missing safety_identifier (when required), no user message, unknown agent name, unparseable metadata values, a malformed response_format JSON Schema.
  • 429 — per-user token limit exceeded. Includes a Retry-After header with seconds until the window resets. See Rate limiting.
  • 5xx — upstream provider error, MCP server failure, a response_format reply that never validated after repair retries. See Structured outputs.
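
On the client side, these cases can be folded into one small decision function. The retry policy here (honor Retry-After on 429, retry 5xx after one second, fail on everything else) is this sketch's assumption rather than a recommendation from Coulisse, and `classify_error` is a hypothetical name.

```python
import json


def classify_error(status, headers, body_text):
    """Return (action, wait_seconds, message) for a non-2xx reply."""
    try:
        message = json.loads(body_text)["error"]["message"]
    except (ValueError, KeyError, TypeError):
        message = body_text  # body is not in the documented error shape
    lowered = {key.lower(): value for key, value in headers.items()}
    if status == 429:
        # Retry-After carries the seconds until the window resets.
        return ("retry", float(lowered.get("retry-after", 1)), message)
    if status >= 500:
        return ("retry", 1.0, message)
    return ("fail", 0.0, message)
```

A 5xx caused by a response_format reply that never validated may fail again on retry, so a real client would cap the number of attempts.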

GET /v1/models

Lists every agent defined in the config.

Response

{
  "object": "list",
  "data": [
    {"id": "assistant", "object": "model", "owned_by": "coulisse"},
    {"id": "code-reviewer", "object": "model", "owned_by": "coulisse"}
  ]
}

Useful for UI dropdowns that want to populate a model picker from the server.

Admin / config endpoints

Everything under /admin/* is a single content-negotiated surface. The same routes serve HTML pages to browsers, HTML fragments to htmx, and JSON to scripts — set Accept: application/json (or send an HX-Request header) to switch representation. Request bodies are equally tolerant: application/json, application/yaml, and application/x-www-form-urlencoded all deserialize into the same target type.

All admin routes are gated by the auth.admin scope.

Agents

| Method | Path | Body | Notes |
| --- | --- | --- | --- |
| GET | /admin/agents | | List configured agents (HTML or JSON). |
| POST | /admin/agents | AgentConfig | Create a new agent. 409 if the name is taken. |
| GET | /admin/agents/{name} | | Detail (HTML or JSON). |
| PUT | /admin/agents/{name} | AgentConfig | Replace the named agent. Body name must match URL. |
| DELETE | /admin/agents/{name} | | Remove the named agent. |
| GET | /admin/agents/new | | HTML form for a new agent. |
| GET | /admin/agents/{name}/edit | | HTML edit form. |

AgentConfig is the same shape used in coulisse.yaml: name, provider, model, preamble, purpose (optional), judges (list, optional), subagents (list, optional), mcp_tools (list, optional).

Judges, experiments, providers, MCP servers

Same CRUD shape as agents — list / create / one / update / delete. Adjust the path to suit:

| Path | Body | Notes |
| --- | --- | --- |
| /admin/judges + /admin/judges/{name} | JudgeConfig | LLM-as-judge evaluators. |
| /admin/experiments + /admin/experiments/{name} | ExperimentConfig | A/B routing groups. The runtime ExperimentRouter rebuilds on restart; admin display reflects the file in real time. |
| /admin/providers + /admin/providers/{kind} | ProviderConfig (just api_key); POST body adds kind | Where {kind} is one of anthropic, cohere, deepseek, gemini, groq, openai. The runtime client is built at boot; restart to swap. |
| /admin/mcp + /admin/mcp/{name} | McpServerConfig (transport: stdio + command/args/env, or transport: http + url); POST body adds name | Connections open at boot; restart to attach a new server. |

Whole-file config

| Method | Path | Body | Notes |
| --- | --- | --- | --- |
| GET | /admin/config | | Returns the file contents (application/yaml by default, JSON when Accept: application/json). |
| PUT | /admin/config | full YAML/JSON | Replaces coulisse.yaml atomically. Validation runs before any disk write. |
| GET | /admin/openapi.json | | OpenAPI 3.1 description of every admin route, including request/response schemas. Feed it to openapi-generator or any client codegen for typed SDKs. |

Validation, hot reload, the file watcher

Every write — admin form save, JSON PUT, hand-edit in $EDITOR — flows through the same pipeline:

  1. The body is merged into the on-disk YAML (preserving sections this binary doesn't recognize).
  2. The full result is deserialized into a Config and run through cross-feature validation (provider references, judge references, experiment variants, …).
  3. Only on success does anything touch disk: a temp file is written and renamed atomically.
  4. The file watcher fires, the new config is reloaded, and feature crates' hot-reloadable state (agent list, judges list, experiments list, settings view) atomically swaps in.

Errors return the validator's message verbatim with a 422 Unprocessable Entity (or 400 for malformed bodies). The on-disk file is unchanged when validation rejects a write.
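
The validate-then-rename part of that pipeline follows a well-known pattern, sketched here in isolation. It shows the general technique only: `write_config_atomically` and the `validate` callback are hypothetical, and the merge and file-watcher steps are omitted.

```python
import os
import tempfile


def write_config_atomically(path, new_text, validate):
    """Write new_text to path only if validate(new_text) does not raise."""
    validate(new_text)  # nothing touches disk unless this passes
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(new_text)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Readers of the file see either the old contents or the new ones, never a half-written mix.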

The studio UI is just one client of these endpoints — see Studio UI for what the rendered surface offers and authentication options.

Auth

By default Coulisse leaves /v1/* open. Configure the auth.proxy scope in YAML to require Basic credentials, OIDC, or self-issued API tokens for SDK clients; configure auth.admin to gate the studio. See Studio UI for the schema. Anything you don't gate is your responsibility to protect at the infrastructure layer (reverse proxy, API gateway, VPN).

YAML schema

A complete reference for every field in coulisse.yaml.

IDE autocompletion and validation

Coulisse derives a JSON Schema from the Rust types that parse the YAML, so your editor can autocomplete and lint the config live. Generate the schema next to your config:

coulisse schema > coulisse.schema.json

Then reference it from the top of coulisse.yaml with the yaml-language-server directive (recognised by the VS Code YAML extension, Helix, Neovim, Zed, JetBrains, etc.):

# yaml-language-server: $schema=./coulisse.schema.json

The schema is also shipped at the repo root as coulisse.schema.json and is the single source of truth for the field tables below — they describe the same shape in prose.

Environment variables

Any string value in coulisse.yaml can reference an environment variable with ${VAR_NAME}:

providers:
  anthropic:
    api_key: ${ANTHROPIC_API_KEY}
  openai:
    api_key: ${OPENAI_API_KEY}

Coulisse expands all ${...} placeholders before parsing the YAML, so substitution works in any field — API keys, URLs, tokens, passwords, MCP env blocks, etc.

If a referenced variable is not set in the environment, the server refuses to start and prints an error naming the missing variable. An unclosed ${ with no matching } is also rejected at startup.
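
The expansion rules above amount to a textual pre-pass over the file. The sketch below models them with a regular expression; it is an approximation, not the actual parser. `expand_env` is a hypothetical name, and leaving ${vars.*} untouched for a later pass is an assumption.

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{([^}]*)\}")


def expand_env(text, env=None):
    """Replace ${NAME} with env[NAME]; reject missing names and unclosed ${."""
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if name.startswith("vars."):
            return match.group(0)  # assumption: handled by a later vars pass
        if name not in env:
            raise ValueError(f"environment variable {name} is not set")
        return env[name]

    expanded = _PLACEHOLDER.sub(substitute, text)
    # Any "${" left once the well-formed placeholders are removed is unclosed.
    if "${" in _PLACEHOLDER.sub("", text):
        raise ValueError("unclosed ${ placeholder")
    return expanded
```

Because the pass runs on raw text, it applies to every field without the YAML parser being involved.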

Config variables

Named text snippets declared under a top-level vars: block and spliced into other string fields with ${vars.<name>}. Useful for sharing a preamble footer across agents, repeating a path, or factoring any string that would otherwise duplicate.

vars:
  team_footer: |
    Team: @pm, @coder, @qa
    Rooms: #standup, #engineering, #worklog

agents:
  - name: pm
    provider: anthropic
    model: claude-opus-4-7
    preamble: |
      You are the PM.
      ${vars.team_footer}
  - name: coder
    provider: anthropic
    model: claude-sonnet-4-6
    preamble: |
      You are the coder.
      ${vars.team_footer}

${vars.<name>} is resolved after environment-variable expansion, so a var's value can itself contain ${VAR} references. Substitution is single-pass: a substituted value containing ${vars.x} is not re-expanded. Unknown ${vars.x} references abort startup with the offending line.

Multi-line var values inherit the placeholder's leading indent — every line after the first gets prefixed with the same whitespace as the line containing ${vars.x}. This lets a snippet splice cleanly into a YAML block scalar (preamble: |) without breaking the indentation contract.
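
The indentation rule is easiest to see in code. The following sketch reimplements it line by line for illustration. `splice_vars` is a hypothetical name, the accepted name characters are assumed, and dropping the snippet's trailing newline is also an assumption.

```python
import re

_VAR = re.compile(r"\$\{vars\.([A-Za-z0-9_]+)\}")


def splice_vars(text, variables):
    """Expand ${vars.name}; continuation lines inherit the line's indent."""
    out = []
    for line in text.split("\n"):
        indent = line[: len(line) - len(line.lstrip())]

        def substitute(match):
            name = match.group(1)
            if name not in variables:
                raise ValueError(f"unknown var '{name}' in line: {line.strip()}")
            first, *rest = variables[name].rstrip("\n").split("\n")
            # Every line after the first gets the placeholder line's indent.
            return "\n".join([first] + [indent + extra for extra in rest])

        # re.sub does not rescan its own output, so the pass is single-pass.
        out.append(_VAR.sub(substitute, line))
    return "\n".join(out)
```

With the team_footer example above, the second footer line lands at the same depth as the first, so the block scalar stays valid.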

Top-level

agents: [ ... ]               # required, non-empty
auth: { ... }                 # optional; per-scope auth for /v1/* and /admin/*
default_user_id: <string>     # optional, unset by default
experiments: [ ... ]          # optional; A/B test groups over agents
judges: [ ... ]               # optional; empty/omitted = no evaluation
mcp: { ... }                  # optional
memory: { ... }               # optional; defaults to sqlite history, no long-term memory
providers: { ... }            # required
public_base_url: <string>     # optional; used for MCP OAuth redirect URIs (default: http://localhost:{port})
server: { ... }               # optional; bind/port/threads/body-limit (defaults to 0.0.0.0:8421)
sidecars: [ ... ]             # optional; long-lived helper processes Coulisse spawns alongside itself
skills: { ... }               # optional; skill directory (defaults to ./skills)
smoke_tests: [ ... ]          # optional; synthetic-user evaluation runs
storage: { ... }              # optional; file upload backend (default: fs, no quota)
telemetry: { ... }            # optional; fmt + sqlite on by default, OTLP opt-in
triggers: [ ... ]             # optional; cron / webhook / boot
vars: { name: value, ... }    # optional; named snippets referenced via ${vars.<name>}

auth

  • Type: object
  • Optional. Omit to leave both surfaces unauthenticated (fine for local dev, never for anything exposed beyond loopback).

Two independent scopes:

  • auth.proxy guards the OpenAI-compatible /v1/* surface that SDK clients call.
  • auth.admin guards the /admin/* surface (the studio UI).

Each scope is itself optional and accepts the same shape: exactly one of basic, oidc, or tokens when present (tokens on the proxy scope only). They are mutually exclusive within a scope — the server rejects a scope block that has more than one or none. The two scopes are independent, so you can enable Basic on one and OIDC on the other.

auth.<scope>.basic

Static HTTP Basic credentials. Best for local dev or a single-operator deployment.

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| password | string | yes | | Non-empty. Rotate if suspected leaked; there's no token revocation. |
| username | string | no | admin | Non-empty when set. |

auth:
  admin:
    basic:
      password: choose-something-strong
      username: admin

auth.<scope>.oidc

Authorization-code-with-PKCE login against an OIDC-compliant IdP (Authentik, Keycloak, Auth0, Google, etc.). Access control is delegated to the IdP's application policy — Coulisse accepts any successfully authenticated user.

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| client_id | string | yes | | Must match the client registered at the IdP. |
| client_secret | string | no | | Required for confidential clients (Authentik's default); omit for public clients using PKCE only. |
| issuer_url | string | yes | | IdP issuer. For Authentik: https://<host>/application/o/<app-slug>/. |
| redirect_url | string | yes | | Public base URL inside the protected scope. Must be registered as the redirect URI at the IdP. axum-oidc allows every subpath of this URL as a valid redirect. |
| scopes | list<string> | no | [email, profile] | Extra OAuth2 scopes. openid is added automatically. |

auth:
  admin:
    oidc:
      issuer_url:    https://authentik.example.com/application/o/coulisse/
      client_id:     coulisse-admin
      client_secret: <secret>
      redirect_url:  http://localhost:8421/admin/

auth.proxy.identity

How the per-user identity that partitions memory, recall, MCP sessions, and rate limits is derived. Only valid on the proxy scope — the admin surface has no per-user partitioning, so from_credential there is rejected at startup.

| Value | Behavior |
| --- | --- |
| from_request | Default. Trust the safety_identifier (or deprecated user) field in the request body. Correct for single-user setups and trusted first-party backends that set the identifier on behalf of their own authenticated users. |
| from_credential | Derive the identity from the authenticated principal: the Basic username or the OIDC sub claim. A request body claiming a different safety_identifier is rejected with 403. Use this for adversarial multi-tenant serving, where clients cannot be trusted to declare their own identity. |

from_credential requires auth.proxy to declare basic or oidc (you can't bind to a credential that isn't checked), and is mutually exclusive with default_user_id — a shared default bucket would bypass the binding. With Basic, every distinct user needs distinct credentials, since the username is the identity; OIDC gives each user a distinct sub automatically.

auth:
  proxy:
    oidc:
      issuer_url:    https://authentik.example.com/application/o/coulisse/
      client_id:     coulisse-proxy
      client_secret: <secret>
      redirect_url:  http://localhost:8421/v1/
    identity: from_credential   # user = the OIDC subject, not the request body
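
The two modes reduce to a small decision function. This is a sketch of the documented behavior rather than the server's code: `resolve_user_id` and the two exception classes are hypothetical, and treating an empty identifier as missing is an assumption.

```python
class Forbidden(Exception):
    """Stands in for an HTTP 403."""


class BadRequest(Exception):
    """Stands in for an HTTP 400."""


def resolve_user_id(mode, principal, body, default_user_id=None):
    """Return the user id that partitions memory and rate limits."""
    # safety_identifier wins; the deprecated user field is the fallback.
    claimed = body.get("safety_identifier") or body.get("user")
    if mode == "from_credential":
        if claimed is not None and claimed != principal:
            raise Forbidden("safety_identifier does not match the credential")
        return principal
    # from_request: trust the body, then fall back to default_user_id.
    if claimed is not None:
        return claimed
    if default_user_id is not None:
        return default_user_id
    raise BadRequest("safety_identifier is required")
```

The from_credential branch never consults default_user_id, which mirrors the rule that the two settings are mutually exclusive.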

auth.proxy.tokens

Self-issued API tokens — Coulisse mints sk-coulisse-… bearer keys, stores only their hash, and gates /v1/* on them. Set the (currently empty) block to turn the scheme on; tokens are then created at runtime, never in YAML:

auth:
  proxy:
    tokens: {}   # enable bearer-token auth on /v1/*

Clients authenticate exactly like the OpenAI API: Authorization: Bearer sk-coulisse-…. Each token binds to a principal (the user id that partitions memory, recall, and rate limits), so token auth always implies credential-bound identity — a request body claiming a different safety_identifier is rejected with 403, and default_user_id does not apply.

Mint, monitor spend on, and revoke tokens from the studio's Tokens page or the coulisse token CLI. Each token carries a budget — unlimited, a lifetime cap, or a per-calendar-month cap; a request that would exceed it is rejected with 429 insufficient_quota before any provider call. See API tokens.
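
Storing only a hash and checking a budget before the provider call might look like the sketch below. The details are assumptions made for illustration: SHA-256 as the hash, the secret's length, and a pre-call cost estimate are not confirmed by this page, and all three function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def mint_token():
    """Return (secret, stored_hash); only the hash would be persisted."""
    secret = "sk-coulisse-" + secrets.token_urlsafe(32)
    return secret, hashlib.sha256(secret.encode()).hexdigest()


def verify_token(presented, stored_hash):
    """Hash the presented key and compare in constant time."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(digest, stored_hash)


def within_budget(spent_usd, limit_usd, estimated_cost_usd=0.0):
    """limit_usd of None means unlimited; otherwise the cap must hold."""
    return limit_usd is None or spent_usd + estimated_cost_usd <= limit_usd
```

A request failing `within_budget` corresponds to the 429 insufficient_quota rejection described above.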

default_user_id

  • Type: string
  • Default: unset
  • Purpose: fallback identifier for requests that don't supply safety_identifier (or the deprecated user).

Leave it unset for multi-tenant deployments — unidentified requests will be rejected. Set it to something like "main" for local or single-user setups so memory still works whether or not the client bothers to send an id. See User identification.

providers

  • Type: map of provider_kind → provider_config
  • Required. At least one provider must be declared.

Supported keys

anthropic, cohere, deepseek, gemini, groq, openai.

Per-provider fields

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| api_key | string | yes | Provider API key. |

providers:
  anthropic:
    api_key: ${ANTHROPIC_API_KEY}
  openai:
    api_key: ${OPENAI_API_KEY}

mcp

  • Type: map of server_name → server_config
  • Optional. Omit if you don't use tools.

Server names are arbitrary — they're what agents refer to under mcp_tools.

A server is either remote (declare a url:) or local (declare a command:). The transport is inferred — a url: is HTTP, or SSE if the path contains /sse; a command: is stdio — but you can pin it with an explicit transport:.
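
The inference rule is compact enough to state as a function. This sketch follows the description above, with `infer_transport` as a hypothetical name; the rejection of a block that declares both or neither of url and command is an assumption.

```python
from urllib.parse import urlparse


def infer_transport(server):
    """Return 'http', 'sse', or 'stdio' for one mcp server block."""
    if "transport" in server:
        return server["transport"]  # an explicit value pins the transport
    has_url = "url" in server
    has_command = "command" in server
    if has_url == has_command:
        raise ValueError("declare exactly one of url or command")
    if has_command:
        return "stdio"
    # A remote server is SSE when the URL path contains /sse, else HTTP.
    return "sse" if "/sse" in urlparse(server["url"]).path else "http"
```

For the servers in the examples below, this gives stdio for hello and http for calculator and todoist.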

Common fields

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| transport | enum | no | http, sse, or stdio. Inferred from url/command when omitted; set it to force a transport (e.g. sse on a URL without /sse). |

Remote (url)

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| url | string | yes | MCP endpoint. HTTP, or SSE when the path contains /sse. |
| oauth | varies | no | Per-user OAuth is on by default for URL-based servers (discover mode). Set false to disable on a no-auth server, { scopes: [...] } to override scopes, or a full { mode: static, ... } block for providers without Dynamic Client Registration. See Per-user OAuth for MCP servers. |

Local (command)

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| command | string | yes | Executable to run (stdio transport). |
| args | list<str> | no | Command-line arguments. |
| env | map<str,str> | no | Environment variables for the child. |

Examples

mcp:
  hello:
    command: uvx
    args: [--from, git+https://..., hello-mcp-server]

  calculator:
    url: http://localhost:8080
    oauth: false                 # no-auth server, skip the connect flow

  todoist:
    url: https://ai.todoist.net/mcp   # per-user OAuth implied
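
Two further variations, sketched under assumptions: the server names events and notes, the URL, the command and its arguments, the environment variable, and the scope string are all hypothetical placeholders, not real services.

```yaml
mcp:
  events:
    url: https://mcp.example.com/stream   # placeholder URL; the path has no /sse
    transport: sse                        # pin SSE instead of the inferred HTTP
    oauth:
      scopes: [read]                      # hypothetical scope; overrides the scopes

  notes:
    command: notes-mcp-server             # hypothetical local executable (stdio)
    args: [--root, ./notes]               # hypothetical arguments
    env:
      NOTES_TOKEN: ${NOTES_TOKEN}         # placeholder must resolve at startup
```

The first entry shows the case the transport field exists for: a URL whose path would otherwise be inferred as plain HTTP.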

memory

  • Type: object
  • Optional. Omit for defaults: SQLite at .coulisse/coulisse-memory.db, history-only (no long-term user state).

See Memory configuration for the full walkthrough and examples.

Sub-fields

The database always lives at .coulisse/coulisse-memory.db; its location is not configurable. The only sub-field is user_state.

| Field | Type | Required | Default |
| --- | --- | --- | --- |
| user_state | bool or object | no | false |
| user_state.embed_with | object | no | auto-picked from providers: |
| user_state.learn_from | object | no | auto-picked from providers: |
| user_state.dedup_threshold | float | no | 0.9 |
| user_state.max_facts_per_turn | int | no | 5 |
| user_state.recall_k | int | no | 5 |
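
A minimal sketch of the object form, spelling out the documented defaults for the scalar sub-fields. It assumes that supplying an object enables user state the same way user_state: true does; embed_with and learn_from are left out so they are auto-picked from providers:.

```yaml
memory:
  user_state:
    dedup_threshold: 0.9     # default
    max_facts_per_turn: 5    # default
    recall_k: 5              # default
```

When no value needs tuning, the boolean shorthand user_state: true is enough.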

agents

  • Type: list of agent configs
  • Required. At least one agent must be defined.

Per-agent fields

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| name | string | yes | Unique agent identifier; clients pass this as model. |
| provider | string | yes | Key under providers. |
| model | string | yes | Upstream model identifier. |
| preamble | string | no | System prompt. Default: empty. |
| judges | list<string> | no | Names of judges (from top-level judges:) that evaluate this agent's replies. Empty = no evaluation. |
| max_turns | integer | no | Maximum tool-calling rounds per turn. Default: 8. Raise for agents that chain many tool calls (e.g. a coder that reads files, edits, and dispatches to QA in one go). |
| mcp_tools | list<mcp_tool_access> | no | Tools this agent may use. |
| purpose | string | no | Tool description when this agent is exposed via another agent's subagents. Omit for standalone agents; add a concrete one-line description when this agent is meant to be called as a specialist. |
| skills | list<string> | no | Names of skills (from the top-level skills: directory) this agent may use. Each becomes a tool advertised by its description; calling it returns the skill's instructions. Unknown names are ignored. See Skills. |
| subagents | list<string> | no | Names of other agents exposed as callable tools. Each entry must name another agent or an experiment. Self-reference and duplicates are rejected at startup. |

mcp_tools entry

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| server | string | yes | Key under mcp. |
| only | list<str> | no | Allowed tool names. Omit for full access. |

Complete agent example

agents:
  - name: code-reviewer
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: |
      You are a thorough code reviewer.
    mcp_tools:
      - server: filesystem
        only:
          - read_file
      - server: hello

Subagent example

agents:
  - name: resume_critic
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    purpose: Critique and rewrite a resume for a target role.
    preamble: |
      Given a resume and a target role, return a revised resume
      and a bullet list of the biggest gaps.

  - name: coach
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    subagents: [resume_critic]
    preamble: |
      Delegate resume work to `resume_critic` when relevant.

See Multi-agent routing for the full subagent walkthrough.

experiments

  • Type: list of experiment configs
  • Optional. Omit (or leave empty) to skip A/B testing.

An experiment wraps two or more agents under one addressable name. Clients send the experiment's name in the model field and the router picks a variant per request. Experiment names share the agent namespace — collisions are rejected at startup.

See Experiments for the end-to-end walkthrough.

Per-experiment fields

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| bandit_window_seconds | int | no (bandit) | 604800 (7 d) | Bandit-only. Maximum age of scores included in mean-arm computations. |
| epsilon | float | no (bandit) | 0.1 | Bandit-only. Probability in [0.0, 1.0] of routing to a random arm instead of the leader. |
| metric | string | yes (bandit) |  | Bandit-only. judge.criterion to optimise. The judge must declare the criterion in its rubrics, and every variant must opt into the judge. |
| min_samples | int | no (bandit) | 30 | Bandit-only. Each arm must accumulate this many scores before exploitation is allowed. |
| name | string | yes |  | Addressable name; must not collide with any agent name. |
| primary | string | yes (shadow) |  | Shadow-only. Variant agent that serves the user. Must be one of variants. |
| purpose | string | no |  | Tool description when the experiment is exposed via another agent's subagents:. |
| sampling_rate | float | no (shadow) | 1.0 | Shadow-only. Probability in [0.0, 1.0] that a turn also runs the non-primary variants in the background. |
| sticky_by_user | bool | no | true | When true, the same user always lands on the same variant (deterministic hash, no DB writes). |
| strategy | enum | yes |  | split, shadow, or bandit. |
| variants | list<variant> | yes |  | Non-empty. Each entry references an agent. |

variants entry

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| agent | string | yes |  | Name of an agent declared under top-level agents:. Variants must reference concrete agents; nesting an experiment is rejected. |
| weight | float | no | 1.0 | Strictly positive. Normalised against the sum of all variant weights. |

Example

agents:
  - name: assistant-sonnet
    provider: anthropic
    model: claude-sonnet-4-5-20250929
  - name: assistant-gpt
    provider: openai
    model: gpt-4o

experiments:
  - name: assistant
    strategy: split
    variants:
      - agent: assistant-sonnet
        weight: 0.5
      - agent: assistant-gpt
        weight: 0.5
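
The split example above covers only one strategy. As a hedged sketch, a shadow experiment over the same two agents could look like the following; the experiment name assistant-shadow and the 0.25 sampling rate are illustrative choices, not recommended values.

```yaml
experiments:
  - name: assistant-shadow          # hypothetical name; must not collide with an agent
    strategy: shadow
    primary: assistant-sonnet       # serves the user; must be one of variants
    sampling_rate: 0.25             # about a quarter of turns also run assistant-gpt
    variants:
      - agent: assistant-sonnet
      - agent: assistant-gpt
```

Users only ever see replies from assistant-sonnet here; assistant-gpt runs in the background on the sampled turns.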

judges

  • Type: list of judge configs
  • Optional. Omit (or leave empty) for no automatic evaluation.

Judges are background LLM-as-judge evaluators. An agent opts in by listing judge names in its own judges: field. See LLM-as-judge evaluation for the full walkthrough.

Per-judge fields

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| name | string | yes |  | Unique judge identifier; agents reference it in their judges: list. |
| provider | string | yes |  | Must match a key under providers. |
| model | string | yes |  | Upstream model identifier for the judge call. |
| rubrics | map<string,string> | yes |  | criterion: short description of what to assess. One score row per criterion per scored turn. Must declare at least one entry. |
| sampling_rate | float | no | 1.0 | In [0.0, 1.0]. 1.0 = every turn, 0.1 ≈ 10%, 0.0 = never. |

Rubric descriptions should say what to evaluate — don't include scale, JSON, or format instructions. Coulisse forces the output shape internally (integer 0-10 per criterion with a one-sentence reasoning).

Example

judges:
  - name: quality
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 1.0
    rubrics:
      accuracy:     Factual accuracy. Flag hallucinations.
      helpfulness:  Whether the assistant answered the user's question.
      tone:         Politeness and tone.
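
A judge only scores agents that opt in, and a bandit experiment needs every variant to opt into the judge behind its metric. The sketch below wires the quality judge above into two hypothetical agents (support-a, support-b) and a bandit experiment; it assumes the judge.criterion value is written with a dot separator, which this section does not spell out.

```yaml
agents:
  - name: support-a                 # hypothetical agent
    provider: openai
    model: gpt-4o
    judges: [quality]               # opt in to the quality judge
  - name: support-b                 # hypothetical agent
    provider: openai
    model: gpt-4o-mini
    judges: [quality]

experiments:
  - name: support
    strategy: bandit
    metric: quality.helpfulness     # judge.criterion; dot separator is an assumption
    epsilon: 0.1                    # default
    min_samples: 30                 # default
    variants:
      - agent: support-a
      - agent: support-b
```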

server

  • Type: object
  • Optional. Omit the whole block for the defaults below.
  • Purpose: how the process binds and listens.

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| bind | string (IP) | 0.0.0.0 | Interface to bind. Set 127.0.0.1 to accept loopback only (behind a reverse proxy or tunnel). |
| port | integer (u16) | 8421 | TCP port. Give each coulisse.yaml its own port when running multiple instances on one machine. |
| worker_threads | integer | CPU count | Tokio worker-thread count. Read once at startup; changing it requires a restart. |
| max_body_bytes | integer | 2 MiB (axum default) | Largest accepted request body. Raise for big attachment uploads; lower to harden a public endpoint. |

server:
  bind: 0.0.0.0
  port: 8421
  worker_threads: 4
  max_body_bytes: 8388608   # 8 MiB

The port field moved here from the top level in this release. A bare top-level port: is no longer read — nest it under server:.
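
The migration amounts to moving one key. The sketch below also binds to loopback, which is optional and shown only as the reverse-proxy case from the table.

```yaml
# Before: a bare top-level port is no longer read.
# port: 8421

# After: nest it under server:.
server:
  bind: 127.0.0.1   # optional; loopback only, e.g. behind a reverse proxy
  port: 8421
```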

skills

  • Type: object
  • Optional. Omit the whole block to scan the default ./skills directory.
  • Purpose: where reusable skill bundles are loaded from.

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| dir | string | ./skills | Directory holding one subdirectory per skill, each with a SKILL.md. A missing directory yields no skills (not an error). |

skills:
  dir: ./skills

Each subdirectory with a SKILL.md becomes a skill; agents opt in by listing skill names under their own skills: array. See Skills for the SKILL.md format, bundled resource files, and the skill_file tool.
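
As a sketch of the opt-in, the fragment below pairs the skills block with an agent that lists one skill. The agent name writer and the skill name release-notes are hypothetical, and the example assumes a skill's name matches its subdirectory (./skills/release-notes/SKILL.md); check Skills for how names are actually derived.

```yaml
skills:
  dir: ./skills

agents:
  - name: writer                     # hypothetical agent
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    skills: [release-notes]          # hypothetical skill name
```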

smoke_tests

  • Type: list of smoke test configs
  • Optional. Omit (or leave empty) for no synthetic-user runs.

Each entry pairs a persona (an LLM that role-plays the user) with a target agent or experiment. Triggered from the studio at /admin/smoke/<name>. See Smoke tests for the workflow.

Per-test fields

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| name | string | yes |  | Unique within smoke_tests. |
| target | string | yes |  | Agent or experiment name. Resolved per run via the experiment router. |
| persona | object | yes |  | provider, model, preamble for the role-played user. |
| initial_message | string | no |  | Hard-coded first persona turn. Omit to let the persona open the conversation. |
| stop_marker | string | no |  | Substring that ends the run when emitted by either side. |
| max_turns | integer | no | 10 | Cap on persona-then-agent pairs per run. |
| repetitions | integer | no | 1 | Independent runs launched per click. Each gets a fresh synthetic user id. |

Example

smoke_tests:
  - name: jobseeker_basic
    target: tremplin
    persona:
      provider: anthropic
      model: claude-haiku-4-5-20251001
      preamble: |
        You are a 28-year-old looking for a developer job in Paris.
        Reply like a real human; finish with "[FIN]" once satisfied.
    initial_message: "Hi, I'm looking for work."
    stop_marker: "[FIN]"
    max_turns: 10
    repetitions: 5

telemetry

  • Type: object
  • Optional. When omitted, Coulisse writes fmt logs to stderr at info level and keeps the SQLite mirror that drives the studio UI; no external traces are exported.

The block has three sub-sections — fmt, sqlite, and otlp — each independently toggleable. See Telemetry configuration for the full schema and Telemetry & OpenTelemetry for span semantics and OTLP backend integration.

telemetry:
  fmt:
    enabled: true        # default
  sqlite:
    enabled: true        # default; powers the studio UI
  otlp:                  # absent = no external traces
    endpoint: "http://localhost:4317"
    protocol: grpc       # or http_binary
    service_name: coulisse
    headers:
      authorization: "Bearer ${OTEL_API_KEY}"

Validation

On startup, Coulisse checks:

  • All ${VAR_NAME} placeholders resolve to set environment variables.
  • Each present auth scope (proxy, admin) declares exactly one of basic or oidc.
  • auth.<scope>.basic.password and auth.<scope>.basic.username are non-empty.
  • auth.<scope>.oidc.client_id, issuer_url, and redirect_url are non-empty.
  • There is at least one agent.
  • Agent names are unique.
  • Every agent's provider is configured.
  • Every referenced MCP server is configured.
  • Every name in subagents refers to a defined agent or experiment.
  • No agent lists itself under subagents.
  • subagents entries are unique within an agent (no duplicates).
  • Experiment names are unique and do not collide with any agent name.
  • Each experiment declares at least one variant.
  • Each variant references a defined agent and has a strictly positive weight.
  • Variant agents within an experiment are unique.
  • Strategy-specific fields are only set on the matching strategy (e.g. primary only on shadow, metric only on bandit).
  • For shadow: primary is set and matches one of the variants; sampling_rate is in [0.0, 1.0].
  • For bandit: metric is judge.criterion; the judge exists, declares the criterion in its rubrics, and every variant opts into the judge; epsilon is in [0.0, 1.0].
  • Every referenced judge exists.
  • Judge names are unique.
  • Every judge's provider is configured and supported.
  • Every judge has at least one rubric.
  • Every judge's sampling_rate is in [0.0, 1.0].

Any violation fails fast with an error message that names the offending agent or judge and field.
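
To make a few of these checks concrete, here is a deliberately invalid fragment that should be refused at startup. The agent names are hypothetical, and the exact error text is not shown because it is not documented here.

```yaml
providers:
  anthropic:
    api_key: ${ANTHROPIC_API_KEY}   # must resolve to a set environment variable

agents:
  - name: coach
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    subagents: [coach, coach]       # self-reference and duplicate entry: both rejected
  - name: critic
    provider: openai                # not declared under providers: rejected
    model: gpt-4o
```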

Releasing

Coulisse follows Semantic Versioning. Pre-1.0, minor bumps may include breaking changes to the YAML schema, HTTP surface, or CLI; patch bumps will not.

Cutting a release

  1. Bump the version in the workspace Cargo.toml:

    [workspace.package]
    version = "0.2.0"
    

    All workspace crates inherit this via version.workspace = true, so this is the only place to edit.

  2. Update CHANGELOG.md — rename the ## [Unreleased] section to ## [0.2.0] - YYYY-MM-DD and start a fresh ## [Unreleased] block above it.

  3. Commit, tag, push:

    git commit -am "Release v0.2.0"
    git tag v0.2.0
    git push && git push --tags
    

The v*.*.* tag triggers two workflows:

  • release.yml (cargo-dist) — builds binaries and installers for macOS (x86 + ARM), Linux GNU (x86 + ARM), and Windows MSVC, then publishes them as a GitHub Release with auto-generated notes.
  • docker.yml — builds a multi-arch image and pushes to ghcr.io/almaju/coulisse tagged latest, 0.2, and 0.2.0.
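
For readers unfamiliar with tag-triggered workflows, a GitHub Actions trigger matching that tag pattern typically looks like the sketch below. It shows the general shape only and is not a copy of the repository's release.yml or docker.yml, which may filter tags differently.

```yaml
# Hypothetical excerpt from a workflow file under .github/workflows/
on:
  push:
    tags:
      - "v*.*.*"   # matches tags such as v0.2.0 and v0.2.1
```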

Hotfixes

For patch releases on the latest minor, branch from the previous tag, fix forward, then tag v0.2.1 from that branch. The same two workflows handle it.

Roadmap

What's in Coulisse today, and what's coming.

Working today

  • Multi-agent routing via the model field.

  • Agents as tools — expose one agent to another under subagents: with a purpose: description. Nested invocations are bounded by a depth cap.

  • Skills — reusable instruction bundles (Claude Code / Codex style). Drop a SKILL.md folder under ./skills; agents opt in by name and get one progressive-disclosure tool per skill, plus a skill_file reader for bundled resources.

  • Per-user conversation history with isolation.

  • Long-term memory with semantic recall — persistent via SQLite and backed by a real embedder (OpenAI or Voyage AI; hash fallback for offline dev).

  • Long-term user state — opt-in user_state: true enables a background extractor that pulls durable facts from each exchange and deduplicates them before storing. Embedder and extraction model are auto-derived from your configured providers.

  • Multi-backend support (Anthropic, OpenAI, Gemini, Cohere, DeepSeek, Groq).

  • OpenAI-compatible HTTP API (/v1/chat/completions, /v1/models).

  • Studio UI at /admin/ — browse conversations, memories, and judge scores; edit agents, judges, experiments, and smoke tests live; watch the real-time task board at /admin/live.

  • LLM-as-judge evaluation — background scoring of agent replies against YAML-defined rubrics, with per-judge sampling and per-user persistence.

  • Experiments (A/B testing) — wrap multiple agents under one addressable name and route traffic between them with sticky-by-user defaults. Three strategies: split (weighted random), shadow (primary serves the user, others run in the background and are scored), and bandit (epsilon-greedy on a single judge criterion).

  • Streaming responses over SSE (stream: true, with stream_options.include_usage).

  • MCP tool integration over stdio and HTTP, with per-agent filtering.

  • Per-user OAuth 2.0 for MCP servers (token vault, connect-link flow, per-user session pool).

  • Per-user token rate limiting (hour / day / month).

  • Triggers — start agents on a schedule (cron), via HTTP POST (webhook), or on server boot (boot).

  • Async task queue — dispatch_task enqueues background work; tasks_status inspects the queue from chat; /admin/live shows it in real time.

  • Sidecars — long-lived helper processes (bridges, exporters) spawned and supervised by Coulisse.

  • Config variables (vars:) — named string snippets shared across agent preambles.

  • JSON Schema generation (coulisse schema) for IDE autocompletion and live validation.

  • YAML-driven config with startup validation.

  • Docker image with a volume-mounted SQLite store.

  • Credential-bound identity — auth.proxy.identity: from_credential derives the per-user identity from the authenticated principal (Basic username or OIDC sub) instead of trusting the request body, and rejects a mismatched safety_identifier. Makes adversarial multi-tenant serving safe; mutually exclusive with default_user_id. See User identification.

Planned

Durable rate-limit state

Current rate-limit counters live in memory — they reset on restart and don't span multiple instances. A durable, shared backend is planned so quotas survive reboots and horizontal scaling.

Vector index for large memory stores

Recall currently does a linear cosine scan over all memories for the user. That is fine at hundreds to low thousands of memories per user, but a vector index will be needed if per-user memory counts grow into the tens of thousands.

Per-agent memory overrides

Today the memory: block is global. A future revision will allow per-agent scoping (different embedders or budgets per agent) for cases where one agent handles long-form research and another handles short user chat.


This list reflects what's on deck at the time of writing — check the repository for the current state.