Coulisse

One YAML file. An OpenAI-compatible server with memory, tools, and multi-backend routing.

Coulisse is a single Rust binary that reads a coulisse.yaml file and spins up an OpenAI-compatible HTTP server. You point your existing tools, SDKs, and projects at it like any other OpenAI endpoint — and everything configurable lives in that one YAML file.

Why Coulisse?

Every multi-agent project ends up re-implementing the same plumbing:

  • Per-user conversation memory
  • Routing between model providers
  • Rate limits and retries
  • Tool integration
  • Multiple agents with different system prompts

Coulisse collapses this plumbing into one configurable server. You describe the setup in YAML and pilot the whole thing from there, instead of writing glue code for each prototype.

How it works

┌──────────────────┐        ┌──────────────────┐        ┌──────────────────┐
│  Your SDK / app  │───────▶│     Coulisse     │───────▶│   Anthropic      │
│  (OpenAI client) │        │                  │        │   OpenAI         │
└──────────────────┘        │   coulisse.yaml  │        │   Gemini …       │
                            │                  │        └──────────────────┘
                            │   + memory       │
                            │   + MCP tools    │        ┌──────────────────┐
                            │   + per-user     │───────▶│   MCP servers    │
                            └──────────────────┘        └──────────────────┘
  1. Your application talks to Coulisse using any OpenAI-compatible SDK.
  2. Coulisse picks the agent you asked for (by model name), assembles the user's memory, and calls the right backend.
  3. The response flows back — and the exchange is saved to that user's memory for next time.

What's in the box

Feature                     Status
Multi-agent routing         ✅ Working
Per-user memory             ✅ Persistent (SQLite) with semantic recall
Real embedders              ✅ OpenAI + Voyage (hash fallback for offline dev)
Auto-extraction             ✅ Optional — pulls durable facts from each exchange
MCP tool integration        ✅ Working (stdio + HTTP)
Multi-backend support       ✅ Anthropic, OpenAI, Gemini, Cohere, Deepseek, Groq
OpenAI-compatible API       /v1/chat/completions, /v1/models
Streaming responses         ✅ Server-Sent Events
Rate limiting               ✅ Per-user token quotas (hour / day / month, in-memory)
Studio UI                   ✅ Read-only at /admin/
Workflow orchestration      ⏳ Planned
Durable rate-limit state    ⏳ Planned

Continue to Installation to get started.

Stability

Coulisse is pre-1.0. It follows Semantic Versioning, but during the 0.x phase, minor version bumps (0.1 → 0.2) may include breaking changes to the YAML schema, HTTP surface, or CLI. Patch bumps (0.1.0 → 0.1.1) will not. See the Releasing chapter and CHANGELOG.md for the version history.

Installation

Coulisse is a single Rust binary. Install it from a prebuilt release or build from source.

Requirements

  • A valid API key for at least one supported provider

Install from a release

The latest GitHub Release ships installers for macOS (x86 + ARM), Linux GNU (x86 + ARM), and Windows MSVC.

macOS / Linux:

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/Almaju/coulisse/releases/latest/download/coulisse-installer.sh | sh

Windows (PowerShell):

powershell -ExecutionPolicy Bypass -c "irm https://github.com/Almaju/coulisse/releases/latest/download/coulisse-installer.ps1 | iex"

The installer drops the coulisse binary on your PATH.

Build from source

Requires Rust (edition 2024) — install from rustup.rs.

git clone https://github.com/Almaju/coulisse.git
cd coulisse
cargo build --release

The binary lands at target/release/coulisse. Drop it on your PATH (or alias it) so the rest of this guide can call it as coulisse.

Initialize a config

coulisse init

This writes a minimal coulisse.yaml in the current directory: one OpenAI agent, sqlite memory, the offline hash embedder. Run coulisse init --from-example instead for the full annotated tour covering every section.

Edit the file to set your provider API key.

Start the server

coulisse start

start runs the server detached: it returns immediately and the process keeps running in the background. Stop it later with coulisse stop.

To run attached (logs streaming to your terminal), use coulisse start --foreground — or just coulisse with no subcommand. Either form binds port 8421.

You should see a startup banner like:

  coulisse 0.1.0

  Proxy   →  http://localhost:8421/v1
  Admin   →  http://localhost:8421/admin

  Memory     sqlite at ./.coulisse/memory.db; embedder=hash (dims=256, OFFLINE — no semantic understanding)
  Auth       proxy: open · admin: open

  Agents (1)
    assistant  openai / gpt-4o-mini

The exact lines depend on your config — what matters is that memory, auth, and every configured agent are each acknowledged on startup.

Next: write your first config, or read the CLI reference for every subcommand.

Your first config

A minimal coulisse.yaml has two things: a provider (where to send model calls) and an agent (how to call it).

providers:
  anthropic:
    api_key: sk-ant-your-key-here

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are a helpful assistant.

Save this as coulisse.yaml in your working directory, then run coulisse.

What each piece does

providers

A map of provider kind → credentials. The key must be one of the supported kinds (see Providers). You only need to list the providers you actually use.

agents

A list of agents. Each agent is a named recipe:

  • name — the identifier. Clients ask for the agent by this name via the model field in their request.
  • provider — which configured provider to route to.
  • model — the upstream model identifier to call (e.g. claude-sonnet-4-5-20250929, gpt-4o).
  • preamble — optional system prompt prepended to every conversation.

You can define as many agents as you want — see Multi-agent routing for what that unlocks.

Adding more

Want a code reviewer, a pirate, and a tool-using agent? Just add more entries:

providers:
  anthropic:
    api_key: sk-ant-...
  openai:
    api_key: sk-...

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are a helpful assistant.

  - name: code-reviewer
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: |
      You are a thorough code reviewer. Focus on correctness,
      clarity, and security.

  - name: gpt-assistant
    provider: openai
    model: gpt-4o
    preamble: You are a helpful assistant.

Restart the server — all three agents are now selectable by model name.

Next: make a request.

Making a request

Coulisse exposes an OpenAI-compatible API, so any OpenAI SDK works. Point the client at http://localhost:8421/v1 and set the model field to an agent name from your config.

curl

curl http://localhost:8421/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "assistant",
    "safety_identifier": "user-123",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8421/v1",
    api_key="not-needed",  # Coulisse doesn't check this
)

response = client.chat.completions.create(
    model="assistant",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"safety_identifier": "user-123"},
)

print(response.choices[0].message.content)

TypeScript / JavaScript

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8421/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "assistant",
  messages: [{ role: "user", content: "Hello!" }],
  // @ts-expect-error — extra field passed through
  safety_identifier: "user-123",
});

console.log(response.choices[0].message.content);

The safety_identifier field

Coulisse identifies users through the safety_identifier field (or the deprecated user field, which is still accepted). The identifier is what keeps each user's conversation history isolated.

You can turn this off — see User identification — but by default every request needs one.

Listing available agents

curl http://localhost:8421/v1/models

Returns every agent you've defined, in OpenAI's model-list format.
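
If you're already on the OpenAI SDK, the same list is available programmatically. A minimal sketch, assuming the quick-start server is running locally — each returned id is an agent name you can pass as model:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8421/v1", api_key="not-needed")

# Each entry's id is an agent name from coulisse.yaml.
for model in client.models.list():
    print(model.id)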


That's the whole loop. Next, dig into how to configure providers.

Providers

Providers are where your model calls actually go. Configure each provider once with its credentials; reference it by name from any number of agents.

Supported providers

Kind        Config key
Anthropic   anthropic
Cohere      cohere
Deepseek    deepseek
Gemini      gemini
Groq        groq
OpenAI      openai

Shape

providers:
  anthropic:
    api_key: sk-ant-...
  openai:
    api_key: sk-...
  gemini:
    api_key: ...

Each provider takes a single field: api_key. You only need to list the providers you plan to use — unused ones can be omitted entirely.

Validation

When Coulisse loads your config, it checks that every agent's provider field matches a key under providers. Misspell a provider and startup fails with a clear error:

agent 'assistant' references provider 'antropic' which is not configured

Switching providers

Because providers are referenced by name, switching an agent from one backend to another is a one-line change:

agents:
  - name: assistant
    provider: anthropic            # ← change this …
    model: claude-sonnet-4-5-20250929   # ← … and this
    preamble: You are helpful.

No client code changes, no redeployment of downstream apps. See Multi-backend support for more on mixing providers.

Agents

Agents are the named personas clients can talk to. Each agent pins down:

  • Which provider to call
  • Which upstream model to ask for
  • What system prompt to prepend
  • Which tools (if any) to expose

Shape

agents:
  - name: code-reviewer
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: |
      You are a thorough code reviewer. Focus on correctness,
      clarity, and security. Point out subtle bugs and suggest
      concrete improvements.
    mcp_tools:
      - server: hello
        only:
          - say_hello

Fields

name (required)

The agent identifier. Clients select this agent by passing name as the model field in their request. Names must be unique across the config.

provider (required)

Must match a key under the top-level providers map. Tells Coulisse which backend to route through.

model (required)

The upstream model identifier. This is provider-specific — e.g. claude-sonnet-4-5-20250929 for Anthropic, gpt-4o for OpenAI, gemini-2.0-flash for Gemini.

preamble (optional)

A system prompt prepended to every conversation this agent handles. Use it to define tone, expertise, constraints, output format — anything you'd normally put in a system message.

Defaults to empty. YAML block scalars (|) are handy for multi-line preambles.

judges (optional)

A list of judge names (from the top-level judges: block) that evaluate this agent's replies in the background. Empty or omitted = no evaluation. See LLM-as-judge evaluation for the full story.

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    judges: [quality, deep_audit]

mcp_tools (optional)

A list of MCP servers and tools this agent is allowed to use. See MCP tools for the full story.

mcp_tools:
  - server: hello           # all tools from "hello"
  - server: calculator      # all tools from "calculator"
    only:                   # …but only these specific ones
      - add
      - multiply

subagents (optional)

A list of other agent names exposed to this agent as callable tools. When the agent's model decides to invoke one, Coulisse starts a fresh conversation against that agent and returns its final message as the tool result.

subagents: [onboarder, resume_critic]

Each name must refer to another entry under agents. Self-reference and duplicates are rejected at startup. Nested invocations are capped at depth 4 to prevent runaway loops. See Multi-agent routing for the full walkthrough.

purpose (optional)

A short tool description shown to other agents when this one is listed under their subagents. Keep it concrete — it's how a calling agent's model decides when to invoke this specialist. Omit it for agents that are only used directly by clients (never as subagents). Without it, the description falls back to "Invoke the '<name>' subagent." — functional, but a hand-written purpose is what makes multi-agent orchestration reliable.

purpose: Critique and rewrite a resume for a target role.

Runtime overrides

Agents can also be created, edited, and disabled at runtime through the admin UI or HTTP without touching coulisse.yaml. These runtime entries live in the SQLite database alongside conversation memory and judge scores; the YAML file is never modified by the server.

The resolution rule is simple: when a name is requested, the database is checked first. If a row exists there, it wins. Otherwise the YAML entry (if any) is used. A row can also be a tombstone — a marker that disables a YAML-declared name without removing it from the file.

Each runtime row carries a label visible in the admin UI:

  • yaml — the agent comes from coulisse.yaml, no database row exists.
  • dynamic — created via the admin UI or HTTP; no YAML entry of this name.
  • override — both YAML and the database define this name; the database version is what runs.
  • tombstoned — a database row disables this name; the agent is hidden from clients even if YAML still declares it.

A "Reset to YAML" action on an override deletes the database row, letting the YAML version reassert. The same action on a tombstoned row re-enables the agent. Database edits never modify the YAML file: if you want a change to survive a database wipe, edit the YAML.

Several agents, one config

Define as many agents as you want. A common pattern is having variants of the same model with different preambles:

agents:
  - name: friendly
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are warm and encouraging.

  - name: terse
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: Reply in one sentence. No preamble, no filler.

  - name: pirate
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: Respond exclusively as a pirate, arrr.

Clients switch between them by changing the model field — no server redeploy, no code change.

MCP tools

Coulisse can borrow tools from Model Context Protocol servers and hand them to your agents. Two transports are supported:

  • stdio — Coulisse spawns a local command and talks to it over stdin/stdout.
  • http — Coulisse connects to a running Streamable-HTTP MCP endpoint.

Declaring MCP servers

Add an mcp section with a named entry per server:

mcp:
  hello:
    transport: stdio
    command: uvx
    args:
      - --from
      - git+https://github.com/macsymwang/hello-mcp-server.git
      - hello-mcp-server

  calculator:
    transport: http
    url: http://localhost:8080

stdio fields

  • transport: stdio
  • command (required) — the executable to spawn (uvx, python, node, …)
  • args (optional) — arguments to pass
  • env (optional) — environment variables for the child process
mcp:
  my-tool:
    transport: stdio
    command: python
    args: [-m, my_mcp_server]
    env:
      DEBUG: "1"
      API_KEY: abc123

http fields

  • transport: http
  • url (required) — the endpoint URL

Granting tool access to agents

An agent only sees tools you explicitly give it. Reference the server name under mcp_tools:

agents:
  - name: helper
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    mcp_tools:
      - server: hello           # all tools from "hello"

Restrict to a subset with only:

    mcp_tools:
      - server: hello
        only:
          - say_hello           # only this tool, nothing else

Discovering tool names

On startup Coulisse connects to each MCP server and logs the tools it discovered. Tool names in your only list must match what the server advertises — check the startup output or the server's own docs.

How tool calls work

When a request arrives for an agent with tools:

  1. Coulisse collects the agent's allowed tools from the MCP servers.
  2. It forwards them to the model as tool definitions.
  3. If the model calls a tool, Coulisse dispatches to the MCP server and feeds the result back.
  4. This loops until the model produces a final answer (up to 8 turns).

Your client doesn't see any of this — the tool loop is invisible, and only the final assistant message is returned.
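
Conceptually this is the standard OpenAI-style tool-calling loop run server-side. The sketch below is illustrative Python, not Coulisse's actual Rust implementation — call_model and dispatch_tool are hypothetical stand-ins for the provider call and the MCP dispatch:

def run_tool_loop(call_model, dispatch_tool, messages, tools, max_turns=8):
    """Illustrative only — Coulisse runs this loop internally, in Rust."""
    reply = call_model(messages, tools)            # forward tool definitions to the model
    for _ in range(max_turns):
        if not reply.tool_calls:                   # model answered directly — done
            return reply
        messages.append(reply)
        for call in reply.tool_calls:              # dispatch each requested tool
            result = dispatch_tool(call.name, call.arguments)   # via the MCP server
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
        reply = call_model(messages, tools)        # let the model continue with the results
    return reply                                   # turn cap reached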

See MCP tool integration for a full walkthrough.

Memory

The memory: block in coulisse.yaml controls where data is stored, which embedder turns text into vectors, and whether auto-extraction runs after each turn. Every field has a sensible default — omit the block entirely and Coulisse falls back to an on-disk SQLite file and the offline hash embedder.

Shape

memory:
  backend:
    kind: sqlite                   # 'sqlite' (default) or 'in_memory'
    path: ./coulisse-memory.db     # sqlite only
  embedder:
    provider: openai               # 'openai', 'voyage', or 'hash'
    model: text-embedding-3-small  # required for openai/voyage
    # api_key: <override>          # optional — falls back to providers.openai.api_key
  extractor:                       # omit to disable auto-extraction
    provider: anthropic            # one of providers.* keys
    model: claude-haiku-4-5-20251001
    dedup_threshold: 0.9           # optional
    max_facts_per_turn: 5          # optional
  context_budget: 8000             # optional
  memory_budget_fraction: 0.1      # optional
  recall_k: 5                      # optional

memory.backend

Field   Type     Required   Notes
kind    enum     yes        sqlite or in_memory.
path    string   no         Filesystem path for sqlite. Created if missing. Default ./coulisse-memory.db.

in_memory is a SQLite database that lives only for the process lifetime — use it for tests or throw-away demos. sqlite is the production default; for Docker, point path at a volume-mounted location (e.g. /var/lib/coulisse/memory.db).

memory.embedder

Field      Type     Required   Notes
provider   enum     yes        openai, voyage, or hash.
model      string   depends    Required for openai and voyage. Ignored for hash.
api_key    string   no         Falls back to providers.<provider>.api_key when unset.
dims       int      no         Hash only. Default 32.

Supported models

  • openai: text-embedding-3-small (1536 dims, default), text-embedding-3-large (3072 dims), text-embedding-ada-002 (1536 dims).
  • voyage: voyage-3.5 (1024, default), voyage-3-large (1024), voyage-3.5-lite (1024), voyage-code-3 (1024), voyage-finance-2 (1024), voyage-law-2 (1024), voyage-code-2 (1536).

Unknown model names fail at startup with a clear error.

Which to pick

  • Using Anthropic for completions? Anthropic has no embedding API — use Voyage (their official recommendation).
  • Using OpenAI? Stay on OpenAI for consistency.
  • Offline / air-gapped? Use hash — it has no semantic understanding but is fast and deterministic.

memory.extractor

Omit this block to disable auto-extraction. When present:

Field                Type     Required   Notes
provider             string   yes        Must match a key under top-level providers:.
model                string   yes        Upstream model identifier. Prefer the cheapest usable model.
dedup_threshold      float    no         Cosine similarity above which an extracted fact is considered a duplicate. Default 0.9.
max_facts_per_turn   int      no         Cap on facts written per exchange. Default 5.

The extractor runs as a background task after each successful completion — it never blocks the HTTP response. Failures are logged at warn and swallowed.

Budget knobs

Field                    Default        Meaning
context_budget           8,000 tokens   Total window for messages + memories.
memory_budget_fraction   0.1 (10%)      Share of the budget reserved for recalled memories.
recall_k                 5              Top-k memories fetched per request.

Startup log line

On boot, Coulisse prints the memory config it resolved:

  memory: sqlite at ./coulisse-memory.db; embedder=openai / text-embedding-3-small
  extractor: anthropic / claude-haiku-4-5-20251001 (dedup_threshold=0.9, max_facts_per_turn=5)

Or when the extractor is off:

  extractor: disabled (memory only grows via explicit API calls)

Example configs

OpenAI end-to-end

providers:
  openai:
    api_key: sk-...

memory:
  embedder:
    provider: openai
    model: text-embedding-3-small
  extractor:
    provider: openai
    model: gpt-4o-mini

Anthropic completions + Voyage embeddings

providers:
  anthropic:
    api_key: sk-ant-...

memory:
  embedder:
    provider: voyage
    model: voyage-3.5
    api_key: pa-...          # Voyage is not under providers: so set the key here
  extractor:
    provider: anthropic
    model: claude-haiku-4-5-20251001

Offline dev — no external calls

memory:
  backend:
    kind: in_memory          # ephemeral; evaporates on restart
  embedder:
    provider: hash
  # no extractor, no embeddings API calls, no persistence

Telemetry

The telemetry: block controls observability — what Coulisse logs to stderr, what it persists to SQLite for the studio UI, and whether it ships traces to your own OpenTelemetry backend.

Every field has a sensible default. Omit the block and you get stderr logs at info plus the studio's per-turn event tree, with no external traces.

Shape

telemetry:
  fmt:
    enabled: true        # stderr logs; default on
  sqlite:
    enabled: true        # mirrors spans into the studio's tables; default on
  otlp:                  # absent = disabled (default)
    endpoint: "http://localhost:4317"
    protocol: grpc       # or http_binary
    service_name: coulisse
    headers:
      authorization: "Bearer ${OTEL_API_KEY}"

All three layers compose. Turn sqlite off if you don't need the studio. Add otlp to ship the same traces to Grafana, SigNoz, Jaeger, Honeycomb, or any OTLP-compatible backend.

telemetry.fmt

Field     Type   Required   Notes
enabled   bool   no         Default true.

Writes structured logs to stderr. The level is controlled by the RUST_LOG environment variable; without it, the default is info,sqlx=warn (info from Coulisse, warnings only from the SQL driver). To see internal SQL traffic, run with RUST_LOG=debug. To silence everything, set RUST_LOG=error.

telemetry.sqlite

Field     Type   Required   Notes
enabled   bool   no         Default true.

Mirrors turn and tool_call tracing spans into the events and tool_calls tables that the studio UI reads. Without this layer, the studio loses its per-turn event tree and tool-call panel.

The schema is part of the same SQLite file the rest of Coulisse persists into (controlled by memory.backend.path).

telemetry.otlp

Absent (the default) means Coulisse does not export traces externally. To plug Coulisse into your own observability stack, set the block:

Field          Type     Required   Notes
endpoint       string   yes        Collector URL.
protocol       enum     no         grpc (default) or http_binary.
service_name   string   no         OpenTelemetry resource attribute service.name. Default coulisse.
headers        map      no         Static HTTP/gRPC headers attached to every export.

Endpoint defaults

  • gRPC (the default): port 4317, e.g. http://localhost:4317.
  • HTTP/protobuf: port 4318, e.g. http://localhost:4318/v1/traces.

The collector you point at decides the rest — Coulisse ships traces with service.name = coulisse and span names turn, tool_call, and llm_call. Span fields carry user_id, turn_id, agent, tool_name, kind, and the rest documented in the features chapter.

Headers

Useful for managed backends:

telemetry:
  otlp:
    endpoint: "https://ingest.us.signoz.cloud:443"
    protocol: grpc
    headers:
      "signoz-access-token": "${SIGNOZ_TOKEN}"

YAML doesn't expand ${...} itself; substitute at deploy time (helm, envsubst, sops, etc.).

How the layers compose

The cli binary installs a single tracing_subscriber registry with the layers your config asks for, in order:

  1. RUST_LOG env filter
  2. fmt → stderr (when fmt.enabled)
  3. sqlite → events + tool_calls tables (when sqlite.enabled)
  4. otlp → external collector (when otlp is set)

Every span emitted by the running server fans out to all enabled layers. There is no priority or fallback — the SQLite layer keeps full payloads (full prompts, args, results), the OTLP layer ships the same span attributes to your collector. If your backend chokes on multi-megabyte attributes, drop those fields in your collector pipeline rather than at the source.

User identification

Coulisse keeps separate memory per user. To do that, it needs to know who is making each request.

How users are identified

Requests identify the user via one of these fields, in order:

  1. safety_identifier (preferred — matches OpenAI's recent schema)
  2. user (deprecated, but still accepted)
{
  "model": "assistant",
  "safety_identifier": "alice@example.com",
  "messages": [...]
}

The identifier can be anything — an email, an internal user ID, a UUID, an opaque token. Coulisse derives a stable internal UUID from it:

  • If you pass a valid UUID, that's what's used.
  • Otherwise, a deterministic v5 UUID is derived from the string, so the same identifier always maps to the same user.
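
For intuition, the derivation behaves like Python's name-based uuid5 — a sketch with a hypothetical namespace constant (Coulisse's real namespace isn't documented here), showing why the mapping is stable:

import uuid

# Hypothetical namespace — Coulisse's actual namespace constant may differ.
COULISSE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "coulisse")

def internal_user_id(identifier: str) -> uuid.UUID:
    try:
        return uuid.UUID(identifier)                        # a valid UUID is used as-is
    except ValueError:
        return uuid.uuid5(COULISSE_NAMESPACE, identifier)   # same string -> same UUID

print(internal_user_id("alice@example.com"))                # stable across runs and restarts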

Requiring identification

By default, Coulisse requires every request to carry an identifier. Unidentified requests are rejected with an error. This is the safe default: memory only works if you know who you're talking to.

default_user_id: a single-user fallback

For local development or single-user deployments, you can declare a default_user_id in coulisse.yaml. When a request arrives without safety_identifier or user, Coulisse acts as if that default had been passed.

default_user_id: main        # everyone's anonymous requests bucket here

providers:
  anthropic:
    api_key: sk-ant-...

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929

With a default_user_id set:

  • Requests that omit both safety_identifier and user fall back to the default. They get memory like any other user — just scoped to that shared bucket.
  • Requests that do include an identifier still get their own scope.
  • All anonymous requests share one memory bucket and one rate-limit counter, because they all map to the same id.

When to set it

Good reasons:

  • Local / single-user setups where you don't want to bother sending an identifier.
  • Small deployments behind an auth layer that handles identity upstream but doesn't want to plumb it through.

Don't set default_user_id in multi-tenant deployments — any request that forgets its identifier would silently land in the shared bucket instead of being rejected, defeating per-user isolation. Leave it unset so missing identifiers are rejected.

Studio UI

Coulisse ships a studio UI for browsing the conversations and memories the server has seen, and for editing the live YAML config. It's served by the same binary, under /admin/.

Point a browser at http://localhost:8421/admin/ while the server is running.

What you can do

  • List every user the server has seen, most recent activity first, with message and memory counts.
  • Open a user to see their full conversation (user, assistant, and system messages) with per-message token counts and relative timestamps.
  • See every tool invocation that happened during each assistant turn — rendered inline in the conversation as a collapsed block above the assistant bubble. Expand to see the args, the result (or error body), and a badge marking MCP vs subagent calls. This is the debug view for figuring out what the agent tried and what came back.
  • Open the per-turn Telemetry block under any assistant message to see the full causal tree that produced it: every tool call (MCP or subagent) at every depth, with args, result, error, and duration. Unlike the inline top-level tool calls, the telemetry tree also surfaces tool calls made inside subagents — so when a subagent's MCP call fails, the real error is right there instead of being paraphrased into the assistant's text.
  • See the long-term memories recalled for that user, tagged as fact or preference.
  • See the LLM-as-judge scores for that user, including mean score per (judge, criterion) and the most recent individual scores with reasoning.
  • Browse configured experiments at /admin/experiments — strategy, sticky-by-user flag, per-variant weights, and bandit-strategy mean scores live-loaded from judges.
  • Run smoke tests at /admin/smoke — a synthetic-user persona drives a real conversation against any agent or experiment, scores fan out through the same judge pipeline, and the run viewer shows the full transcript with persona/assistant turns side by side. Useful for iterating on agent prompts without writing test scaffolding.
  • Edit, add, or disable agents, judges, experiments, and smoke tests at /admin/agents, /admin/judges, /admin/experiments, and /admin/smoke. Each form is a YAML textarea over the same config shape used in coulisse.yaml. Edits and creations write to the database, never to coulisse.yaml; runtime resolution checks the database first, then falls back to YAML. List views label each row as yaml, dynamic (database-only), override (database shadows YAML), or tombstoned (disabled). Override rows expose a "Reset to YAML" action that drops the database row so the YAML version reasserts. See Agents → Runtime overrides for the full semantics — judges, experiments, and smoke tests follow the same model.

Editing config: admin UI = API

Every admin route is content-negotiated. The same URL serves an HTML page in a browser, an HTML fragment to htmx, and JSON to a script — whichever the client's Accept/HX-Request headers ask for. The UI is a thin representation of the API; nothing the UI can do is unavailable to a curl call.

# List agents as JSON (effective merged view: database overrides + YAML)
curl -H 'Accept: application/json' http://localhost:8421/admin/agents

# Update an agent (writes to the database, not to coulisse.yaml)
curl -X PUT http://localhost:8421/admin/agents/bob \
     -H 'Content-Type: application/yaml' \
     --data-binary $'name: bob\nprovider: openai\nmodel: gpt-4o\n'

# Reset an override or tombstone — drops the database row, YAML reasserts
curl -X POST http://localhost:8421/admin/agents/bob/reset

# Replace the whole config file in one shot (this writes to coulisse.yaml)
curl -X PUT http://localhost:8421/admin/config \
     -H 'Content-Type: application/yaml' \
     --data-binary @coulisse.yaml

Agent writes through /admin/agents go to the database, never to coulisse.yaml. Other sections (/admin/config, providers, judges, experiments, smoke tests, etc.) still write to YAML. The two write paths are independent: editing an agent in the database has no effect on the file you committed to git.

File watcher: hand-edits hot-reload

Coulisse watches coulisse.yaml while it runs. Edit it in your editor, save, and the live config updates without a restart. The validator runs before any reload — a broken edit is logged and the previous in-memory config keeps serving traffic until you fix the file.

What hot-reloads today: the agents list (runtime + admin display), the judges and experiments lists (admin display only — the routing tables that consume them are still rebuilt on restart). What still requires restart: providers, MCP servers, memory backend, telemetry pipeline, auth.

YAML formatting

Admin saves go through serde_yaml round-trip serialization, so comments, blank lines, and key ordering are not preserved. If you want commented config, hand-edit the file — the watcher picks the change up the same way an admin save would. Comment-preserving writes are tracked as a follow-up.

Authentication

The admin surface is gated by the auth.admin scope in coulisse.yaml. Two mutually exclusive modes: HTTP Basic auth (good for local dev) or OIDC single sign-on (appropriate for shared deployments). Exactly one belongs under auth.admin.

The /v1/chat/completions and /v1/models endpoints use the separate auth.proxy scope — they are never gated by admin auth. SDK clients stay cookie-free even when the studio runs behind OIDC.

Basic auth

auth:
  admin:
    basic:
      password: choose-something-strong
      username: admin   # optional, defaults to "admin"

Every /admin/* request must carry Authorization: Basic <base64(user:pass)>. Browsers prompt via the native login dialog and cache credentials per origin.

OIDC (single sign-on)

Works with any OIDC-compliant IdP: Authentik, Keycloak, Auth0, Google, Microsoft, Okta.

auth:
  admin:
    oidc:
      issuer_url:    https://authentik.example.com/application/o/coulisse/
      client_id:     coulisse-admin
      client_secret: <confidential-client-secret>   # omit for public PKCE clients
      redirect_url:  http://localhost:8421/admin/
      scopes:        [email, profile]               # optional; openid is always added

On first request, the user is redirected to the IdP to log in; afterwards an encrypted session cookie keeps them authenticated on /admin/* until it expires (8 hours of inactivity).

Access control (who may log in) is delegated to the IdP. Coulisse treats "successfully authenticated by your IdP" as "authorized admin" — configure the allow-list in the IdP's application policy, not here.

Authentik setup: create a new OAuth2/OpenID Provider and Application, set the redirect URI to the redirect_url above (Authentik allows every subpath of it by default), and point Coulisse at the issuer URL of the provider. Add the application to the groups that should have access via Authentik bindings.

Sessions are in-memory: they evaporate on restart — users re-authenticate silently if their IdP session is still valid, otherwise they see the login page again.

Leaving it open

Omit the auth.admin block to leave the admin surface unauthenticated. That's fine on a loopback-only dev box, but never expose an unauthenticated admin surface to the network. If you'd rather terminate auth at your infrastructure layer, put Coulisse behind a reverse proxy (oauth2-proxy, Cloudflare Access, Caddy's forward_auth), a VPN, or an SSH tunnel.

How it's built

The studio is composed in the cli binary. Each feature crate (memory, telemetry, judges, experiments) owns its own admin module — its routes, its askama templates, and its view models. The cli crate wires them together: a single base.html shell, the auth wrapping, and a tower middleware that wraps non-htmx responses in the layout so bookmarked deep URLs render with full navigation.

Cross-feature views (e.g. tool-call panels inside a conversation page) are filled in via htmx fragments — the conversation page, owned by memory, embeds hx-get requests against telemetry and judges. No feature crate depends on another for its admin surface; the browser orchestrates the composition. Tailwind (loaded via CDN) provides styling. Everything ships in the single Coulisse binary; there is no separate frontend build step.

Multi-agent routing

Coulisse lets you define multiple agents and route between them with nothing more than the model field of a request. No extra endpoints, no custom headers, no proxy tricks.

Why it matters

Most apps end up needing more than one model configuration:

  • A fast, cheap agent for classification and quick replies.
  • A heavier agent for hard reasoning.
  • A specialized agent (code reviewer, translator, summarizer) with a tuned preamble.
  • A tool-using agent that can reach into an MCP server.

Without something like Coulisse, that means either multiple deployments or a growing pile of if (mode === ...) switches inside your app.

The pattern

Declare each variant as a separate agent:

agents:
  - name: triage
    provider: anthropic
    model: claude-haiku-4-5-20251001
    preamble: Classify the user's intent. Reply with a single word.

  - name: reasoner
    provider: anthropic
    model: claude-opus-4-7
    preamble: You are a careful reasoner. Think step by step.

  - name: translator
    provider: openai
    model: gpt-4o
    preamble: Translate the user's message into French.

Your application picks which agent to call by setting the model field:

fast  = client.chat.completions.create(model="triage", ...)
smart = client.chat.completions.create(model="reasoner", ...)
fr    = client.chat.completions.create(model="translator", ...)

What each agent brings to the request

When a request arrives, Coulisse:

  1. Looks up the named agent.
  2. Prepends the agent's preamble as a system message.
  3. Resolves the agent's allowed MCP tools (if any).
  4. Forwards the call to the agent's configured provider and model.
  5. Records the exchange in the caller's per-user memory.

Changing agents is free — you don't need to redeploy anything on the client side.

Discovering agents at runtime

GET /v1/models returns every agent in the config in OpenAI's standard model-list format. Useful for UIs that want to populate a model picker from the server:

curl http://localhost:8421/v1/models

Subagents: agents as tools

Routing by model lets the client pick an agent per request. Sometimes you want one agent to call another from within a turn, so the conversation stays with the top-level agent while specialists handle focused sub-tasks. Coulisse exposes this via the subagents field.

agents:
  - name: onboarder
    provider: anthropic
    model: claude-haiku-4-5-20251001
    purpose: Collect the user's profile — first name, last name, phone, goals.
    preamble: |
      Ask the user for any missing profile field. Keep questions short.

  - name: resume_critic
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    purpose: Critique and rewrite a resume for a target role.
    preamble: |
      Given a resume and a target role, return a revised resume and
      a bullet list of the biggest gaps to address.

  - name: career_coach
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    subagents: [onboarder, resume_critic]
    preamble: |
      Guide the user. Delegate to `onboarder` if the profile is
      incomplete, and `resume_critic` when they want resume work.

When career_coach runs, the onboarder and resume_critic agents appear in its tool list alongside any MCP tools. If the model calls onboarder, Coulisse starts a fresh conversation against that agent with just the message it was given — the onboarder sees its own preamble and its own MCP tools, nothing inherited from the parent. The onboarder's final assistant message is returned to the coach as the tool result.
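
From the client's side nothing changes: you still address the top-level agent and never see the delegation. A sketch using the OpenAI Python SDK, assuming the config above is loaded:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8421/v1", api_key="not-needed")

# career_coach may invoke onboarder or resume_critic internally;
# the client only ever receives the coach's final message.
reply = client.chat.completions.create(
    model="career_coach",
    messages=[{"role": "user", "content": "Can you help me tailor my resume for a data role?"}],
    extra_body={"safety_identifier": "user-42"},
)
print(reply.choices[0].message.content)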

The purpose field

purpose is the tool description shown to the calling agent. It's how the coach's LLM decides whether this subagent is the right choice for the current turn. Keep it short and concrete — "Critique and rewrite a resume for a target role" is good; "Helpful assistant" is useless.

If purpose is absent, Coulisse falls back to "Invoke the '<name>' subagent." — functional, but a clear purpose is what makes orchestration reliable.

Bounded recursion

Calling a subagent is itself a tool call — the subagent can have its own subagents, which can have their own, and so on. To prevent a pathological A → B → A → … loop from burning tokens, Coulisse caps nested invocations at depth 4. Going over returns a clear error that the parent agent sees and can react to.

Fresh context

Every subagent invocation starts with a new conversation. The subagent does not see the parent's message history, the user's original request, or any other sibling subagent's output. It gets only the message the parent passed when calling it, plus its own preamble.

This isolation is deliberate. It keeps subagents focused, prevents context bloat, and makes each subagent's behavior reproducible in isolation. If you want data to flow between agents, store it in an MCP server and have both agents read it — Coulisse owns no cross-agent state.

Why subagents and MCPs live side by side

mcp_tools and subagents both appear in an agent's tool list, but they model different things:

  • An MCP tool is a stateless function call against an external server: fixed schema, data in and data out.
  • A subagent is another LLM conversation that happens to be kicked off by a tool call. It has its own preamble, its own tool loop, and can itself delegate further.

Reach for mcp_tools when the work is a concrete operation (save a record, search a database, send an email). Reach for subagents when the work needs its own LLM reasoning under a different preamble.

Per-user memory

Every request that carries a user identifier gets an isolated, persistent memory scope. Coulisse tracks two kinds of memory:

  • Conversation history — the running transcript of messages the user has exchanged.
  • Long-term memories — durable facts and preferences, embedded for semantic recall.

You don't need to manage this — it happens automatically on every request. When auto-extraction is on, Coulisse also decides what is worth remembering.

What happens on each request

  1. Coulisse identifies the user from safety_identifier (or user).
  2. It pulls the user's recent messages, fitting as many as possible into the context budget.
  3. It runs a semantic recall against the user's long-term memories, picking the top matches.
  4. It builds the final prompt: agent preamble → recalled memories → recent history → new message.
  5. The model's reply is sent back and saved to the user's transcript.
  6. If an extractor is configured, a background task asks a cheap model "any durable facts to remember from this exchange?" and stores novel ones.

Step 6 does not block the HTTP response — the user gets their answer first; memory grows in the background.
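
You can see the effect with two ordinary requests under the same identifier. A sketch assuming the quick-start config and a running server — what actually gets recalled depends on your embedder and extractor settings:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8421/v1", api_key="not-needed")

def ask(text: str) -> str:
    reply = client.chat.completions.create(
        model="assistant",
        messages=[{"role": "user", "content": text}],
        extra_body={"safety_identifier": "alice@example.com"},  # same user both times
    )
    return reply.choices[0].message.content

ask("For what it's worth, my favourite database is SQLite.")
# A later, separate request can draw on that exchange:
print(ask("Which database do I prefer?"))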

Isolation guarantees

User isolation is enforced by the API: Store::for_user(id) returns a handle scoped to a single user, and every SQL query bound through it filters on that user id. There is no code path that mixes data across users.

The context budget

Knob                     Default        Meaning
context_budget           8,000 tokens   Total window size for messages + memories.
memory_budget_fraction   0.1 (10%)      Share of the budget reserved for recalled long-term memories.
recall_k                 5              How many long-term memories to recall per request.

The remaining 90% goes to recent message history, newest first. If the history doesn't fit, older messages are dropped.
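
With the defaults, the split works out like this — a small worked example of the budget arithmetic, not Coulisse's exact tokenizer accounting:

context_budget = 8_000                 # total tokens for messages + memories
memory_budget_fraction = 0.1

memory_budget = int(context_budget * memory_budget_fraction)   # 800 tokens for recalled memories
history_budget = context_budget - memory_budget                # 7,200 tokens for recent history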

Embedders

Long-term memories are embedded as vectors. On each request, Coulisse embeds the incoming message and retrieves the top-k most similar memories by cosine similarity. That's how context from a conversation two weeks ago can surface when it becomes relevant again.

  • openai — text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002. Default pairing for OpenAI-first setups.
  • voyage — voyage-3.5, voyage-3-large, voyage-3.5-lite, voyage-code-3, voyage-finance-2, voyage-law-2, voyage-code-2. Anthropic officially recommends Voyage for embeddings.
  • hash — n/a. Deterministic bag-of-words, offline only. No semantic understanding — use only for tests and air-gapped development.

Startup logs the chosen embedder. For hash the log line carries an explicit "OFFLINE — no semantic understanding" tag so nobody deploys it by accident.
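
Recall itself is a nearest-neighbour search by cosine similarity over the user's stored vectors. A minimal sketch of the idea, not the actual SQLite-backed implementation:

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall(query_vec: list[float], memories: list[tuple[str, list[float]]], k: int = 5) -> list[str]:
    # memories: (text, embedding) pairs stored for this user
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]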

Auto-extraction ("remember what matters")

When you set memory.extractor in YAML, every completed exchange fires a background task that:

  1. Sends the last user-turn + assistant-turn to a cheap model with a focused prompt: "list any durable facts or preferences about the user; return [] if nothing worth keeping."
  2. Parses the JSON response.
  3. For each extracted fact, calls remember_if_novel — which embeds the fact and skips it if cosine similarity against an existing memory exceeds dedup_threshold (default 0.9).

Failures (bad JSON, timeout, provider error) are logged at warn and swallowed — the user already got their response. Extraction is best-effort.

To disable, omit the memory.extractor block entirely. Memories will still be recalled and can be populated through other code paths, but nothing writes to them automatically.

What gets stored where

Data                            Scope      Storage
Conversation messages           Per user   SQLite (messages table)
Long-term memories + vectors    Per user   SQLite (memories table, BLOB embeddings)
Tool invocations                Per user   SQLite (tool_calls table, linked to messages.id)
Judge scores                    Per user   SQLite (scores table, linked to messages.id)
User identifier → internal ID   Shared     Derived deterministically — no storage needed

Each memory row carries the id of the embedder that produced it. If you swap the embedder, old vectors become ineligible for recall (they'd be scored in the wrong space). They stay in the database but are silently ignored until you re-embed them.

Storage location

Defaults to ./coulisse-memory.db. Override with:

memory:
  backend:
    kind: sqlite
    path: /var/lib/coulisse/memory.db

For tests or one-shot demos, use kind: in_memory — everything evaporates on shutdown.

Docker

The bundled Dockerfile declares a VOLUME /var/lib/coulisse so data survives container restarts. Mount a named volume or a host directory there:

docker run \
  -v coulisse-data:/var/lib/coulisse \
  -v $(pwd)/coulisse.yaml:/etc/coulisse/coulisse.yaml:ro \
  -p 8421:8421 \
  coulisse

The container runs as a non-root coulisse user and expects the database path inside the volume, e.g. /var/lib/coulisse/memory.db.

See memory configuration for the full YAML schema.

MCP tool integration

Coulisse is a client for Model Context Protocol servers. Any MCP-compliant tool — a calculator, a filesystem browser, a REST API wrapper, your in-house data fetcher — becomes usable by any agent with a one-line config change.

End-to-end example

Imagine a small MCP server that exposes a say_hello tool. Register it and hand it to an agent:

providers:
  anthropic:
    api_key: sk-ant-...

mcp:
  hello:
    transport: stdio
    command: uvx
    args:
      - --from
      - git+https://github.com/macsymwang/hello-mcp-server.git
      - hello-mcp-server

agents:
  - name: greeter
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You greet people warmly.
    mcp_tools:
      - server: hello

Start the server. On boot you'll see Coulisse discover the server's tools and note them in the log.

Now the greeter agent can call say_hello whenever the model decides it's useful. Your client makes a normal chat completion request:

curl http://localhost:8421/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "greeter",
    "safety_identifier": "user-1",
    "messages": [
      {"role": "user", "content": "Please greet Alice."}
    ]
  }'

The model may call the tool one or more times; Coulisse runs the tool loop internally and returns only the final assistant message.

Under the hood, every invocation — tool name, arguments, result (or error) — is recorded against the assistant message that produced it, so you can replay the turn in the studio UI and see which tools fired and what came back. This is tool-call capture for debugging, not an extension of the OpenAI surface: the wire response your SDK receives is unchanged.

Transports

  • stdio — good for local MCP servers you spawn yourself (Python scripts, Node programs, CLI tools). Coulisse manages the child process.
  • http — good for long-running MCP services, especially ones shared across multiple Coulisse instances.

Both are configured the same way conceptually; see MCP tools for fields.

Scoping tools per agent

Different agents can see different subsets of tools, even from the same server:

agents:
  - name: power-user
    mcp_tools:
      - server: filesystem      # every tool the filesystem server offers

  - name: read-only
    mcp_tools:
      - server: filesystem
        only:
          - read_file
          - list_files          # write / delete tools aren't exposed

This is Coulisse-side filtering — the model never sees the excluded tools, so it can't call them.

Tool loop limits

Coulisse caps a single request at 8 tool-call turns. If the model hasn't produced a final answer by then, the request ends. This keeps runaway loops from billing you forever.

Capture limitations

Tool-call capture only runs on the streaming path — every OpenAI SDK uses streaming for chat completions by default, so this covers normal usage. Non-streaming requests ("stream": false) still execute tools correctly; their invocations just aren't captured for the studio trail, because rig's non-streaming API doesn't expose intermediate events.

If a client disconnects mid-stream after a tool call has fired but before the result lands, the call is persisted with result: null so the studio UI still shows that the attempt happened.

Multi-backend support

Coulisse speaks to six providers out of the box:

  • Anthropic
  • OpenAI
  • Gemini
  • Cohere
  • Deepseek
  • Groq

You can mix them freely in a single config.

Why mix backends?

  • Cost tiering. Run quick tasks on a cheap model (Groq, Haiku, gpt-4o-mini), hard tasks on a flagship.
  • Capability routing. Some tasks benefit from a specific provider's strengths — long-context summarization on Gemini, coding on Sonnet, reasoning on Opus.
  • Redundancy. If one provider has an outage, flip a single provider field to route through another.
  • Evaluation. A/B the same preamble on two different models without changing any client code.

One config, many backends

providers:
  anthropic:
    api_key: sk-ant-...
  openai:
    api_key: sk-...
  gemini:
    api_key: ...
  groq:
    api_key: ...

agents:
  - name: quick
    provider: groq
    model: llama-3.3-70b-versatile
    preamble: Answer briefly.

  - name: smart
    provider: anthropic
    model: claude-opus-4-7
    preamble: Think carefully.

  - name: long-context
    provider: gemini
    model: gemini-2.0-flash
    preamble: You excel at synthesizing long documents.

Your client picks one by name — everything else stays the same.

The client side is unchanged

Because Coulisse exposes an OpenAI-compatible API no matter which provider is behind an agent, your client code never has to know. You don't install the Anthropic SDK, Gemini SDK, and OpenAI SDK side by side — you just use the OpenAI SDK and change the model field.

Streaming responses

Coulisse implements OpenAI's Server-Sent Events (SSE) format for chat completions. Set stream: true in the request and the server emits incremental chat.completion.chunk frames over the wire — drop-in compatible with the OpenAI Python and JavaScript SDKs and any client that already speaks the OpenAI streaming protocol.

Asking for a stream

Add stream: true to a normal /v1/chat/completions request:

{
  "model": "assistant",
  "safety_identifier": "user-123",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": true
}

The response is text/event-stream instead of JSON. Each frame is one chat.completion.chunk.
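
With the OpenAI Python SDK, the same request is one extra flag. A sketch assuming the quick-start server:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8421/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="assistant",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    extra_body={"safety_identifier": "user-123"},
)

for chunk in stream:
    # the terminal frame has an empty delta, so guard before printing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)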

Wire format

The first frame announces the assistant role:

data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"role":"assistant"}}]}

Then one frame per text delta:

data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"content":" there"}}]}

A terminal frame sets finish_reason:

data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Including token usage

Set stream_options.include_usage: true to receive a usage field on the terminal chunk:

{
  "model": "assistant",
  "messages": [{"role": "user", "content": "Hi"}],
  "stream": true,
  "stream_options": {"include_usage": true}
}

The terminal frame then carries usage:

data: {"...":"...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"completion_tokens":3,"prompt_tokens":7,"total_tokens":10}}

When include_usage is missing or false, the field is omitted — matching OpenAI's contract.

Memory and rate limiting

Streaming responses use the same per-user memory bucket and rate-limit accounting as non-streaming requests:

  • The user's message and the assistant's reply are appended to memory after the stream ends.
  • Token usage is recorded against the rate-limit window when the stream ends.
  • If the client disconnects mid-stream, Coulisse persists the partial assistant reply (everything received before the disconnect). This matches what the user actually saw — the next turn won't claim the model said something the user never received.

Tool-using agents

Agents with MCP tools attached stream the same way. Tool-call internals run inside the rig multi-turn loop and are not surfaced to the client; you'll see a pause while a tool runs, then the model's text continues. The delta.content field is the only delta variant Coulisse currently emits.

Errors mid-stream

If the upstream provider fails after the stream has started, Coulisse emits one terminal frame containing an error field with the failure reason, then [DONE]. The HTTP status is already 200 by then — clients should check for the error field on the final chunk.

Rate limiting

Coulisse enforces per-user token limits across three rolling windows: hour, day, and month. Limits are set by the client, per request — not in the YAML — so callers can plug Coulisse into existing quota schemes without redeploying the server.

How it works

  1. Each request carries optional limit hints in its metadata field: tokens_per_hour, tokens_per_day, tokens_per_month.
  2. Before the model is called, Coulisse looks up the user's current usage in each window. If any counter is already at its cap, the request is rejected with 429 Too Many Requests.
  3. If the request passes, Coulisse runs it. On success, the total tokens consumed (request + response) are added to the user's counters.
  4. Counters reset on fixed boundaries: every hour, every 24 hours, every 30 days (aligned to UTC windows from the Unix epoch).

Sending limits

Put the caps in the metadata object. Values are strings (OpenAI's metadata contract), parsed as non-negative integers:

{
  "model": "assistant",
  "safety_identifier": "alice@example.com",
  "metadata": {
    "tokens_per_hour": "50000",
    "tokens_per_day": "500000",
    "tokens_per_month": "5000000"
  },
  "messages": [
    {"role": "user", "content": "Hi!"}
  ]
}

All three keys are independent and all are optional — send only the windows you care about. Omit the whole metadata object and the request is unlimited.

When a limit is hit

The server responds with:

  • Status: 429 Too Many Requests
  • Header: Retry-After: <seconds> — time until the offending window resets
  • Body:
{
  "error": {
    "type": "rate_limited",
    "message": "daily token limit exceeded: used 512000/500000, retry after 40213s"
  }
}

The message names which window tripped (hourly, daily, monthly), how many tokens were used, the cap, and the seconds to wait.
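
From the OpenAI Python SDK, the caps ride along in extra_body and a 429 surfaces as a RateLimitError. A sketch with arbitrary numbers, showing one way to honour Retry-After:

import time
import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8421/v1", api_key="not-needed")

try:
    reply = client.chat.completions.create(
        model="assistant",
        messages=[{"role": "user", "content": "Hi!"}],
        extra_body={
            "safety_identifier": "alice@example.com",
            "metadata": {"tokens_per_hour": "50000", "tokens_per_day": "500000"},
        },
    )
    print(reply.choices[0].message.content)
except openai.RateLimitError as err:                        # HTTP 429 from Coulisse
    wait = int(err.response.headers.get("Retry-After", "60"))
    time.sleep(wait)                                        # back off until the window resets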

Invalid metadata

If a metadata value isn't a valid non-negative integer, the server returns 400 Bad Request:

{
  "error": {
    "type": "invalid_request",
    "message": "metadata key 'tokens_per_hour' must be a non-negative integer, got 'abc'"
  }
}

Scope and isolation

  • Per user. Each user (keyed by safety_identifier or the fallback user field) has isolated counters.
  • Anonymous requests can't be rate-limited. Coulisse needs an identifier. In setups with a default_user_id (see User identification), all anonymous requests share that user's counter.
  • Per process. Counters live in memory. If you run multiple Coulisse instances behind a load balancer, each has its own view — for shared quotas, limit upstream (in a gateway) instead.
  • Lost on restart. Counters are not persisted. This is deliberate for now; durable accounting is on the roadmap.

Why per-request limits instead of YAML?

Quotas usually live in your user/billing system, not your model-routing config. Putting limits in the request lets the caller decide — e.g. your app looks up the user's plan, fills in the numbers, and forwards the request. Coulisse just honors what you send.

Token cost tracking

Coulisse converts each chat completion's token usage into a USD cost using a vendored snapshot of LiteLLM's model pricing table. The cost lands in the per-turn llm_call event alongside the raw token counts, so the studio UI shows it next to every model call.

There's nothing to enable. As long as a turn produces token usage and the model is in the table, you'll see a $0.0042-style badge on the corresponding llm_call row in the per-turn event tree.

How it's computed

For each completion Coulisse looks up the configured (provider, model) pair in the vendored table and multiplies:

  • input_tokens × input_cost_per_token
  • output_tokens × output_cost_per_token
  • cache_creation_input_tokens × cache_creation_input_token_cost (Anthropic prompt-cache writes)
  • cached_input_tokens × cache_read_input_token_cost (Anthropic prompt-cache reads)

Missing fields in the upstream table are treated as zero — fine for providers like Groq that don't price cache tokens. Models that don't appear in the table at all yield a null cost: the request still succeeds, the llm_call event still records the token usage, and the studio simply omits the cost badge.
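
The arithmetic itself is a straight multiply-and-sum. A worked sketch with made-up per-token prices — the real values come from the vendored model_prices.json:

# Hypothetical prices (USD per token); real numbers come from the pricing table.
input_cost_per_token = 3e-06
output_cost_per_token = 1.5e-05

input_tokens, output_tokens = 1_000, 200
cost = input_tokens * input_cost_per_token + output_tokens * output_cost_per_token
print(f"${cost:.4f}")   # -> $0.0060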

Refreshing the pricing table

The snapshot lives at crates/providers/data/model_prices.json and is checked into git. New models are added upstream regularly; refresh the snapshot with:

just refresh-prices

This downloads the latest version from LiteLLM's main branch and overwrites the local file. The diff lands in git like any other change so you can review what moved before committing.

There's no live fetching at runtime: cost lookup only ever reads from the vendored snapshot. That keeps the request path free of network dependencies and makes pricing updates an explicit, reviewable action.

What's not (yet) covered

  • EUR or other currencies. Cost is stored and displayed in USD only. If there's demand for a configurable display currency (telemetry.display_currency: { code: EUR, usd_rate: 0.92 }-style), it can be added without changing the on-disk format.
  • Cost-based rate limiting. Rate limits currently work on token counts. Cost is recorded but not yet enforced; a future usd_per_day: knob would consume the same data.
  • Per-tool / per-MCP cost. Tool calls have their own tool_call events but don't carry a cost themselves. Costs are charged to the parent llm_call event, which is the only place tokens are spent.
  • Custom or unlisted models. Self-hosted models or models that LiteLLM hasn't added yet won't have a price. There's no YAML override path today; if you need one, open an issue describing the use case.

Response language

Coulisse lets the caller pin the language the model replies in. Without it, the model infers language from the user's message — which can drift when the user switches languages mid-conversation or types short, ambiguous prompts. With it, every response comes back in the language you asked for.

Language is set per request, via the metadata object. The caller decides — Coulisse doesn't maintain a user-level language preference.

How to send it

Add a language key to metadata. The value is a BCP 47 tag (RFC 5646):

{
  "model": "assistant",
  "safety_identifier": "user-123",
  "metadata": {
    "language": "fr-FR"
  },
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}

Any valid BCP 47 tag works: en, fr, fr-FR, es-MX, zh-Hant, ja-JP. The tag is validated — malformed values come back as 400 Bad Request. Omit the key entirely to let the model pick.

How it reaches the model

Coulisse appends a short instruction to the system preamble before calling the provider — something like "Always reply in French, even when the user writes in a different language. Do not include translations in any other language." The instruction is phrased as a hard constraint so the model doesn't mirror the user's language or append a parenthetical translation. For tags in the built-in language-name table (common ISO 639-1 subtags: en, fr, es, de, it, pt, ja, zh, ko, ar, nl, pl, ru, sv, tr, hi), the instruction uses the English name. For anything else, the raw tag is passed through — frontier models understand BCP 47 directly, so cy (Welsh) works fine.

The instruction is added once per request, as the first system message. Your own system messages in the messages array still apply, and agent preambles from coulisse.yaml are preserved.

Real-world example: country code to language

A common pattern is to derive the language from the caller's locale on your side — phone country code, IP-based geolocation, browser Accept-Language, a user profile setting — and forward the resulting tag:

{
  "model": "assistant",
  "safety_identifier": "+33612345678",
  "metadata": {
    "language": "fr-FR"
  },
  "messages": [
    {"role": "user", "content": "What's the weather?"}
  ]
}

Coulisse doesn't do the mapping itself. It takes the tag you send and asks the model to respond in that language. That keeps the metadata format stable and the country-code-to-language table (which changes slowly but does change) out of server code.
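
A sketch of that client-side mapping; the table and helper name below are illustrative and not part of Coulisse:

# Illustrative mapping from phone country code to a default BCP 47 tag.
COUNTRY_TO_LANGUAGE = {"+33": "fr-FR", "+49": "de-DE", "+34": "es-ES", "+1": "en-US"}

def language_for(phone: str, default: str = "en") -> str:
    # Longest-prefix match, so "+1" doesn't shadow longer codes.
    for prefix, tag in sorted(COUNTRY_TO_LANGUAGE.items(), key=lambda kv: -len(kv[0])):
        if phone.startswith(prefix):
            return tag
    return default

payload = {
    "model": "assistant",
    "safety_identifier": "+33612345678",
    "metadata": {"language": language_for("+33612345678")},
    "messages": [{"role": "user", "content": "What's the weather?"}],
}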

Errors

A malformed tag returns 400 Bad Request:

{
  "error": {
    "type": "invalid_request",
    "message": "invalid `metadata.language`: invalid language tag: ..."
  }
}

Empty-string and whitespace-only values are rejected the same way.

LLM-as-judge evaluation

Coulisse can score every agent reply with a separate LLM — a judge — and persist the result so you can track quality over time. You describe what to evaluate in the YAML rubric; Coulisse handles scoring shape, format, sampling, and storage.

This is useful for watching agent drift, comparing model/preamble changes, and catching regressions without standing up a separate evaluation pipeline.

How it works

  1. A client sends a chat request. The agent replies as usual — the judge never blocks the response.
  2. After the reply is persisted, Coulisse runs each judge the agent opted in to, in a background task.
  3. Each judge samples according to its sampling_rate (skipping the turn entirely if the draw misses), then asks its backing model to score the assistant's reply against every rubric at once.
  4. The response is parsed into one score row per rubric — persisted under the same user id as the conversation.
  5. Failures (bad JSON, provider error, timeout) are logged at warn and swallowed — the user already got their answer.

Scores are stored in the same SQLite database as messages and memories, in a scores table keyed by message_id. Averages are computed at read time, not aggregated on write.
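
As an illustration of "averages computed at read time", a query along these lines could run against the scores table. The column names (judge, criterion, value) are assumptions for the sketch, not the actual schema; the database path is the default memory location.

import sqlite3

# Column names are illustrative; inspect the actual schema before relying on them.
conn = sqlite3.connect("coulisse-memory.db")
rows = conn.execute(
    """
    SELECT judge, criterion, AVG(value) AS mean, COUNT(*) AS n
    FROM scores
    GROUP BY judge, criterion
    """
).fetchall()
for judge, criterion, mean, n in rows:
    print(f"{judge}.{criterion}: {mean:.1f} (n={n})")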

YAML

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: You are a helpful assistant.
    judges: [quality]              # opt in by name

  - name: translator
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: Translate into French.
    judges: [fluency]

judges:
  # Cheap, broad check — 100% of turns, small model.
  - name: quality
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 1.0
    rubrics:
      accuracy:     Factual accuracy. Flag hallucinations.
      helpfulness:  Whether the assistant answered the user's question.
      tone:         Politeness and tone.

  # Targeted check for the translator — only 20% of turns.
  - name: fluency
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 0.2
    rubrics:
      grammar:      Grammatical correctness of the French output.
      naturalness:  How native the phrasing sounds.

The wiring is visible from the agent: when you read an agent block you see which judges score it, rather than having to hunt through the judge list to figure out coverage.

Rubrics

A rubric is a map from criterion name to a short description of what to assess.

rubrics:
  accuracy:    Factual accuracy. Flag hallucinations.
  helpfulness: Whether the assistant answered the user's question.

Keep descriptions terse and assessable. Don't write scale, format, or JSON instructions into them — Coulisse adds those internally. The description should tell the judge what matters, not how to answer.

Each criterion produces one Score row per scored turn, with its own numeric value and short reasoning. All criteria for one judge are evaluated in a single LLM call, so adding criteria to a judge doesn't multiply cost.

Scoring shape

Every score is an integer in 0..=10 with a one-sentence reasoning. Coulisse forces this shape through the preamble and parses the judge's JSON reply — you don't configure it.

If you need a different scale (e.g. boolean pass/fail, categorical), that will arrive as a future scale: field; the default stays numeric 0-10.

Sampling

sampling_rate controls what fraction of turns are scored.

| Value | Meaning |
|---|---|
| 1.0 (default) | Score every turn. |
| 0.1 | Roughly 10% of turns. |
| 0.0 | Never score (useful to park a judge without deleting it). |

The draw is independent per turn, per judge. Over many turns the scored fraction converges on the configured rate. Lower rates save tokens for expensive judges; broad cheap judges can run at 1.0.

Choosing a judge model

Pick a model that's different from the agent being scored whenever you can. A judge scoring its own output is biased — a cheap cross-provider judge (e.g. gpt-4o-mini judging a Claude agent, or vice versa) is usually closer to neutral.

Strong, slow models make sense for low-volume deep checks (sampling_rate: 0.1). Cheap, fast models make sense for high-volume broad checks (sampling_rate: 1.0).

Multiple judges per agent

Stack judges to get different dimensions at different cost points:

agents:
  - name: assistant
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    judges: [broad_check, deep_audit]

judges:
  - name: broad_check
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 1.0
    rubrics:
      helpfulness: Whether the user's question was answered.
      tone:        Politeness and tone.

  - name: deep_audit
    provider: anthropic
    model: claude-opus-4
    sampling_rate: 0.05             # 5% of turns, expensive
    rubrics:
      accuracy:    Factual accuracy, including references and claims.
      safety:      Harmful, biased, or unsafe content.

Each judge is independent — its own model, rate, and rubric set. A turn can end up with zero, one, or both of these judges scoring it, depending on the sampling draw.

Viewing scores

The studio UI at /admin/ now shows a Scores panel per user. It surfaces two things:

  • Averages — mean score per (judge, criterion) across every turn the user has had, with sample count.
  • Recent — the most recent individual scores with reasoning.

Validation at startup

Coulisse fails fast on:

  • A judge referencing a provider that's not declared under providers:.
  • A judge with no rubrics.
  • A sampling_rate outside [0.0, 1.0].
  • An agent referencing a judge name that doesn't exist.

Any violation aborts startup with a message naming the offending judge or agent.

Cost control

Two knobs matter:

  1. sampling_rate — the easy one. Halve it, halve the judge bill.
  2. Judge model — the big one. A gpt-4o-mini judge at 100% sampling often costs less than a gpt-4o judge at 10%. Pick the cheapest model that gives you a stable signal.

A useful pattern is to run a cheap judge at 100% and a strong judge at a small fraction — the cheap one catches the broad signal, the strong one spot-checks the hardest cases.

Experiments (A/B testing)

Run multiple agent configurations under a single addressable name and let Coulisse pick which one serves each request. Useful for comparing models, preambles, or tool sets without changing client code.

How it works

  1. Define each candidate as a normal agent under agents:.
  2. Declare an experiment whose name is what clients send as model.
  3. List the candidate agents as variants and choose a strategy.

When a request arrives, the router resolves the experiment name to one variant (and optionally fires off shadow runs in the background). The variant choice is sticky-by-user by default, so the same user always lands on the same variant for a given experiment — conversation memory and persona stay consistent across turns.

Strategies

Three strategies are wired today: split, shadow, and bandit.

split

Weighted random sampling. Sticky by user when sticky_by_user: true (the default) — the variant is a deterministic hash of (user_id, experiment_name) modulo the cumulative weights, with no database writes. Adding or removing a variant reshuffles users.

agents:
  - name: assistant-sonnet
    provider: anthropic
    model: claude-sonnet-4-5-20250929
  - name: assistant-gpt
    provider: openai
    model: gpt-4o

experiments:
  - name: assistant            # what clients send as model
    strategy: split
    variants:
      - agent: assistant-sonnet
        weight: 0.5
      - agent: assistant-gpt
        weight: 0.5
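
The sticky assignment above boils down to a deterministic weighted pick. A minimal sketch follows; the exact hash Coulisse uses is an implementation detail, so treat the function as illustrative:

import hashlib

def pick_variant(user_id: str, experiment: str, variants: list[tuple[str, float]]) -> str:
    """Deterministic weighted pick: the same (user, experiment) always maps to the same variant."""
    digest = hashlib.sha256(f"{user_id}:{experiment}".encode()).digest()
    total = sum(weight for _, weight in variants)
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for agent, weight in variants:
        cumulative += weight
        if point < cumulative:
            return agent
    return variants[-1][0]  # floating-point edge case

# Same call, same answer, every time; no database writes needed.
pick_variant("alice", "assistant", [("assistant-sonnet", 0.5), ("assistant-gpt", 0.5)])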

shadow

Designate one variant as primary; it serves the user normally. The other variants run in the background against the same prepared context, are scored by their judges, and never write to the user's message history. The user never waits on shadow variants.

sampling_rate (default 1.0) controls how often shadow runs fire — set it lower to cap cost.

experiments:
  - name: assistant
    strategy: shadow
    primary: assistant-sonnet
    sampling_rate: 0.25       # 25% of turns also run the shadows
    variants:
      - agent: assistant-sonnet
      - agent: assistant-gpt

Use shadow to collect comparison data before flipping a split rollout — the primary serves all real traffic while you build up scoring evidence on the challenger.

bandit

Epsilon-greedy multi-armed bandit. Reads recent mean scores per variant from the existing scores table, picks the leader most of the time (1 - epsilon), and explores a random arm otherwise. Arms with fewer than min_samples recent scores are forced — the bandit only exploits once every arm has enough evidence.

agents:
  - name: assistant-sonnet
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    judges: [quality]
  - name: assistant-gpt
    provider: openai
    model: gpt-4o
    judges: [quality]

judges:
  - name: quality
    provider: openai
    model: gpt-4o-mini
    rubrics:
      helpfulness: Whether the assistant answered the user's question.

experiments:
  - name: assistant
    strategy: bandit
    metric: quality.helpfulness     # judge.criterion
    epsilon: 0.1
    min_samples: 30
    bandit_window_seconds: 604800   # 7 days
    variants:
      - agent: assistant-sonnet
      - agent: assistant-gpt

The configured judge (quality) and the criterion (helpfulness) must be declared on every variant agent — otherwise the bandit starves on that arm. Validation enforces this at startup.

A note on stickiness: with sticky_by_user: true (the default), the bandit decision is computed at request time via a deterministic hash of (user_id, experiment_name), so a given user typically lands on the same arm. Mean scores update as new data arrives, so a user can shift if a different arm overtakes the leader — that is the trade-off for keeping the assignment stateless.
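
For intuition, epsilon-greedy selection reduces to a few lines. This is a sketch only, not Coulisse's code; the min-sample forcing and score window behave as described above:

import random

def pick_arm(mean_scores: dict[str, float], counts: dict[str, int],
             epsilon: float = 0.1, min_samples: int = 30) -> str:
    arms = list(mean_scores)
    # Force exploration of any arm that hasn't accumulated enough recent scores.
    starved = [a for a in arms if counts.get(a, 0) < min_samples]
    if starved:
        return random.choice(starved)
    if random.random() < epsilon:
        return random.choice(arms)           # explore a random arm
    return max(arms, key=mean_scores.get)    # exploit the current leader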

Namespace and migration

Experiment names share a namespace with agent names. To A/B-test an existing agent without breaking clients:

  1. Rename the agent (assistant → assistant-v1).
  2. Add a sibling agent (assistant-v2).
  3. Add an experiment named assistant with both as variants.

Clients keep sending model: assistant and it resolves transparently.

Variants stay individually addressable as agents under their own names (assistant-v1, assistant-v2) — useful for isolating one variant in tests or debugging.

Subagents

A subagent reference can name an agent or an experiment. If orchestrator lists subagents: [assistant] and assistant is an experiment, every subagent call resolves to a variant for the calling user, the same way a top-level request would. Sticky-by-user keeps the variant consistent across the whole conversation.

Give the experiment a purpose if it's exposed as a subagent — the purpose becomes the tool description the calling agent's LLM sees:

experiments:
  - name: assistant
    purpose: A general-purpose chat assistant.
    strategy: split
    variants:
      - agent: assistant-sonnet
      - agent: assistant-gpt

Bandit subagents read mean scores at call time, so the same exploit/explore behaviour applies inside subagent dispatch.

Telemetry

Each turn's TurnStart event includes agent (the resolved variant), and when an experiment was hit, experiment (the experiment name) and variant (same as agent). Judge scores are tagged with the variant's agent name in the database, so per-variant aggregation flows through the same table without a join — used by the bandit's mean-score query and the studio's per-variant view.

Studio

The studio shows configured experiments at /admin/experiments: strategy, sticky-by-user flag, and per-variant weight + share. For bandit experiments, the page additionally shows the configured metric, epsilon, and min-samples threshold, plus per-variant sample counts and mean scores (loaded inline via htmx from the judges admin endpoints). Shadow experiments call out the primary variant.

Validation

Coulisse rejects the following at startup:

  • Experiment name colliding with an agent name (rename one).
  • Experiment name colliding with another experiment.
  • Experiment with zero variants.
  • Variant referencing an undefined agent.
  • Variant weight <= 0.
  • Duplicate variant agent within one experiment.
  • Strategy-specific fields used with the wrong strategy (e.g. primary on a split experiment).
  • shadow without a primary, or with a primary that's not one of the variants.
  • shadow sampling_rate outside [0.0, 1.0].
  • bandit without a metric.
  • bandit metric that doesn't match an existing judge.criterion, or a variant that doesn't opt into the metric's judge.
  • bandit epsilon outside [0.0, 1.0].

Smoke tests

A smoke test is a synthetic-user persona that drives a conversation against one of your agents (or experiments). Coulisse plays the user — you write a preamble describing who they are and what they want — and the assistant replies for real. Every assistant turn flows through the same judge pipeline as production traffic, so you get a transcript and scores back without writing any harness code.

Smoke tests are most useful when you're iterating on a prompt: tweak the preamble, hit "Run now" in the studio, watch the scores. Pair them with experiments and a single click exercises every variant: sticky-by-user routing samples variants across repetitions, and the judge scores feed straight into bandit selection.

How it works

  1. You trigger a run from the studio (/admin/smoke/<name>) — no client needed.
  2. Coulisse opens a fresh synthetic user id and starts a loop:
    • The persona model produces a "user" message — given the conversation so far with roles flipped (so the model speaks as the user).
    • The target agent replies as it normally would, with all its real MCP tools, subagents, and preambles.
    • The reply is fanned out to every judge the target agent opts into. Scores land in the same scores table as production runs, keyed by the assistant turn's id.
  3. The loop stops when either side emits the configured stop_marker, or when max_turns is hit.
  4. The full transcript is browsable at /admin/smoke/runs/<run_id> — assistant in slate, persona in amber.

Smoke runs never write to the user's memory or rate-limit windows. Each repetition uses a brand-new synthetic user id, so split/bandit experiments naturally sample variants across reps.

YAML

smoke_tests:
  - name: jobseeker_basic
    target: tremplin                 # agent or experiment name
    persona:
      provider: anthropic
      model: claude-haiku-4-5-20251001
      preamble: |
        You are role-playing a 28-year-old looking for a developer job in Paris.
        Reply like a real human: short questions, follow-ups as the conversation goes.
        When you have a satisfactory answer, finish with "[FIN]".
    initial_message: "Hi, I'm looking for work."
    stop_marker: "[FIN]"
    max_turns: 10
    repetitions: 5

| Field | Required | Default | Notes |
|---|---|---|---|
| name | yes | | Unique within smoke_tests. Shows up at /admin/smoke/<name>. |
| target | yes | | Agent name or experiment name. Resolved through the experiment router per run. |
| persona | yes | | Provider, model, and preamble for the synthetic user. |
| initial_message | no | | Hard-coded first message from the persona. Skipping this lets the persona open the conversation. |
| stop_marker | no | | Substring that ends the run when emitted by either side. |
| max_turns | no | 10 | Cap on persona-then-agent pairs. |
| repetitions | no | 1 | Independent runs launched per "Run now" click. Each gets a fresh synthetic user id. |

Iterating with experiments

Define two variants of an agent (e.g. assistant-v1, assistant-v2), wrap them in a bandit experiment, and target the experiment name from a smoke test:

experiments:
  - name: assistant
    strategy: bandit
    metric: quality.helpfulness
    variants:
      - agent: assistant-v1
      - agent: assistant-v2

smoke_tests:
  - name: convergence
    target: assistant
    repetitions: 50
    persona: { provider: openai, model: gpt-4o-mini, preamble: "..." }

Hit "Run now" once and the bandit accumulates 50 samples per variant per turn pair. The experiment page picks the winner on its own.

Limitations (today)

  • Smoke runs bypass the memory pipeline. Fact extraction and semantic recall are not exercised.
  • No scheduled runs — trigger is manual via the studio.
  • No tool-call assertions; assertions about what the agent did during a turn live in the judge rubrics.

Telemetry

Coulisse emits its observability data via the tracing crate. Every request opens a turn span; every tool invocation (MCP or subagent) opens a child tool_call span. The configured layers — fmt, SQLite, and optionally OTLP — receive those spans and route them wherever you've pointed them.

The result: the studio UI gives you an offline audit trail, and any OpenTelemetry-compatible backend (Grafana, SigNoz, Jaeger, Honeycomb, ...) gives you live traces. They're driven from the same source — there's no separate path.

Span model

| Span name | Opened when | Fields |
|---|---|---|
| turn | a chat completion request arrives | agent, experiment (when applicable), turn_id, user_id, user_message |
| tool_call | an MCP or subagent tool fires | args, error (on failure), kind (mcp/subagent), result, tool_name |
| llm_call | a chat completion finishes (token usage is known) | cost_usd (when the model is in the pricing table), model, provider, usage |

turn is the root; tool_call and llm_call nest under it via the tracing span tree, so OTLP backends render them as a trace tree out of the box.

Studio integration

When telemetry.sqlite.enabled is true (the default), the studio's per-turn event tree and tool-call panel render directly from the same spans. Nothing extra to wire up — open /admin/ and the tree is there.

OTLP backends

Set telemetry.otlp.endpoint to start exporting. The exporter batches spans, retries on transient failures, and shuts down cleanly on process exit so in-flight spans land before the server stops.

Tested with:

  • Grafana (Tempo / Cloud) — gRPC at 4317.
  • SigNoz (self-hosted or Cloud) — gRPC; for Cloud add a signoz-access-token header.
  • Jaeger — gRPC at 4317 (Jaeger ≥ 1.50 speaks OTLP natively).
  • Honeycomb — HTTP/protobuf at https://api.honeycomb.io/v1/traces with x-honeycomb-team header.
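
For example, a Honeycomb setup could look like the block below, using the telemetry fields from the YAML reference; the HONEYCOMB_API_KEY variable name is a placeholder:

telemetry:
  otlp:
    endpoint: "https://api.honeycomb.io/v1/traces"
    protocol: http_binary
    service_name: coulisse
    headers:
      x-honeycomb-team: "${HONEYCOMB_API_KEY}"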

Tuning verbosity

The fmt layer (stderr logs) is controlled by RUST_LOG:

RUST_LOG=info,sqlx=warn coulisse        # default
RUST_LOG=debug coulisse                 # verbose, including SQL driver
RUST_LOG=warn coulisse                  # quiet
RUST_LOG=coulisse=debug,agents=trace coulisse   # per-crate filtering

The SQLite and OTLP layers are not affected by RUST_LOG — they capture every turn / tool_call / llm_call span regardless of log level.

Disabling layers

Each layer has its own enabled flag. Common combinations:

# Production with external observability stack
telemetry:
  sqlite:
    enabled: false      # studio not exposed; no need to keep DB rows
  otlp:
    endpoint: "..."

# Local development, no external backend
telemetry:
  # default fmt + sqlite

# CI / load tests — minimize logging overhead
telemetry:
  fmt:
    enabled: false
  sqlite:
    enabled: false

CLI reference

Coulisse ships as a single binary with a handful of subcommands. Every subcommand accepts -c, --config <PATH> (default coulisse.yaml) and honors the COULISSE_CONFIG env var as a fallback.

State files (coulisse.pid, coulisse.log) live in a .coulisse/ directory next to the config file — this keeps state co-located with the project and makes cd && coulisse stop "just work."

coulisse init

Write a starter coulisse.yaml in the current directory.

coulisse init                 # minimal template (one OpenAI agent + sqlite memory)
coulisse init --from-example  # full annotated example (every section, every option)
coulisse init --force         # overwrite an existing coulisse.yaml

coulisse start

Start the server, detached by default. Returns once the server has written its PID file, or fails if boot doesn't complete within 5 seconds.

coulisse start                # detached background server
coulisse start --foreground   # attached: logs stream to the terminal
coulisse start -F             # short form

A bare coulisse invocation is equivalent to coulisse start --foreground — the historical pre-subcommand behavior is preserved.

When detached, stdout/stderr are appended to .coulisse/coulisse.log.

coulisse stop

Send SIGTERM to a running detached server (PID read from .coulisse/coulisse.pid).

coulisse stop          # graceful: SIGTERM, wait up to 10s
coulisse stop --force  # SIGKILL (use if the server is wedged)

Stop is a no-op if the server isn't running — stale PID files left over from crashes are detected and removed.

coulisse restart

Equivalent to coulisse stop && coulisse start.

coulisse status

Report whether the detached server is running and where its files live.

running (pid 31427)
  config: ./coulisse.yaml
  log:    ./.coulisse/coulisse.log

coulisse check

Load and validate the YAML without starting the server. Catches schema errors and cross-reference issues (agent → provider, agent → judge, experiment variant → agent, ...) before a real start.

coulisse check
# ok — coulisse.yaml (3 agents, 1 judges, 0 experiments, 2 providers)

coulisse update

Fetch the latest release from GitHub and replace the running binary in place. Detects the host target triple (e.g. aarch64-apple-darwin) and downloads the matching cargo-dist artifact. No-op if you're already on the latest version.

coulisse update
# checking for updates...
# updated to 0.2.0

The binary needs write permission to its own path — if you installed under /usr/local/bin you may need sudo.

State directory layout

your-project/
├── coulisse.yaml
└── .coulisse/
    ├── coulisse.pid     # written by `start`, removed on clean exit
    ├── coulisse.log     # detached stdout/stderr
    └── memory.db        # if you point memory.backend.path here

.coulisse/ is the recommended target for memory.backend.path so the whole runtime footprint of one project sits under a single directory.

HTTP API

Coulisse listens on 0.0.0.0:8421 and exposes an OpenAI-compatible surface.

POST /v1/chat/completions

The main chat endpoint. Accepts the standard OpenAI chat completion request shape.

Request

{
  "model": "assistant",
  "safety_identifier": "user-123",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}

| Field | Required | Notes |
|---|---|---|
| messages | yes | The usual OpenAI message array. At least one user message is required. |
| metadata | no | Optional map of strings. Used for per-request rate limits — see below. |
| model | yes | Name of an agent from your config. |
| safety_identifier | yes¹ | Identifies the user. Can be any stable string. |
| stream | no | When true, the response is an SSE stream of chat.completion.chunk frames. See Streaming responses. |
| stream_options | no | Object. include_usage: true adds the usage field to the terminal stream chunk. |
| user | | Deprecated OpenAI field; accepted as a fallback. |

¹ Required unless a default_user_id is set in coulisse.yaml — see User identification.
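
Assuming the default listen address, the same request as curl:

curl http://localhost:8421/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "assistant",
        "safety_identifier": "user-123",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'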

Recognized metadata keys

metadata is a passthrough map of strings. Coulisse interprets the following keys; any other keys are ignored.

| Key | Type | Meaning |
|---|---|---|
| language | BCP 47 tag | Forces the response language, e.g. fr-FR. See Response language. |
| tokens_per_day | integer (as string) | Max tokens per day window. |
| tokens_per_hour | integer (as string) | Max tokens per hour window. |
| tokens_per_month | integer (as string) | Max tokens per 30-day window. |

All optional. See Rate limiting for the token-limit behavior.

Response

Standard OpenAI chat completion response:

{
  "id": "...",
  "object": "chat.completion",
  "created": 1714000000,
  "model": "assistant",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hi!"},
      "finish_reason": "stop"
    }
  ]
}

Streaming

Set stream: true to receive chat.completion.chunk frames over Server-Sent Events instead of one JSON response. The full wire format and disconnect semantics live in Streaming responses.

Errors

Errors come back in OpenAI's error shape:

{
  "error": {
    "type": "invalid_request_error",
    "message": "safety_identifier is required",
    "code": null
  }
}

Common cases:

  • 400 — missing safety_identifier (when required), no user message, unknown agent name, unparseable metadata values.
  • 429 — per-user token limit exceeded. Includes a Retry-After header with seconds until the window resets. See Rate limiting.
  • 5xx — upstream provider error, MCP server failure.

GET /v1/models

Lists every agent defined in the config.

Response

{
  "object": "list",
  "data": [
    {"id": "assistant", "object": "model", "owned_by": "coulisse"},
    {"id": "code-reviewer", "object": "model", "owned_by": "coulisse"}
  ]
}

Useful for UI dropdowns that want to populate a model picker from the server.
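
Assuming the default listen address (and jq for readability):

curl -s http://localhost:8421/v1/models | jq -r '.data[].id'
# assistant
# code-reviewer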

Admin / config endpoints

Everything under /admin/* is a single content-negotiated surface. The same routes serve HTML pages to browsers, HTML fragments to htmx, and JSON to scripts — set Accept: application/json (or send an HX-Request header) to switch representation. Request bodies are equally tolerant: application/json, application/yaml, and application/x-www-form-urlencoded all deserialize into the same target type.

All admin routes are gated by the auth.admin scope.

Agents

| Method | Path | Body | Notes |
|---|---|---|---|
| GET | /admin/agents | | List configured agents (HTML or JSON). |
| POST | /admin/agents | AgentConfig | Create a new agent. 409 if the name is taken. |
| GET | /admin/agents/{name} | | Detail (HTML or JSON). |
| PUT | /admin/agents/{name} | AgentConfig | Replace the named agent. Body name must match URL. |
| DELETE | /admin/agents/{name} | | Remove the named agent. |
| GET | /admin/agents/new | | HTML form for a new agent. |
| GET | /admin/agents/{name}/edit | | HTML edit form. |

AgentConfig is the same shape used in coulisse.yaml: name, provider, model, preamble, purpose (optional), judges (list, optional), subagents (list, optional), mcp_tools (list, optional).
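
Creating an agent from a script could look like the example below; the agent name and model are illustrative, and -u supplies Basic credentials only if auth.admin.basic is configured:

curl -u admin:$ADMIN_PASSWORD \
  -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -d '{
        "name": "summarizer",
        "provider": "openai",
        "model": "gpt-4o-mini",
        "preamble": "Summarize the user input in three bullet points."
      }' \
  http://localhost:8421/admin/agents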

Judges, experiments, providers, MCP servers

Same CRUD shape as agents — list / create / one / update / delete. Adjust the path to suit:

| Path | Body | Notes |
|---|---|---|
| /admin/judges + /admin/judges/{name} | JudgeConfig | LLM-as-judge evaluators. |
| /admin/experiments + /admin/experiments/{name} | ExperimentConfig | A/B routing groups. The runtime ExperimentRouter rebuilds on restart; admin display reflects the file in real time. |
| /admin/providers + /admin/providers/{kind} | ProviderConfig (just api_key); POST body adds kind | Where {kind} is one of anthropic, cohere, deepseek, gemini, groq, openai. The runtime client is built at boot — restart to swap. |
| /admin/mcp + /admin/mcp/{name} | McpServerConfig (transport: stdio + command/args/env, or transport: http + url); POST body adds name | Connections open at boot — restart to attach a new server. |

Whole-file config

| Method | Path | Body | Notes |
|---|---|---|---|
| GET | /admin/config | | Returns the file contents (application/yaml by default, JSON when Accept: application/json). |
| PUT | /admin/config | full YAML/JSON | Replaces coulisse.yaml atomically. Validation runs before any disk write. |
| GET | /admin/openapi.json | | OpenAPI 3.1 description of every admin route, including request/response schemas. Feed it to openapi-generator or any client codegen for typed SDKs. |

Validation, hot reload, the file watcher

Every write — admin form save, JSON PUT, hand-edit in $EDITOR — flows through the same pipeline:

  1. The body is merged into the on-disk YAML (preserving sections this binary doesn't recognize).
  2. The full result is deserialized into a Config and run through cross-feature validation (provider references, judge references, experiment variants, …).
  3. Only on success does anything touch disk: a temp file is written and renamed atomically.
  4. The file watcher fires, the new config is reloaded, and feature crates' hot-reloadable state (agent list, judges list, experiments list, settings view) atomically swaps in.

Errors return the validator's message verbatim with a 422 Unprocessable Entity (or 400 for malformed bodies). The on-disk file is unchanged when validation rejects a write.

The studio UI is just one client of these endpoints — see Studio UI for what the rendered surface offers and authentication options.

Auth

By default Coulisse leaves /v1/* open. Configure the auth.proxy scope in YAML to require Basic credentials or OIDC for SDK clients; configure auth.admin to gate the studio. See Studio UI for the schema. Anything you don't gate is your responsibility to terminate at the infrastructure layer (reverse proxy, API gateway, VPN).

YAML schema

A complete reference for every field in coulisse.yaml.

Top-level

agents: [ ... ]               # required, non-empty
auth: { ... }                 # optional; per-scope auth for /v1/* and /admin/*
default_user_id: <string>     # optional, unset by default
experiments: [ ... ]          # optional; A/B test groups over agents
judges: [ ... ]               # optional; empty/omitted = no evaluation
mcp: { ... }                  # optional
memory: { ... }               # optional; defaults to sqlite + hash embedder
providers: { ... }            # required
smoke_tests: [ ... ]          # optional; synthetic-user evaluation runs
telemetry: { ... }            # optional; fmt + sqlite on by default, OTLP opt-in

auth

  • Type: object
  • Optional. Omit to leave both surfaces unauthenticated (fine for local dev, never for anything exposed beyond loopback).

Two independent scopes:

  • auth.proxy guards the OpenAI-compatible /v1/* surface that SDK clients call.
  • auth.admin guards the /admin/* surface (the studio UI).

Each scope is itself optional and accepts the same shape: exactly one of basic or oidc when present. They are mutually exclusive within a scope — the server rejects a scope block that has both or neither. The two scopes are independent, so you can enable Basic on one and OIDC on the other.

auth.<scope>.basic

Static HTTP Basic credentials. Best for local dev or a single-operator deployment.

| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| password | string | yes | | Non-empty. Rotate if suspected leaked — there's no token revocation. |
| username | string | no | admin | Non-empty when set. |

auth:
  admin:
    basic:
      password: choose-something-strong
      username: admin

auth.<scope>.oidc

Authorization-code-with-PKCE login against an OIDC-compliant IdP (Authentik, Keycloak, Auth0, Google, etc.). Access control is delegated to the IdP's application policy — Coulisse accepts any successfully authenticated user.

| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| client_id | string | yes | | Must match the client registered at the IdP. |
| client_secret | string | no | | Required for confidential clients (Authentik's default); omit for public clients using PKCE only. |
| issuer_url | string | yes | | IdP issuer. For Authentik: https://<host>/application/o/<app-slug>/. |
| redirect_url | string | yes | | Public base URL inside the protected scope. Must be registered as the redirect URI at the IdP. axum-oidc allows every subpath of this URL as a valid redirect. |
| scopes | list<string> | no | [email, profile] | Extra OAuth2 scopes. openid is added automatically. |

auth:
  admin:
    oidc:
      issuer_url:    https://authentik.example.com/application/o/coulisse/
      client_id:     coulisse-admin
      client_secret: <secret>
      redirect_url:  http://localhost:8421/admin/

default_user_id

  • Type: string
  • Default: unset
  • Purpose: fallback identifier for requests that don't supply safety_identifier (or the deprecated user).

Leave it unset for multi-tenant deployments — unidentified requests will be rejected. Set it to something like "main" for local or single-user setups so memory still works whether or not the client bothers to send an id. See User identification.

providers

  • Type: map of provider_kind → provider_config
  • Required. At least one provider must be declared.

Supported keys

anthropic, cohere, deepseek, gemini, groq, openai.

Per-provider fields

| Field | Type | Required | Notes |
|---|---|---|---|
| api_key | string | yes | Provider API key. |

providers:
  anthropic:
    api_key: sk-ant-...
  openai:
    api_key: sk-...

mcp

  • Type: map of server_name → server_config
  • Optional. Omit if you don't use tools.

Server names are arbitrary — they're what agents refer to under mcp_tools.

Common fields

| Field | Type | Required | Notes |
|---|---|---|---|
| transport | enum | yes | stdio or http. |

transport: stdio

| Field | Type | Required | Notes |
|---|---|---|---|
| command | string | yes | Executable to run. |
| args | list<str> | no | Command-line arguments. |
| env | map<str,str> | no | Environment variables for the child. |

transport: http

| Field | Type | Required | Notes |
|---|---|---|---|
| url | string | yes | Streamable-HTTP MCP endpoint. |

Examples

mcp:
  hello:
    transport: stdio
    command: uvx
    args: [--from, git+https://..., hello-mcp-server]

  calculator:
    transport: http
    url: http://localhost:8080

memory

  • Type: object
  • Optional. Omit for defaults (sqlite at ./coulisse-memory.db, offline hash embedder, no auto-extraction).

See Memory configuration for the full walkthrough and examples.

Sub-fields

| Field | Type | Required | Default |
|---|---|---|---|
| backend.kind | enum | no | sqlite |
| backend.path | string | no | ./coulisse-memory.db |
| embedder.provider | enum | no | hash |
| embedder.model | string | depends | required for openai/voyage |
| embedder.api_key | string | no | falls back to providers.<provider> |
| embedder.dims | int | no | 32 (hash only) |
| extractor.provider | string | yes* | — (* required when extractor is set) |
| extractor.model | string | yes* | |
| extractor.dedup_threshold | float | no | 0.9 |
| extractor.max_facts_per_turn | int | no | 5 |
| context_budget | int | no | 8000 |
| memory_budget_fraction | float | no | 0.1 |
| recall_k | int | no | 5 |

agents

  • Type: list of agent configs
  • Required. At least one agent must be defined.

Per-agent fields

| Field | Type | Required | Notes |
|---|---|---|---|
| name | string | yes | Unique agent identifier; clients pass this as model. |
| provider | string | yes | Key under providers. |
| model | string | yes | Upstream model identifier. |
| preamble | string | no | System prompt. Default: empty. |
| judges | list<string> | no | Names of judges (from top-level judges:) that evaluate this agent's replies. Empty = no evaluation. |
| mcp_tools | list<mcp_tool_access> | no | Tools this agent may use. |
| purpose | string | no | Tool description when this agent is exposed via another agent's subagents. Omit for standalone agents; add a concrete one-line description when this agent is meant to be called as a specialist. |
| subagents | list<string> | no | Names of other agents exposed as callable tools. Each entry must refer to another entry under agents. Self-reference and duplicates are rejected at startup. |

mcp_tools entry

| Field | Type | Required | Notes |
|---|---|---|---|
| server | string | yes | Key under mcp. |
| only | list<str> | no | Allowed tool names. Omit for full access. |

Complete agent example

agents:
  - name: code-reviewer
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    preamble: |
      You are a thorough code reviewer.
    mcp_tools:
      - server: filesystem
        only:
          - read_file
      - server: hello

Subagent example

agents:
  - name: resume_critic
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    purpose: Critique and rewrite a resume for a target role.
    preamble: |
      Given a resume and a target role, return a revised resume
      and a bullet list of the biggest gaps.

  - name: coach
    provider: anthropic
    model: claude-sonnet-4-5-20250929
    subagents: [resume_critic]
    preamble: |
      Delegate resume work to `resume_critic` when relevant.

See Multi-agent routing for the full subagent walkthrough.

experiments

  • Type: list of experiment configs
  • Optional. Omit (or leave empty) to skip A/B testing.

An experiment wraps two or more agents under one addressable name. Clients send the experiment's name in the model field and the router picks a variant per request. Experiment names share the agent namespace — collisions are rejected at startup.

See Experiments for the end-to-end walkthrough.

Per-experiment fields

| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| bandit_window_seconds | int | no (bandit) | 604800 (7 d) | Bandit-only. Maximum age of scores included in mean-arm computations. |
| epsilon | float | no (bandit) | 0.1 | Bandit-only. Probability in [0.0, 1.0] of routing to a random arm instead of the leader. |
| metric | string | yes (bandit) | | Bandit-only. judge.criterion to optimise. The judge must declare the criterion in its rubrics, and every variant must opt into the judge. |
| min_samples | int | no (bandit) | 30 | Bandit-only. Each arm must accumulate this many scores before exploitation is allowed. |
| name | string | yes | | Addressable name; must not collide with any agent name. |
| primary | string | yes (shadow) | | Shadow-only. Variant agent that serves the user. Must be one of variants. |
| purpose | string | no | | Tool description when the experiment is exposed via another agent's subagents:. |
| sampling_rate | float | no (shadow) | 1.0 | Shadow-only. Probability in [0.0, 1.0] that a turn also runs the non-primary variants in the background. |
| sticky_by_user | bool | no | true | When true, the same user always lands on the same variant (deterministic hash, no DB writes). |
| strategy | enum | yes | | split, shadow, or bandit. |
| variants | list<variant> | yes | | Non-empty. Each entry references an agent. |

variants entry

| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| agent | string | yes | | Name of an agent declared under top-level agents:. Variants must reference concrete agents — nesting an experiment is rejected. |
| weight | float | no | 1.0 | Strictly positive. Normalised against the sum of all variant weights. |

Example

agents:
  - name: assistant-sonnet
    provider: anthropic
    model: claude-sonnet-4-5-20250929
  - name: assistant-gpt
    provider: openai
    model: gpt-4o

experiments:
  - name: assistant
    strategy: split
    variants:
      - agent: assistant-sonnet
        weight: 0.5
      - agent: assistant-gpt
        weight: 0.5

judges

  • Type: list of judge configs
  • Optional. Omit (or leave empty) for no automatic evaluation.

Judges are background LLM-as-judge evaluators. An agent opts in by listing judge names in its own judges: field. See LLM-as-judge evaluation for the full walkthrough.

Per-judge fields

| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| name | string | yes | | Unique judge identifier; agents refer to it here. |
| provider | string | yes | | Must match a key under providers. |
| model | string | yes | | Upstream model identifier for the judge call. |
| rubrics | map<string,string> | yes | | criterion: short description of what to assess. One score row per criterion per scored turn. Must declare at least one entry. |
| sampling_rate | float | no | 1.0 | In [0.0, 1.0]. 1.0 = every turn, 0.1 ≈ 10%, 0.0 = never. |

Rubric descriptions should say what to evaluate — don't include scale, JSON, or format instructions. Coulisse forces the output shape internally (integer 0-10 per criterion with a one-sentence reasoning).

Example

judges:
  - name: quality
    provider: openai
    model: gpt-4o-mini
    sampling_rate: 1.0
    rubrics:
      accuracy:     Factual accuracy. Flag hallucinations.
      helpfulness:  Whether the assistant answered the user's question.
      tone:         Politeness and tone.

smoke_tests

  • Type: list of smoke test configs
  • Optional. Omit (or leave empty) for no synthetic-user runs.

Each entry pairs a persona (an LLM that role-plays the user) with a target agent or experiment. Triggered from the studio at /admin/smoke/<name>. See Smoke tests for the workflow.

Per-test fields

| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| name | string | yes | | Unique within smoke_tests. |
| target | string | yes | | Agent or experiment name. Resolved per run via the experiment router. |
| persona | object | yes | | provider, model, preamble for the role-played user. |
| initial_message | string | no | | Hard-coded first persona turn. Omit to let the persona open the conversation. |
| stop_marker | string | no | | Substring that ends the run when emitted by either side. |
| max_turns | integer | no | 10 | Cap on persona-then-agent pairs per run. |
| repetitions | integer | no | 1 | Independent runs launched per click. Each gets a fresh synthetic user id. |

Example

smoke_tests:
  - name: jobseeker_basic
    target: tremplin
    persona:
      provider: anthropic
      model: claude-haiku-4-5-20251001
      preamble: |
        You are a 28-year-old looking for a developer job in Paris.
        Reply like a real human; finish with "[FIN]" once satisfied.
    initial_message: "Hi, I'm looking for work."
    stop_marker: "[FIN]"
    max_turns: 10
    repetitions: 5

telemetry

  • Type: object
  • Optional. Omit and Coulisse runs with stderr fmt logs at info plus the SQLite mirror that drives the studio UI; no external traces.

The block has three sub-sections — fmt, sqlite, and otlp — each independently toggleable. See Telemetry configuration for the full schema and Telemetry & OpenTelemetry for span semantics and OTLP backend integration.

telemetry:
  fmt:
    enabled: true        # default
  sqlite:
    enabled: true        # default; powers the studio UI
  otlp:                  # absent = no external traces
    endpoint: "http://localhost:4317"
    protocol: grpc       # or http_binary
    service_name: coulisse
    headers:
      authorization: "Bearer ${OTEL_API_KEY}"

Validation

On startup, Coulisse checks:

  • Each present auth scope (proxy, admin) declares exactly one of basic or oidc.
  • auth.<scope>.basic.password and auth.<scope>.basic.username are non-empty.
  • auth.<scope>.oidc.client_id, issuer_url, and redirect_url are non-empty.
  • There is at least one agent.
  • Agent names are unique.
  • Every agent's provider is configured.
  • Every referenced MCP server is configured.
  • Every name in subagents refers to a defined agent or experiment.
  • No agent lists itself under subagents.
  • subagents entries are unique within an agent (no duplicates).
  • Experiment names are unique and do not collide with any agent name.
  • Each experiment declares at least one variant.
  • Each variant references a defined agent and has a strictly positive weight.
  • Variant agents within an experiment are unique.
  • Strategy-specific fields are only set on the matching strategy (e.g. primary only on shadow, metric only on bandit).
  • For shadow: primary is set and matches one of the variants; sampling_rate is in [0.0, 1.0].
  • For bandit: metric is judge.criterion; the judge exists, declares the criterion in its rubrics, and every variant opts into the judge; epsilon is in [0.0, 1.0].
  • Every referenced judge exists.
  • Judge names are unique.
  • Every judge's provider is configured and supported.
  • Every judge has at least one rubric.
  • Every judge's sampling_rate is in [0.0, 1.0].

Any violation fails fast with an error message that names the offending agent or judge and field.

Releasing

Coulisse follows Semantic Versioning. Pre-1.0, minor bumps may include breaking changes to the YAML schema, HTTP surface, or CLI; patch bumps will not.

Cutting a release

  1. Bump the version in the workspace Cargo.toml:

    [workspace.package]
    version = "0.2.0"
    

    All workspace crates inherit this via version.workspace = true, so this is the only place to edit.

  2. Update CHANGELOG.md — rename the ## [Unreleased] section to ## [0.2.0] - YYYY-MM-DD and start a fresh ## [Unreleased] block above it.

  3. Commit, tag, push:

    git commit -am "Release v0.2.0"
    git tag v0.2.0
    git push && git push --tags
    

The v*.*.* tag triggers two workflows:

  • release.yml (cargo-dist) — builds binaries and installers for macOS (x86 + ARM), Linux GNU (x86 + ARM), and Windows MSVC, then publishes them as a GitHub Release with auto-generated notes.
  • docker.yml — builds a multi-arch image and pushes to ghcr.io/almaju/coulisse tagged latest, 0.2, and 0.2.0.

Hotfixes

For patch releases on the latest minor, branch from the previous tag, fix forward, then tag v0.2.1 from that branch. The same workflow handles it.

Roadmap

What's in Coulisse today, and what's coming.

Working today

  • Multi-agent routing via the model field.
  • Agents as tools — expose one agent to another under subagents: with a purpose: description. Nested invocations are bounded by a depth cap.
  • Per-user conversation history with isolation.
  • Long-term memory with semantic recall — persistent via SQLite and backed by a real embedder (OpenAI or Voyage AI; hash fallback for offline dev).
  • Auto-extraction — an optional background task pulls durable facts from each exchange and deduplicates them before storing.
  • Tunable memory budgets (context_budget, memory_budget_fraction, recall_k) in YAML.
  • Multi-backend support (Anthropic, OpenAI, Gemini, Cohere, Deepseek, Groq).
  • OpenAI-compatible HTTP API (/v1/chat/completions, /v1/models).
  • Read-only studio UI at /admin/ for browsing conversations, memories, and judge scores.
  • LLM-as-judge evaluation — background scoring of agent replies against YAML-defined rubrics, with per-judge sampling and per-user persistence.
  • Experiments (A/B testing) — wrap multiple agents under one addressable name and route traffic between them with sticky-by-user defaults. Three strategies: split (weighted random), shadow (primary serves the user, others run in the background and are scored), and bandit (epsilon-greedy on a single judge criterion).
  • Streaming responses over SSE (stream: true, with stream_options.include_usage).
  • MCP tool integration over stdio and HTTP, with per-agent filtering.
  • Per-user token rate limiting (hour / day / month).
  • YAML-driven config with startup validation.
  • Docker image with a volume-mounted SQLite store.

Planned

Durable rate-limit state

Current rate-limit counters live in memory — they reset on restart and don't span multiple instances. A durable, shared backend is planned so quotas survive reboots and horizontal scaling.

Workflow orchestration

Chaining agents into declarative pipelines (one agent's output feeds the next, with conditional routing) — all configured in YAML rather than app code.

Vector index for large memory stores

Recall currently does a linear cosine scan over all memories for the user. Fine at hundreds-to-low-thousands of memories per user, but a vector index will be needed if per-user memory counts grow into the tens of thousands.

Per-agent memory overrides

Today the memory: block is global. A future revision will allow per-agent scoping (different embedders or budgets per agent) for cases where one agent handles long-form research and another handles short user chat.


This list reflects what's on deck at the time of writing — check the repository for the current state.