# Roadmap
What's in Coulisse today, and what's coming.
## Working today
- Multi-agent routing via the `model` field.
- Agents as tools — expose one agent to another under `subagents:` with a `purpose:` description. Nested invocations are bounded by a depth cap.
- Per-user conversation history with isolation.
- Long-term memory with semantic recall — persistent via SQLite and backed by a real embedder (OpenAI or Voyage AI; `hash` fallback for offline dev).
- Auto-extraction — an optional background task pulls durable facts from each exchange and deduplicates them before storing.
- Tunable memory budgets (`context_budget`, `memory_budget_fraction`, `recall_k`) in YAML.
- Multi-backend support (Anthropic, OpenAI, Gemini, Cohere, Deepseek, Groq).
- OpenAI-compatible HTTP API (`/v1/chat/completions`, `/v1/models`).
- Read-only studio UI at `/admin/` for browsing conversations, memories, and judge scores.
- LLM-as-judge evaluation — background scoring of agent replies against YAML-defined rubrics, with per-judge sampling and per-user persistence.
- Experiments (A/B testing) — wrap multiple agents under one addressable name and route traffic between them with sticky-by-user defaults. Three strategies: `split` (weighted random), `shadow` (primary serves the user, others run in the background and are scored), and `bandit` (epsilon-greedy on a single judge criterion).
- Streaming responses over SSE (`stream: true`, with `stream_options.include_usage`).
- MCP tool integration over stdio and HTTP, with per-agent filtering.
- Per-user token rate limiting (hour / day / month).
- YAML-driven config with startup validation.
- Docker image with a volume-mounted SQLite store.
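To make the pieces above concrete, here is a minimal config sketch. Only the keys named in this document (`model`, `subagents:`, `purpose:`, the `memory:` block with `context_budget`, `memory_budget_fraction`, `recall_k`, and the experiment strategies `split`/`shadow`/`bandit`) come from the feature list; the surrounding structure (`agents:`, `name:`, `embedder:`, `experiments:`, `weight:`) is an illustrative guess, not the project's actual schema — consult the repository's documented config reference for the real shape.

```yaml
# Illustrative sketch only. Keys outside those named in the roadmap
# (agents:, name:, embedder:, experiments:, weight:) are assumptions.
agents:
  - name: researcher
    model: anthropic            # multi-agent routing via the model field
    subagents:
      - name: summarizer
        purpose: Condense long documents into short briefs.

memory:                         # global today; per-agent scoping is planned
  embedder: openai              # or voyage; hash fallback for offline dev
  context_budget: 8000
  memory_budget_fraction: 0.25
  recall_k: 8

experiments:
  - name: chat                  # one addressable name for several agents
    strategy: split             # split | shadow | bandit
    agents:
      - { name: chat-a, weight: 0.5 }
      - { name: chat-b, weight: 0.5 }
```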
## Planned
### Durable rate-limit state
Current rate-limit counters live in memory — they reset on restart and don't span multiple instances. A durable, shared backend is planned so quotas survive reboots and horizontal scaling.
### Workflow orchestration
Chaining agents into declarative pipelines (one agent's output feeds the next, with conditional routing) — all configured in YAML rather than app code.
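One conceivable shape for such a pipeline, purely as a thought experiment — this feature is planned, not implemented, and none of these keys (`workflows:`, `steps:`, `when:`) exist today:

```yaml
# Hypothetical: workflow orchestration is a planned feature.
# No key shown here exists in the current config schema.
workflows:
  - name: research-and-write
    steps:
      - agent: researcher       # this agent's output feeds the next step
      - agent: fact-checker
        when: contains_claims   # conditional routing between steps
      - agent: writer
```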
### Vector index for large memory stores
Recall currently does a linear cosine scan over all memories for the user. Fine at hundreds-to-low-thousands of memories per user, but a vector index will be needed if per-user memory counts grow into the tens of thousands.
### Per-agent memory overrides
Today the `memory:` block is global. A future revision will allow per-agent scoping (different embedders or budgets per agent) for cases where one agent handles long-form research and another handles short user chat.
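A hypothetical shape this could take — per-agent `memory:` overrides do not exist yet, and the nesting shown is an illustrative guess, not a committed design:

```yaml
# Hypothetical: per-agent memory overrides are planned, not implemented.
agents:
  - name: researcher            # long-form research: bigger budget
    memory:
      embedder: voyage
      context_budget: 16000
  - name: chat                  # short user chat: lighter recall
    memory:
      embedder: openai
      recall_k: 4
```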
This list reflects what's on deck at the time of writing — check the repository for the current state.