Streaming responses
Coulisse implements OpenAI's Server-Sent Events (SSE) format for chat completions. Set stream: true in the request and the server emits incremental chat.completion.chunk frames over the wire, making it drop-in compatible with the OpenAI Python and JavaScript SDKs and with any client that already speaks the OpenAI streaming protocol.
Asking for a stream
Add stream: true to a normal /v1/chat/completions request:
{
  "model": "assistant",
  "safety_identifier": "user-123",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": true
}
The response is text/event-stream instead of a single JSON body. Each data: frame carries one chat.completion.chunk.
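With the OpenAI Python SDK pointed at a Coulisse deployment, the same request looks like the sketch below; the base URL and API key are placeholders for your own deployment's values.

from openai import OpenAI

# Placeholder base URL and API key for a Coulisse deployment.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="placeholder")

stream = client.chat.completions.create(
    model="assistant",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

# Each iteration yields one chat.completion.chunk; print the text deltas.
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)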
Wire format
The first frame announces the assistant role:
data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"role":"assistant"}}]}
Then one frame per text delta:
data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"content":"Hello"}}]}
data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{"content":" there"}}]}
A terminal frame sets finish_reason, and the stream ends with the literal [DONE] sentinel:
data: {"id":"chatcmpl-coulisse-...","object":"chat.completion.chunk","created":...,"model":"assistant","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
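A client consuming the raw SSE stream strips the data: prefix from each frame, stops at the [DONE] sentinel, and concatenates the content deltas. Here is a minimal sketch using the requests library, with a placeholder URL:

import json
import requests

# Placeholder endpoint; point this at your Coulisse deployment.
url = "http://localhost:8080/v1/chat/completions"
body = {
    "model": "assistant",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}

with requests.post(url, json=body, stream=True) as resp:
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip blank separator lines between frames
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            print(delta["content"], end="", flush=True)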
Including token usage
Set stream_options.include_usage: true to receive a usage field on the terminal chunk:
{
  "model": "assistant",
  "messages": [{"role": "user", "content": "Hi"}],
  "stream": true,
  "stream_options": {"include_usage": true}
}
The terminal frame then carries usage:
data: {"...":"...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"completion_tokens":3,"prompt_tokens":7,"total_tokens":10}}
When include_usage is missing or false, the usage field is omitted, matching OpenAI's contract.
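With the Python SDK, the option passes straight through. Reusing the client from the earlier sketch, the terminal chunk's usage surfaces as chunk.usage, which is None on every other chunk:

stream = client.chat.completions.create(
    model="assistant",
    messages=[{"role": "user", "content": "Hi"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:  # only set on the terminal chunk
        print(f"\ntokens: prompt={chunk.usage.prompt_tokens} "
              f"completion={chunk.usage.completion_tokens} "
              f"total={chunk.usage.total_tokens}")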
Memory and rate limiting
Streaming responses use the same per-user memory bucket and rate-limit accounting as non-streaming requests:
- The user's message and the assistant's reply are appended to memory after the stream ends.
- Token usage is recorded against the rate-limit window when the stream ends.
- If the client disconnects mid-stream, Coulisse persists the partial assistant reply (everything received before the disconnect). This matches what the user actually saw — the next turn won't claim the model said something the user never received.
Tool-using agents
Agents with MCP tools attached stream the same way. Tool-call internals run inside the rig multi-turn loop and are not surfaced to the client; you'll see a pause while a tool runs, then the model's text continues. The delta.content field is the only delta variant Coulisse currently emits.
Errors mid-stream
If the upstream provider fails after the stream has started, Coulisse emits one terminal frame containing an error field with the failure reason, then [DONE]. Because the HTTP status is already 200 by that point, clients should check the final chunk for an error field.
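In the raw-SSE loop sketched earlier, that means inspecting each decoded chunk for an error key before trusting the stream. The helper below is hypothetical, and since the exact shape of the error payload is not specified here, it assumes only a top-level error field:

import json

def decode_chunk(payload: bytes) -> dict:
    # Hypothetical helper: decode one SSE data payload and fail fast
    # if Coulisse reported a mid-stream error. The HTTP status was
    # already 200 when this frame arrived, so the error field is the
    # only failure signal.
    chunk = json.loads(payload)
    if "error" in chunk:
        raise RuntimeError(f"mid-stream error: {chunk['error']}")
    return chunk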