Streaming
Stream chat completion responses in real time using server-sent events (SSE).
Streaming delivers tokens to the client as they are generated, instead of waiting for the entire response to complete. This is essential for chat interfaces and real-time applications where you want to display text as it appears.
How it works#
When you set stream: true in your request, the API returns a stream of server-sent events (SSE). Each event contains a chunk of the response with one or more new tokens.
The stream follows this pattern:
- The connection is opened and the server begins generating tokens
- Each generated token (or small group of tokens) is sent as a data: event
- A final chunk includes usage statistics (token counts)
- The stream ends with data: [DONE]
SSE format#
Each event is a line starting with data: followed by a JSON object:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"llama-4-maverick","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"llama-4-maverick","choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"llama-4-maverick","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":2,"total_tokens":12}}
data: [DONE]
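If you are not using an SDK, the format above can be parsed with a few lines of code. The following is a minimal sketch, not an official client: `iter_sse_chunks` is a hypothetical helper name, and the sample lines are trimmed versions of the events shown above.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed JSON chunks from SSE lines until the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        # Skip blank separator lines and anything that is not a data field.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end of stream
        yield json.loads(payload)


# Sample input: already-split lines, as they would arrive from the stream.
sample = [
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]

text = "".join(
    chunk["choices"][0]["delta"].get("content", "")
    for chunk in iter_sse_chunks(sample)
)
print(text)
```

The helper stops at the first `[DONE]` payload, so anything after the sentinel is ignored.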
Chunk structure#
Each chunk has the same structure as a non-streaming response, but uses delta instead of message:
| Field | Description |
|---|---|
| choices[].delta.role | Present only in the first chunk (always "assistant") |
| choices[].delta.content | New token(s) in this chunk (may be empty or absent) |
| choices[].delta.tool_calls | Tool call data (when the model invokes tools) |
| choices[].finish_reason | null during generation; set on the final content chunk |
| usage | Token counts, included in the last chunk before [DONE] |
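To illustrate how these fields fit together, here is a rough sketch that folds a sequence of chunks (as plain dicts) back into a single message-shaped result. `merge_chunks` and the shape of its return value are assumptions made for this example, and it follows only the first choice.

```python
from typing import Iterable, Optional


def merge_chunks(chunks: Iterable[dict]) -> dict:
    """Fold streaming chunks into one message-shaped dict (first choice only)."""
    role: Optional[str] = None
    parts = []
    finish_reason: Optional[str] = None
    usage: Optional[dict] = None

    for chunk in chunks:
        for choice in chunk.get("choices") or []:
            if choice.get("index", 0) != 0:
                continue  # this sketch ignores any additional choices
            delta = choice.get("delta") or {}
            role = delta.get("role") or role          # present in the first chunk only
            parts.append(delta.get("content") or "")  # content may be empty or absent
            finish_reason = choice.get("finish_reason") or finish_reason
        usage = chunk.get("usage") or usage           # arrives in the last chunk

    return {
        "message": {"role": role, "content": "".join(parts)},
        "finish_reason": finish_reason,
        "usage": usage,
    }


chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": " world"}, "finish_reason": None}]},
    {
        "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 10, "completion_tokens": 2, "total_tokens": 12},
    },
]

result = merge_chunks(chunks)
print(result["message"]["content"], result["finish_reason"], result["usage"]["total_tokens"])
```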
Python streaming#
The OpenAI Python SDK handles SSE parsing automatically:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.oru-el.com/v1/inference",
    api_key="oruel_your_api_key_here",
)

stream = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about programming."},
    ],
    stream=True,
    max_tokens=100,
)

# Print each token as soon as it arrives
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()  # newline at the end
Collecting the full response#
from openai import OpenAI

client = OpenAI(
    base_url="https://api.oru-el.com/v1/inference",
    api_key="oruel_your_api_key_here",
)

stream = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Explain recursion."}],
    stream=True,
)

collected_content = []
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        collected_content.append(delta.content)
    # Check for usage in the final chunk
    if chunk.usage:
        print(f"\nTokens used: {chunk.usage.total_tokens}")

full_response = "".join(collected_content)
print(full_response)
Async streaming#
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.oru-el.com/v1/inference",
    api_key="oruel_your_api_key_here",
)

async def main():
    stream = await client.chat.completions.create(
        model="llama-4-maverick",
        messages=[{"role": "user", "content": "Tell me a short story."}],
        stream=True,
    )
    async for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)

asyncio.run(main())
JavaScript streaming#
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.oru-el.com/v1/inference",
  apiKey: "oruel_your_api_key_here",
});

const stream = await client.chat.completions.create({
  model: "llama-4-maverick",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Write a haiku about programming." },
  ],
  stream: true,
  max_tokens: 100,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
console.log();
Browser (fetch API)#
For browser-based applications, you can parse the SSE stream directly:
async function streamChat(messages) {
  const response = await fetch(
    "https://api.oru-el.com/v1/inference/chat/completions",
    {
      method: "POST",
      headers: {
        Authorization: "Bearer oruel_your_api_key_here",
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "llama-4-maverick",
        messages,
        stream: true,
      }),
    }
  );
  // Fail early on HTTP errors instead of trying to parse an error body as SSE
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop(); // keep incomplete line in buffer
    for (const line of lines) {
      // The space after "data:" is optional in SSE, so match the bare prefix
      if (!line.startsWith("data:")) continue;
      const data = line.slice(5).trim();
      if (data === "[DONE]") return;
      const parsed = JSON.parse(data);
      const content = parsed.choices[0]?.delta?.content;
      if (content) {
        // Append to your UI
        document.getElementById("output").textContent += content;
      }
    }
  }
}
cURL streaming#
curl https://api.oru-el.com/v1/inference/chat/completions \
  -H "Authorization: Bearer oruel_your_api_key_here" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "llama-4-maverick",
    "messages": [
      {"role": "user", "content": "Write a haiku about programming."}
    ],
    "stream": true
  }'
The -N flag disables cURL's output buffering so you see tokens as they arrive.
Usage statistics#
Token usage is included in the final chunk of the stream (the chunk just before data: [DONE]). The usage field contains:
{
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 48,
    "total_tokens": 73
  }
}
This is automatically requested by Oru-el via stream_options: { include_usage: true } — you don't need to set this yourself.
Streaming with tool calls#
When a model decides to call a tool during streaming, the tool call is delivered across multiple chunks:
stream = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        for tool_call in delta.tool_calls:
            fn = tool_call.function
            # The function name arrives once; arguments arrive as JSON string fragments
            if fn and fn.name:
                print(f"Tool: {fn.name}")
            if fn and fn.arguments:
                print(fn.arguments, end="", flush=True)
print()
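Because the arguments arrive as fragments, you usually need to reassemble them before parsing the JSON. The sketch below works on plain dicts and assumes the fragment shape used by OpenAI-compatible APIs (an `index`, an `id` and `function.name` on the first fragment, then `function.arguments` pieces). `accumulate_tool_calls` is a hypothetical helper, and the sample fragments are made up for illustration.

```python
import json


def accumulate_tool_calls(deltas):
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls = {}
    for delta in deltas:
        for fragment in delta.get("tool_calls") or []:
            call = calls.setdefault(
                fragment["index"], {"id": None, "name": "", "arguments": ""}
            )
            if fragment.get("id"):
                call["id"] = fragment["id"]
            fn = fragment.get("function") or {}
            if fn.get("name"):
                call["name"] = fn["name"]
            # Argument text is a JSON string split across chunks; concatenate it.
            call["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


# Illustrative deltas (assumed shape, not captured from a real response).
deltas = [
    {"tool_calls": [{"index": 0, "id": "call_1", "type": "function",
                     "function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '{"loc'}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": 'ation": "Tokyo"}'}}]},
    {},  # a delta without tool calls, such as the final chunk
]

tool_calls = accumulate_tool_calls(deltas)
arguments = json.loads(tool_calls[0]["arguments"])
print(tool_calls[0]["name"], arguments)
```

Parse the arguments only after the stream has finished, since a partial fragment is not valid JSON on its own.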
See Tool Calling for the complete guide.
Best practices#
- Always handle the stream end: check for data: [DONE] or finish_reason being set
- Buffer partial JSON: SSE chunks may split across network packets; buffer until you have a complete data: line
- Set max_tokens: prevent unexpectedly long streams by setting a reasonable token limit
- Handle disconnects: if the connection drops, you'll need to make a new request (streams cannot be resumed)
- Use streaming for user-facing UIs: it dramatically improves perceived latency by showing tokens as they arrive
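The buffering advice can be sketched in Python as follows. This is one possible approach, assuming you read raw bytes from the connection yourself; `SSELineBuffer` is a hypothetical class, and the packet boundaries in the sample are invented to show a split in the middle of a line.

```python
import codecs
import json


class SSELineBuffer:
    """Collect raw bytes and return only complete, non-blank lines."""

    def __init__(self):
        # An incremental decoder copes with multi-byte characters split across packets.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._pending = ""

    def feed(self, data: bytes) -> list:
        self._pending += self._decoder.decode(data)
        # Everything before the last newline is complete; keep the rest for later.
        *complete, self._pending = self._pending.split("\n")
        return [line.rstrip("\r") for line in complete if line.strip()]


# Invented packet boundaries that cut through a JSON object and the sentinel.
packets = [
    b'data: {"choices":[{"delta":{"con',
    b'tent":"Hi"}}]}\n\ndata: [DO',
    b"NE]\n\n",
]

buffer = SSELineBuffer()
lines = []
for packet in packets:
    lines.extend(buffer.feed(packet))

first = json.loads(lines[0][len("data:"):])
print(first["choices"][0]["delta"]["content"], lines[-1])
```

Each call to `feed` returns only the lines that are known to be complete, so `json.loads` never sees a partial object.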