Streaming

Stream chat completion responses in real time using server-sent events (SSE).

Streaming delivers tokens to the client as they are generated, instead of waiting for the entire response to complete. This is essential for chat interfaces and real-time applications where you want to display text as it appears.

How it works

When you set stream: true in your request, the API returns a stream of server-sent events (SSE). Each event contains a chunk of the response with one or more new tokens.

The stream follows this pattern:

  1. The connection is opened and the server begins generating tokens
  2. Each generated token (or small group of tokens) is sent as a data: event
  3. A final chunk includes usage statistics (token counts)
  4. The stream ends with data: [DONE]

SSE format

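Each event is a single line that starts with data: followed by a JSON object, and events are separated by a blank line. The last event carries the literal [DONE] instead of JSON: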

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"llama-4-maverick","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"llama-4-maverick","choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"llama-4-maverick","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":2,"total_tokens":12}}

data: [DONE]
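
If you are not using an SDK, the format above can be parsed by hand. The following is a minimal sketch, not the reference implementation: the function name iter_sse_chunks and the simulated packets are illustrative assumptions. It buffers incoming bytes until a full line is available, skips anything that is not a data: line, and stops at [DONE].

```python
import codecs
import json
from typing import Iterable, Iterator


def iter_sse_chunks(byte_stream: Iterable[bytes]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line until `[DONE]` is seen."""
    # Incremental decoding copes with multi-byte characters split across reads.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for raw in byte_stream:
        buffer += decoder.decode(raw)
        # Everything before the last newline is complete; keep the remainder.
        *lines, buffer = buffer.split("\n")
        for line in lines:
            line = line.rstrip("\r")
            if not line.startswith("data:"):
                continue  # blank separator lines and any other fields
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":
                return
            yield json.loads(payload)


# Simulated network reads (hypothetical data); the first event is split
# across two packets to exercise the buffering.
packets = [
    b'data: {"choices":[{"index":0,"delta":{"content":"Hel',
    b'lo"},"finish_reason":null}]}\n\n',
    b'data: {"choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}\n\n',
    b"data: [DONE]\n\n",
]

text = "".join(
    chunk["choices"][0]["delta"].get("content", "")
    for chunk in iter_sse_chunks(packets)
)
print(text)  # Hello world
```

In a real client the packets list would be replaced by whatever byte iterator your HTTP library exposes for a streaming response.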

Chunk structure

Each chunk has the same structure as a non-streaming response, but uses delta instead of message:

Field                       Description
choices[].delta.role        Present only in the first chunk (always "assistant")
choices[].delta.content     New token(s) in this chunk (may be empty or absent)
choices[].delta.tool_calls  Tool call data (when the model invokes tools)
choices[].finish_reason     null during generation; set on the final content chunk
usage                       Token counts, included in the last chunk before [DONE]
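
To make the relationship between delta and message concrete, here is a rough sketch that folds a list of chunks back into a non-streaming style message. The helper name assemble_message and the sample chunks are assumptions made for illustration; the chunks are plain dicts shaped like the SSE payloads shown earlier, and only the first choice is tracked.

```python
from typing import Iterable


def assemble_message(chunks: Iterable[dict]) -> dict:
    """Fold streamed chunks into a single message plus its finish_reason."""
    message = {"role": None, "content": ""}
    finish_reason = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            if choice.get("index", 0) != 0:
                continue  # this sketch only tracks the first choice
            delta = choice.get("delta", {})
            if delta.get("role"):
                message["role"] = delta["role"]  # sent once, in the first chunk
            message["content"] += delta.get("content") or ""
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return {"message": message, "finish_reason": finish_reason}


# Hypothetical chunks for illustration.
chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": " world"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]

result = assemble_message(chunks)
print(result["message"]["content"])  # Hello world
```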

Python streaming

The OpenAI Python SDK handles SSE parsing automatically:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.oru-el.com/v1/inference",
    api_key="oruel_your_api_key_here",
)

stream = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about programming."},
    ],
    stream=True,
    max_tokens=100,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

print()  # newline at the end

Collecting the full response

from openai import OpenAI

client = OpenAI(
    base_url="https://api.oru-el.com/v1/inference",
    api_key="oruel_your_api_key_here",
)

stream = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Explain recursion."}],
    stream=True,
)

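collected_content = []
usage = None
for chunk in stream:
    # Guard against chunks without choices (some OpenAI-compatible
    # servers send usage in a chunk whose choices list is empty)
    if chunk.choices:
        delta = chunk.choices[0].delta
        if delta.content:
            collected_content.append(delta.content)
    # Usage arrives in the final chunk before [DONE]
    if chunk.usage:
        usage = chunk.usage

full_response = "".join(collected_content)
print(full_response)
if usage:
    print(f"Tokens used: {usage.total_tokens}")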

Async streaming

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.oru-el.com/v1/inference",
    api_key="oruel_your_api_key_here",
)

async def main():
    stream = await client.chat.completions.create(
        model="llama-4-maverick",
        messages=[{"role": "user", "content": "Tell me a short story."}],
        stream=True,
    )

    async for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)

asyncio.run(main())

JavaScript streaming

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.oru-el.com/v1/inference",
  apiKey: "oruel_your_api_key_here",
});

const stream = await client.chat.completions.create({
  model: "llama-4-maverick",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Write a haiku about programming." },
  ],
  stream: true,
  max_tokens: 100,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}

console.log();

Browser (fetch API)

For browser-based applications, you can parse the SSE stream directly:

async function streamChat(messages) {
  const response = await fetch(
    "https://api.oru-el.com/v1/inference/chat/completions",
    {
      method: "POST",
      headers: {
        Authorization: "Bearer oruel_your_api_key_here",
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "llama-4-maverick",
        messages,
        stream: true,
      }),
    }
  );

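  // Fail loudly on HTTP errors instead of silently reading an error body
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }

  const reader = response.body.getReader();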
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop(); // keep incomplete line in buffer

    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const data = line.slice(6).trim();
      if (data === "[DONE]") return;

      const parsed = JSON.parse(data);
      const content = parsed.choices[0]?.delta?.content;
      if (content) {
        // Append to your UI
        document.getElementById("output").textContent += content;
      }
    }
  }
}

cURL streaming

curl https://api.oru-el.com/v1/inference/chat/completions \
  -H "Authorization: Bearer oruel_your_api_key_here" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "llama-4-maverick",
    "messages": [
      {"role": "user", "content": "Write a haiku about programming."}
    ],
    "stream": true
  }'

The -N flag disables cURL's output buffering so you see tokens as they arrive.

Usage statistics

Token usage is included in the final chunk of the stream (the chunk just before data: [DONE]). The usage field contains:

{
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 48,
    "total_tokens": 73
  }
}

Oru-el requests the usage chunk automatically by setting stream_options: { include_usage: true } on your behalf, so you don't need to set it yourself.
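
As a small illustration of where usage appears, the sketch below scans a list of chunks and returns the usage object from the last chunk that carries one. The helper name final_usage and the sample chunks are hypothetical; the token counts reuse the example above.

```python
from typing import Iterable, Optional


def final_usage(chunks: Iterable[dict]) -> Optional[dict]:
    """Return the usage object from the last chunk that carries one."""
    usage = None
    for chunk in chunks:
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage


# Hypothetical chunks: only the final one includes usage.
chunks = [
    {"choices": [{"index": 0, "delta": {"content": "Hi"}, "finish_reason": None}]},
    {
        "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 25, "completion_tokens": 48, "total_tokens": 73},
    },
]

usage = final_usage(chunks)
print(usage["total_tokens"])  # 73
```

Returning None when no chunk carries usage lets the caller distinguish a stream that was cut short from one that completed.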

Streaming with tool calls

When a model decides to call a tool during streaming, the tool call is delivered across multiple chunks:

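from openai import OpenAI

client = OpenAI(
    base_url="https://api.oru-el.com/v1/inference",
    api_key="oruel_your_api_key_here",
)

stream = client.chat.completions.create(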
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }],
    stream=True,
)

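for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    for tool_call in delta.tool_calls or []:
        fn = tool_call.function
        if fn is None:
            continue
        # The function name arrives once, at the start of a tool call
        if fn.name:
            print(f"\nTool: {fn.name}\nArgs: ", end="")
        # Arguments arrive as fragments of a JSON string
        if fn.arguments:
            print(fn.arguments, end="", flush=True)

print()  # newline at the end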

See Tool Calling for the complete guide.
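
Because the arguments arrive as fragments of a JSON string, they usually have to be joined before they can be parsed. Here is a minimal sketch of that accumulation, assuming each fragment carries an index that identifies the call it belongs to, as in OpenAI-compatible streams. The fragment shape, the helper name accumulate_tool_calls, and the call_1 id are assumptions for illustration, not documented Oru-el behavior.

```python
import json
from typing import Iterable, List


def accumulate_tool_calls(deltas: Iterable[List[dict]]) -> List[dict]:
    """Merge per-chunk tool call fragments into complete calls, keyed by index."""
    calls = {}
    for fragments in deltas:
        for frag in fragments:
            entry = calls.setdefault(
                frag["index"], {"id": None, "name": "", "arguments": ""}
            )
            if frag.get("id"):
                entry["id"] = frag["id"]
            fn = frag.get("function") or {}
            if fn.get("name"):
                entry["name"] = fn["name"]
            # Argument text is concatenated in arrival order.
            entry["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


# Hypothetical delta.tool_calls lists, one per chunk.
fragments = [
    [{"index": 0, "id": "call_1", "function": {"name": "get_weather", "arguments": ""}}],
    [{"index": 0, "function": {"arguments": '{"loca'}}],
    [{"index": 0, "function": {"arguments": 'tion": "Tokyo"}'}}],
]

call = accumulate_tool_calls(fragments)[0]
args = json.loads(call["arguments"])  # parse only once the stream has ended
print(call["name"], args)  # get_weather {'location': 'Tokyo'}
```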

Best practices

  • Always handle the stream end — check for data: [DONE] or finish_reason being set
  • Buffer partial JSON — SSE chunks may split across network packets; buffer until you have a complete data: line
  • Set max_tokens — prevent unexpectedly long streams by setting a reasonable token limit
  • Handle disconnects — if the connection drops, you'll need to make a new request (streams cannot be resumed)
  • Use streaming for user-facing UIs — it dramatically improves perceived latency by showing tokens as they arrive
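
Since a dropped stream cannot be resumed, one way to handle disconnects is to discard the partial output and start a fresh request. The sketch below shows that idea with a generic wrapper. The name stream_with_restart, its parameters, and the flaky_stream stand-in are all hypothetical, and with a real SDK you would pass its connection error type in retry_on rather than the built-in ConnectionError.

```python
import time
from typing import Callable, Iterable, Tuple, Type


def stream_with_restart(
    start_stream: Callable[[], Iterable[str]],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_attempts: int = 3,
    backoff_seconds: float = 0.5,
) -> str:
    """Collect a full streamed response, restarting from scratch after a drop."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        parts = []  # partial output from a failed attempt is discarded
        try:
            for piece in start_stream():
                parts.append(piece)
            return "".join(parts)
        except retry_on:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
    raise AssertionError("loop always returns or raises")


# Stand-in for a real request: drops the connection on the first attempt.
calls = {"count": 0}


def flaky_stream():
    calls["count"] += 1
    yield "Hello"
    if calls["count"] == 1:
        raise ConnectionError("simulated drop")
    yield " world"


result = stream_with_restart(flaky_stream, backoff_seconds=0)
print(result)  # Hello world
```

In a UI that renders tokens as they arrive, a restart also means clearing whatever partial text was already displayed.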