Inference Parameters

Detailed reference for every inference parameter — temperature, top_p, penalties, and more.

This page documents every parameter you can pass to the chat completions endpoint. Understanding these parameters lets you control the creativity, length, and behavior of model outputs.

Sampling parameters#

These parameters control how the model selects the next token during generation.

temperature#

Type: number (0 to 2) | Default: 0.7

Controls the randomness of the output. Lower values make the model more deterministic and focused; higher values make it more creative and varied.

  • 0 — Deterministic. The model always picks the highest-probability token. Best for factual retrieval and code generation.
  • 0.1 - 0.4 — Low randomness. Focused and consistent. Good for data extraction, classification, and structured tasks.
  • 0.5 - 0.8 — Moderate randomness. Balanced between creativity and coherence. Good for general conversation and writing assistance.
  • 1.0 - 1.5 — High randomness. More creative and varied. Good for brainstorming, creative writing, and generating diverse outputs.
  • 1.5 - 2.0 — Very high randomness. Outputs become increasingly unpredictable. Rarely useful in production.

# Deterministic: the model always picks the highest-probability token
response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    temperature=0,
)

# Creative — varied responses each time
response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Write a creative opening line for a novel."}],
    temperature=1.2,
)
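
To make the effect of temperature concrete, the sketch below applies temperature scaling to a few made-up logits before the softmax. It illustrates the general technique and is not this service's implementation; the function name and the logit values are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0, 0.7, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, while higher temperatures flatten the distribution and give unlikely tokens more of a chance.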

top_p#

Type: number (0 to 1) | Default: 1.0

Nucleus sampling. Instead of considering all possible tokens, the model samples only from the smallest set of most-probable tokens whose cumulative probability reaches top_p.

  • 1.0 — Consider all tokens (no filtering)
  • 0.9 — Consider tokens in the top 90% of probability mass
  • 0.1 — Consider only the most likely tokens covering 10% of probability

Lower top_p values produce more focused outputs. Generally, adjust either temperature or top_p, not both.

# Adjust top_p only; temperature is left at its default
response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Summarize this article."}],
    top_p=0.9,
)
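
The nucleus cutoff described above can be sketched on a toy distribution. This is a minimal illustration that assumes the filter runs on already-normalized probabilities; the function name and the token probabilities are made up.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    # Renormalize so the surviving probabilities sum to 1 again.
    return {token: p / total for token, p in kept.items()}

probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(top_p_filter(probs, 0.9))  # drops the low-probability tail
print(top_p_filter(probs, 0.1))  # keeps only the single most likely token
```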

top_k#

Type: integer (0 to 200) | Default: not set

Limits sampling to the K most probable tokens at each step. This is a fixed-count cutoff, unlike top_p, which keeps a variable number of tokens based on cumulative probability.

  • 1 — Greedy decoding (equivalent to temperature=0)
  • 10 — Only the top 10 tokens are considered
  • 50 — A common value that balances diversity and quality
  • 200 — Very permissive filtering

top_k is applied before top_p. When both are set, the model first takes the top K tokens, then applies nucleus sampling within that set.

response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Generate a creative poem."}],
    extra_body={"top_k": 50},
)
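
As a rough illustration of the fixed-count cutoff, the sketch below keeps the K most probable tokens from two made-up distributions. The function name and the probabilities are hypothetical; a nucleus filter, if also enabled, would then run on the output of this step.

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}

peaked = {"the": 0.90, "a": 0.05, "an": 0.03, "zebra": 0.02}
flat = {"red": 0.25, "blue": 0.25, "green": 0.25, "pink": 0.25}

# The same number of tokens survives regardless of the distribution's shape.
print(top_k_filter(peaked, 2))
print(top_k_filter(flat, 2))
# k=1 leaves a single token, which is greedy decoding.
print(top_k_filter(peaked, 1))
```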

min_p#

Type: number (0 to 1) | Default: not set

Dynamic minimum probability filtering. Tokens with probability less than min_p times the probability of the most likely token are discarded.

Unlike top_k (fixed number) or top_p (fixed cumulative probability), min_p adapts to the model's confidence. When the model is confident (one token has high probability), min_p aggressively filters. When the model is uncertain (probabilities are spread), it allows more diversity.

  • 0.05 — Discard tokens with less than 5% of the top token's probability
  • 0.1 — More aggressive filtering

response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    extra_body={"min_p": 0.05},
)
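
The adaptive behavior of min_p is easier to see on two toy distributions, one confident and one uncertain. This is a minimal sketch of the rule as described above; the function name and the probabilities are invented for illustration.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs.values())
    kept = {token: p for token, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

confident = {"Paris": 0.90, "Lyon": 0.06, "Nice": 0.03, "Rome": 0.01}
uncertain = {"red": 0.30, "blue": 0.28, "green": 0.22, "pink": 0.20}

# With min_p=0.1 the threshold is about 0.09 for the confident case
# and about 0.03 for the uncertain one.
print(min_p_filter(confident, 0.1))  # only the dominant token survives
print(min_p_filter(uncertain, 0.1))  # every token survives
```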

seed#

Type: integer | Default: not set

Seed for the random number generator. When set, repeated requests with the same seed and parameters should produce the same output. Useful for debugging and testing.

Reproducibility is best-effort: identical output is not guaranteed, and how closely repeated runs match depends on the model architecture.

response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    seed=42,
)
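
The idea behind a seed can be shown locally with Python's own random number generator. This sketch only illustrates seeded sampling in general; it says nothing about how the service seeds its sampler, and the function name and probabilities are hypothetical.

```python
import random

def sample_tokens(probs, n, seed=None):
    """Draw n tokens from a distribution, optionally with a fixed seed."""
    rng = random.Random(seed)  # a private generator, so global state is untouched
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=n)

probs = {"knock": 0.4, "why": 0.3, "what": 0.2, "a": 0.1}

first = sample_tokens(probs, 10, seed=42)
second = sample_tokens(probs, 10, seed=42)
print(first == second)  # same seed and same inputs give the same draws
```

With `seed=None` the generator is seeded from system entropy, so repeated calls will usually differ.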

Length parameters#

max_tokens#

Type: integer (1 to 131072) | Default: model-dependent

The maximum number of tokens the model can generate in its response. The actual response may be shorter if the model finishes naturally or hits a stop sequence.

The sum of input tokens and max_tokens cannot exceed the model's context window. If max_tokens is larger than the remaining context, generation stops at the context limit.

# Short response
response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "What is Python?"}],
    max_tokens=50,
)

# Long response
response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Write a detailed tutorial on Docker."}],
    max_tokens=4096,
)
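
The interaction between max_tokens and the context window comes down to simple arithmetic, sketched below. The helper name and the 8192-token window are hypothetical; check the actual context window of the model you use.

```python
def effective_max_tokens(input_tokens, max_tokens, context_window):
    """Return how many tokens can actually be generated for a request."""
    remaining = context_window - input_tokens
    if remaining <= 0:
        raise ValueError("the prompt already fills the context window")
    # Generation is capped by whichever limit is reached first.
    return min(max_tokens, remaining)

CONTEXT_WINDOW = 8192  # hypothetical value for illustration

print(effective_max_tokens(1000, 50, CONTEXT_WINDOW))    # max_tokens is the limit
print(effective_max_tokens(6000, 4096, CONTEXT_WINDOW))  # the context window is the limit
```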

stop#

Type: string or array of strings | Default: not set

Stop sequences are strings that end generation as soon as the model produces any one of them. You can specify a single string or an array of up to 4 strings, each up to 256 characters.

# Stop at a specific marker
response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "List 3 fruits:"}],
    stop=["\n4.", "\n\n"],
)

# Stop at code block end
response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Write a Python function."}],
    stop=["```"],
)
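
The sketch below mimics stop-sequence handling on a finished string, and checks the limits stated above. It assumes the matched stop sequence itself is left out of the returned text, which this page does not specify; the function name is hypothetical.

```python
MAX_STOP_SEQUENCES = 4  # limits stated for the stop parameter
MAX_STOP_LENGTH = 256

def truncate_at_stop(text, stop):
    """Cut text at the earliest occurrence of any stop sequence."""
    sequences = [stop] if isinstance(stop, str) else list(stop)
    if len(sequences) > MAX_STOP_SEQUENCES:
        raise ValueError("at most 4 stop sequences are allowed")
    if any(len(s) > MAX_STOP_LENGTH for s in sequences):
        raise ValueError("each stop sequence may be at most 256 characters")
    cut = len(text)
    for s in sequences:
        index = text.find(s) if s else -1  # ignore empty strings
        if index != -1:
            cut = min(cut, index)
    # Assumption: the stop sequence is excluded from the output.
    return text[:cut]

raw = "1. Apple\n2. Pear\n3. Fig\n4. Kiwi\n5. Plum"
print(truncate_at_stop(raw, ["\n4.", "\n\n"]))
```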

Penalty parameters#

These parameters control repetition in the generated text. They work differently and can be combined.

frequency_penalty#

Type: number (-2 to 2) | Default: 0

Penalizes tokens proportionally to how many times they have already appeared in the output. A token that appears 5 times is penalized 5 times as much as a token that appears once.

  • Positive values (0.1 to 2.0) reduce repetition by discouraging frequently used tokens
  • 0 — No penalty
  • Negative values (-2.0 to 0) encourage repetition

Best for reducing repetitive word patterns.

response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Write a long essay about space."}],
    frequency_penalty=0.5,
)

presence_penalty#

Type: number (-2 to 2) | Default: 0

Penalizes tokens that have appeared at all in the output, regardless of how many times. A token that appeared once is penalized the same as one that appeared 10 times.

  • Positive values (0.1 to 2.0) encourage the model to introduce new topics
  • 0 — No penalty
  • Negative values (-2.0 to 0) encourage staying on topic

Best for encouraging topical diversity.

response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Brainstorm business ideas."}],
    presence_penalty=0.8,
)
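
The difference between the two additive penalties can be sketched on a handful of made-up logits. This assumes the common formulation in which each seen token's logit is reduced by count × frequency_penalty plus a flat presence_penalty; the service's exact formula is not stated on this page, and the function name is hypothetical.

```python
from collections import Counter

def apply_additive_penalties(logits, generated,
                             frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appear in the output."""
    counts = Counter(generated)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token in adjusted:
            # The frequency term grows with the count; the presence term is flat.
            adjusted[token] -= count * frequency_penalty + presence_penalty
    return adjusted

logits = {"space": 3.0, "star": 2.0, "moon": 1.0}
generated = ["space"] * 5 + ["star"]  # "space" used 5 times, "star" once

print(apply_additive_penalties(logits, generated, frequency_penalty=0.5))
print(apply_additive_penalties(logits, generated, presence_penalty=0.8))
```

With the frequency penalty, "space" loses five times as much as "star". With the presence penalty, both lose the same amount, and "moon" is untouched in either case.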

repetition_penalty#

Type: number (0.1 to 3) | Default: 1.0

A multiplicative penalty that divides the logits of previously generated tokens. This is different from frequency_penalty and presence_penalty, which are additive.

  • 1.0 — No penalty
  • 1.1 - 1.3 — Mild reduction in repetition
  • 1.5+ — Strong reduction; may degrade coherence

This parameter is specific to open-source models and may not behave identically to the OpenAI penalty parameters.

response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Write a story."}],
    extra_body={"repetition_penalty": 1.15},
)
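
For comparison with the additive penalties, here is a minimal sketch of a multiplicative penalty. It assumes the convention used by several open-source implementations, where positive logits are divided by the penalty and negative logits are multiplied by it so that both move down; whether this service handles negative logits that way is an assumption. The function name and values are hypothetical.

```python
def apply_repetition_penalty(logits, generated, repetition_penalty=1.0):
    """Scale down the logits of tokens that were already generated."""
    seen = set(generated)
    adjusted = {}
    for token, logit in logits.items():
        if token in seen:
            if logit > 0:
                logit = logit / repetition_penalty
            else:
                # Assumed handling: multiply so a negative logit also gets less likely.
                logit = logit * repetition_penalty
        adjusted[token] = logit
    return adjusted

logits = {"cat": 2.0, "dog": -1.0, "fish": 1.0}
print(apply_repetition_penalty(logits, ["cat", "dog"], repetition_penalty=2.0))
```

Unlike frequency_penalty, the size of the reduction here depends on the logit itself and not on how often the token has appeared.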

Comparing the penalty parameters#

| Parameter | Mechanism | Good for |
| --- | --- | --- |
| frequency_penalty | Additive, proportional to count | Reducing repeated words and phrases |
| presence_penalty | Additive, binary (appeared or not) | Encouraging new topics and variety |
| repetition_penalty | Multiplicative on logits | General anti-repetition for open-source models |

In most cases, using one penalty parameter is sufficient. Start with frequency_penalty: 0.3 for general anti-repetition.

Output format parameters#

response_format#

Type: object | Default: {"type": "text"}

Controls the output format.

  • {"type": "text"} — Default. Free-form text output.
  • {"type": "json_object"} — Forces the model to produce valid JSON. You must also instruct the model to output JSON in your prompt.

See JSON Mode for details.

response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[
        {"role": "system", "content": "Output valid JSON."},
        {"role": "user", "content": "List 3 programming languages with their year of creation."},
    ],
    response_format={"type": "json_object"},
)
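
Even with JSON mode enabled, it can be worth parsing the reply defensively, for example in case the output is cut short by max_tokens (an assumed failure mode, not something this page specifies). The helper below is a hypothetical sketch that works on the reply text, which in the example above would be `response.choices[0].message.content`.

```python
import json

def parse_json_reply(content):
    """Parse a model reply as JSON, returning None if it is not valid JSON."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None

complete = '{"languages": [{"name": "Python", "year": 1991}]}'
truncated = '{"languages": [{"name": "Python", "ye'

print(parse_json_reply(complete))
print(parse_json_reply(truncated))  # None: the caller can retry or raise
```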

stream#

Type: boolean | Default: false

When true, the response is delivered as a stream of server-sent events. See Streaming for details.
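
As a rough sketch of what consuming such a stream involves, the code below reassembles text from server-sent event lines. The event shape (a `data:` prefix, a `choices[0].delta.content` field, and a `[DONE]` sentinel) is assumed from the OpenAI-style streaming format and is not confirmed by this page; the sample events and the function name are made up.

```python
import json

def collect_stream_text(lines):
    """Join the text fragments carried by a sequence of SSE lines."""
    parts = []
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)

# Hypothetical events shaped like OpenAI-style streaming chunks.
events = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(events))
```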

Tool parameters#

tools#

Type: array (max 128 tools) | Default: not set

Defines functions the model can call. See Tool Calling for the complete guide.

tool_choice#

Type: string or object | Default: not set

Controls whether the model calls a tool and, if so, which one:

  • "auto" — Model decides whether to call a tool
  • "none" — Model will not call any tools
  • {"type": "function", "function": {"name": "function_name"}} — Force a specific tool
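
The object form of tool_choice can be built with a small helper that also checks the name against the tools you defined. This is a sketch: the helper and the `get_weather` tool are hypothetical, and the tool definition follows the OpenAI-style function schema, which is assumed here; see Tool Calling for the authoritative format.

```python
MAX_TOOLS = 128  # limit stated for the tools parameter

def force_tool(tools, name):
    """Return a tool_choice object that forces the named function."""
    if len(tools) > MAX_TOOLS:
        raise ValueError("at most 128 tools may be defined")
    defined = {tool["function"]["name"] for tool in tools}
    if name not in defined:
        raise ValueError(f"unknown tool: {name}")
    return {"type": "function", "function": {"name": name}}

# Hypothetical tool definition in the assumed OpenAI-style schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

tool_choice = force_tool(tools, "get_weather")
print(tool_choice)
```

Both values would then be passed as `tools=tools, tool_choice=tool_choice` in the request.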

Routing parameters#

tier#

Type: string | Default: "standard"

Selects the pricing and routing tier:

  • "standard" — Default. Cost-optimized.
  • "turbo" — Low-latency routing. Higher price.

response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"tier": "turbo"},
)
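
Settings by use case#

| Use case | temperature | top_p | max_tokens | Other |
| --- | --- | --- | --- | --- |
| Code generation | 0 - 0.2 | 0.95 | 2048 | stop=["```"] |
| Creative writing | 0.8 - 1.2 | 1.0 | 4096 | frequency_penalty: 0.5 |
| Data extraction | 0 | 1.0 | 1024 | response_format: {"type": "json_object"} |
| Q&A / Factual | 0 - 0.3 | 0.9 | 512 | |
| Summarization | 0.3 - 0.5 | 0.95 | 1024 | |
| Translation | 0 - 0.2 | 1.0 | 2048 | |
| Brainstorming | 1.0 - 1.5 | 1.0 | 2048 | presence_penalty: 0.8 |
| Chatbot | 0.5 - 0.7 | 0.9 | 1024 | frequency_penalty: 0.3 |
| Classification | 0 | 1.0 | 10 | |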
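
One way to put the per-use-case settings above to work is a small preset table in code. The sketch below covers three of the rows; the preset keys and the helper are hypothetical, and where the table gives a range a single value inside it has been picked arbitrarily.

```python
PRESETS = {
    "classification": {"temperature": 0, "top_p": 1.0, "max_tokens": 10},
    "data_extraction": {
        "temperature": 0,
        "top_p": 1.0,
        "max_tokens": 1024,
        "response_format": {"type": "json_object"},
    },
    "chatbot": {
        "temperature": 0.6,  # arbitrary pick from the 0.5 - 0.7 range
        "top_p": 0.9,
        "max_tokens": 1024,
        "frequency_penalty": 0.3,
    },
}

def build_params(use_case, **overrides):
    """Return request parameters for a use case, with optional overrides."""
    params = dict(PRESETS[use_case])  # copy so the preset itself is not mutated
    params.update(overrides)
    return params

print(build_params("classification"))
print(build_params("chatbot", max_tokens=256))
```

The result can be unpacked into a request, for example `client.chat.completions.create(model=..., messages=..., **build_params("chatbot"))`.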