# Native Text Generation

Generate text with GPT-2, SmolLM2, TinyLlama, Qwen2.5, Gemma 2, and other models using inference4j's native generation loop — no additional dependencies beyond ONNX Runtime.

`OnnxTextGenerator` is the single entry point for all natively supported text generation models. Named presets provide one-liner access to popular models, and the generic builder supports custom models.
## Quick example

```java
// GPT-2 — completion model
try (var gen = OnnxTextGenerator.gpt2().maxNewTokens(50).build()) {
    System.out.println(gen.generate("Once upon a time").text());
}

// SmolLM2-360M — ChatML instruct model
try (var gen = OnnxTextGenerator.smolLM2().maxNewTokens(50).build()) {
    System.out.println(gen.generate("What is the capital of France?").text());
}

// TinyLlama-1.1B-Chat — Zephyr-style instruct model
try (var gen = OnnxTextGenerator.tinyLlama().maxNewTokens(100).build()) {
    System.out.println(gen.generate("Explain gravity").text());
}

// Qwen2.5-1.5B — ChatML instruct model
try (var gen = OnnxTextGenerator.qwen2().maxNewTokens(100).build()) {
    System.out.println(gen.generate("Explain gravity").text());
}
```
## Full example

```java
import io.github.inference4j.generation.GenerationResult;
import io.github.inference4j.nlp.OnnxTextGenerator;

public class TextGeneration {
    public static void main(String[] args) {
        try (var gen = OnnxTextGenerator.qwen2()
                .maxNewTokens(100)
                .temperature(0.8f)
                .topK(50)
                .topP(0.9f)
                .build()) {
            GenerationResult result = gen.generate("The meaning of life is");
            System.out.println(result.text());
            System.out.printf("%d tokens in %,d ms%n",
                    result.generatedTokens(), result.duration().toMillis());
        }
    }
}
```
> **Enable sampling for better output**
>
> GPT-2 defaults to greedy decoding (`temperature=0`), which produces repetitive text. Set `temperature`, `topK`, and `topP` for more coherent output. Instruct models (SmolLM2, Qwen2.5) also benefit from sampling.
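As a plain-Java illustration of what these knobs do (a sketch, not inference4j's actual sampler): temperature rescales the logits before the softmax, and top-k zeroes out everything outside the K most probable tokens before drawing.

```java
import java.util.Random;

public class SamplingSketch {
    // Illustrative temperature + top-k sampling over a logits array.
    static int sample(float[] logits, float temperature, int topK, Random rng) {
        int n = logits.length;
        // Temperature 0 means greedy decoding: always pick the argmax.
        if (temperature <= 0f) {
            int best = 0;
            for (int i = 1; i < n; i++) if (logits[i] > logits[best]) best = i;
            return best;
        }
        // Scale by temperature, then softmax (subtract max for stability).
        double[] probs = new double[n];
        double max = Double.NEGATIVE_INFINITY;
        for (float l : logits) max = Math.max(max, l / temperature);
        double sum = 0;
        for (int i = 0; i < n; i++) {
            probs[i] = Math.exp(logits[i] / temperature - max);
            sum += probs[i];
        }
        for (int i = 0; i < n; i++) probs[i] /= sum;
        // Top-k: drop everything below the k-th largest probability, renormalize.
        if (topK > 0 && topK < n) {
            double[] sorted = probs.clone();
            java.util.Arrays.sort(sorted);
            double cutoff = sorted[n - topK];
            double kept = 0;
            for (int i = 0; i < n; i++) {
                if (probs[i] < cutoff) probs[i] = 0;
                kept += probs[i];
            }
            for (int i = 0; i < n; i++) probs[i] /= kept;
        }
        // Multinomial draw from the filtered distribution.
        double r = rng.nextDouble(), acc = 0;
        for (int i = 0; i < n; i++) {
            acc += probs[i];
            if (r < acc) return i;
        }
        return n - 1;
    }

    public static void main(String[] args) {
        float[] logits = {1.0f, 3.0f, 0.5f, 2.0f};
        System.out.println(sample(logits, 0f, 0, new Random(42))); // greedy -> index 1
    }
}
```

Top-p (nucleus) sampling works analogously, keeping the smallest set of tokens whose probabilities sum to at least P instead of a fixed count.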
## Streaming

Pass a `Consumer<String>` to receive tokens as they are generated:
```java
try (var gen = OnnxTextGenerator.smolLM2()
        .maxNewTokens(100)
        .temperature(0.8f)
        .topK(50)
        .build()) {
    gen.generate("Tell me a joke", token -> System.out.print(token));
}
```
The final `GenerationResult` is still returned after generation completes, containing the full text and timing information.
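Any `Consumer<String>` works as the callback. A common pattern is to echo tokens as they arrive while also accumulating them, shown here with a simulated token stream (plain Java, no inference4j calls):

```java
import java.util.function.Consumer;

public class StreamingConsumer {
    // Builds and drives a Consumer<String> of the same shape that
    // gen.generate(prompt, consumer) expects, returning the accumulated text.
    static String collect(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        Consumer<String> onToken = token -> {
            System.out.print(token); // stream to the console as tokens arrive
            sb.append(token);        // keep the full text for later use
        };
        for (String t : tokens) onToken.accept(t); // simulated token stream
        System.out.println();
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = collect(new String[]{"Hel", "lo", ", ", "world", "!"});
        System.out.println("collected: " + text);
    }
}
```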
## Model presets

| Preset | Model | Parameters | Size | Chat Template |
|---|---|---|---|---|
| `OnnxTextGenerator.gpt2()` | GPT-2 | 124M | ~500 MB | None (completion) |
| `OnnxTextGenerator.smolLM2()` | SmolLM2-360M-Instruct | 360M | ~700 MB | ChatML |
| `OnnxTextGenerator.tinyLlama()` | TinyLlama-1.1B-Chat | 1.1B | ~2.2 GB | Zephyr (`<\|user\|>` / `</s>`) |
| `OnnxTextGenerator.qwen2()` | Qwen2.5-1.5B-Instruct | 1.5B | ~3 GB | ChatML |
| `OnnxTextGenerator.gemma2()` | Gemma 2-2B-IT | 2.6B | ~5 GB | `<start_of_turn>` / `<end_of_turn>` |
> **Gated models**
>
> Gemma 2 is a gated model — you must accept Google's license terms on HuggingFace before downloading. The `gemma2()` preset therefore does not set a model ID for auto-download; provide the model directory yourself via `modelSource()`.
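A sketch of the manual wiring, assuming you have already downloaded the Gemma 2 files. The `LocalDirectoryModelSource` name in the commented-out builder chain is a placeholder, not a confirmed inference4j class; check the `ModelSource` implementations the library actually ships.

```java
import java.nio.file.Path;

public class GemmaLocalModel {
    public static void main(String[] args) {
        // Directory downloaded manually from HuggingFace after accepting the
        // license; it must contain model.onnx, config.json, and tokenizer files.
        Path modelDir = Path.of(System.getProperty("user.home"), "models", "gemma-2-2b-it");
        System.out.println("model dir: " + modelDir);

        // Wire the directory in via modelSource(). The ModelSource implementation
        // named below is hypothetical, shown only to illustrate the shape:
        //
        // try (var gen = OnnxTextGenerator.gemma2()
        //         .modelSource(new LocalDirectoryModelSource(modelDir)) // hypothetical name
        //         .maxNewTokens(100)
        //         .temperature(0.8f)
        //         .build()) {
        //     System.out.println(gen.generate("Explain gravity").text());
        // }
    }
}
```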
## Builder options

| Method | Type | Default | Description |
|---|---|---|---|
| `.modelId(String)` | `String` | Preset-dependent | HuggingFace model ID |
| `.modelSource(ModelSource)` | `ModelSource` | `HuggingFaceModelSource` | Model resolution strategy |
| `.sessionOptions(SessionConfigurer)` | `SessionConfigurer` | — | ONNX Runtime session options (e.g., thread count) |
| `.chatTemplate(ChatTemplate)` | `ChatTemplate` | Preset-dependent | Prompt formatting |
| `.addedToken(String)` | `String` | Preset-dependent | Register a special token for atomic encoding |
| `.tokenizerProvider(TokenizerProvider)` | `TokenizerProvider` | GPT-2 BPE | Tokenizer construction strategy (e.g., SentencePiece for Gemma) |
| `.maxNewTokens(int)` | `int` | 256 | Maximum number of tokens to generate |
| `.temperature(float)` | `float` | 0.0 | Sampling temperature (higher = more random) |
| `.topK(int)` | `int` | 0 (disabled) | Top-K sampling (keep the K most probable tokens) |
| `.topP(float)` | `float` | 0.0 (disabled) | Nucleus sampling (keep the smallest token set whose probabilities sum to P) |
| `.eosTokenId(int)` | `int` | Auto-detected | End-of-sequence token ID (loaded from `config.json`) |
| `.stopSequence(String)` | `String` | — | Stop sequence (can be called multiple times) |
## Result type

`GenerationResult` is a record with:

| Field | Type | Description |
|---|---|---|
| `text()` | `String` | The generated text |
| `promptTokens()` | `int` | Number of tokens in the input prompt |
| `generatedTokens()` | `int` | Number of tokens generated |
| `duration()` | `Duration` | Wall-clock generation time |
## How it works

`OnnxTextGenerator` uses inference4j's native generation engine. The entire autoregressive loop — tokenization, KV cache management, sampling, and decoding — runs in Java, with only the forward passes delegated to ONNX Runtime.
```mermaid
flowchart TD
    A["User prompt"] --> B["Tokenize<br><small>inference4j</small>"]
    B --> C["Forward pass + KV cache<br><small>ONNX Runtime</small>"]
    C --> D["Sample next token<br><small>inference4j</small>"]
    D --> E{"Stop?"}
    E -- No --> C
    E -- Yes --> F["Decode tokens<br><small>inference4j</small>"]
    F --> G["GenerationResult"]
```
See the introduction for a detailed explanation of the autoregressive loop, KV cache, and how native generation compares to onnxruntime-genai.
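The loop in the diagram can be sketched in a few lines of plain Java, with a toy function standing in for the forward pass plus sampling (an illustration of the control flow, not inference4j's engine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class GenerationLoop {
    static final int EOS = -1; // sentinel end-of-sequence id for this sketch

    // Minimal autoregressive loop: the "model" maps the token sequence so far
    // to the next token id (a real model returns logits; sampling picks the id).
    static List<Integer> generate(List<Integer> promptIds,
                                  Function<List<Integer>, Integer> nextToken,
                                  int maxNewTokens) {
        List<Integer> ids = new ArrayList<>(promptIds);
        for (int step = 0; step < maxNewTokens; step++) {
            int next = nextToken.apply(ids); // forward pass + sampling
            if (next == EOS) break;          // stop on end-of-sequence
            ids.add(next);                   // feed the token back in
        }
        return ids;
    }

    public static void main(String[] args) {
        // Toy "model": counts up from the last token, emits EOS after 5.
        Function<List<Integer>, Integer> toy =
                ids -> { int last = ids.get(ids.size() - 1); return last >= 5 ? EOS : last + 1; };
        System.out.println(generate(List.of(1, 2), toy, 10)); // [1, 2, 3, 4, 5]
    }
}
```

The KV cache does not change this control flow; it only lets each forward pass reuse attention state from earlier steps instead of reprocessing the whole sequence.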
## Custom models

Use `OnnxTextGenerator.builder()` for any BPE-based causal LM exported to ONNX with KV cache:
```java
try (var gen = OnnxTextGenerator.builder()
        .modelId("my-org/my-model")
        .addedToken("<|special_start|>")
        .addedToken("<|special_end|>")
        .chatTemplate(msg -> "<|user|>" + msg + "<|assistant|>")
        .temperature(0.7f)
        .maxNewTokens(100)
        .build()) {
    gen.generate("Hello", token -> System.out.print(token));
}
```
By default, the builder uses GPT-2-style BPE (`vocab.json` + `merges.txt`). For SentencePiece models (Gemma, LLaMA, TinyLlama), use `.tokenizerProvider(SentencePieceBpeTokenizer.provider())`, which reads `tokenizer.json` instead.

The model directory must contain `model.onnx` and `config.json`, plus the tokenizer files required by the provider.
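For reference, a chat-template function for a ChatML model (SmolLM2, Qwen2.5) typically wraps the prompt as shown below. The `Function<String, String>` here is a plain-Java stand-in for inference4j's `ChatTemplate` interface, and the exact system-prompt handling varies by model, so check the model card.

```java
import java.util.function.Function;

public class ChatMlTemplate {
    // ChatML wrapping: user turn, then an opened assistant turn for the
    // model to complete (the widely used convention, not a library constant).
    static final Function<String, String> CHATML = msg ->
            "<|im_start|>user\n" + msg + "<|im_end|>\n<|im_start|>assistant\n";

    public static void main(String[] args) {
        System.out.println(CHATML.apply("What is the capital of France?"));
    }
}
```

Registering `<|im_start|>` and `<|im_end|>` via `addedToken()` ensures the tokenizer encodes them atomically rather than splitting them into sub-tokens.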
## Tips

- Use `temperature(0.8f)`, `topK(50)`, `topP(0.9f)` to avoid degenerate repetition from greedy decoding.
- Lower `maxNewTokens` for demos or quick tests — it directly controls how many forward passes run.
- Reuse `OnnxTextGenerator` instances across prompts — each one holds the model and tokenizer in memory.
- Models download on first use and are cached in `~/.cache/inference4j/`.