Text Generation

Generate text with decoder-only language models like Phi-3 and DeepSeek-R1, with built-in streaming support.

See the overview for background on how autoregressive generation differs from single-pass inference.

Quick example

try (var generator = TextGenerator.builder()
        .model(ModelSources.phi3Mini())
        .build()) {
    System.out.println(generator.generate("What is Java?").text());
}

Full example

import io.github.inference4j.generation.GenerationResult;
import io.github.inference4j.genai.ModelSources;
import io.github.inference4j.genai.nlp.TextGenerator;

public class TextGeneration {
    public static void main(String[] args) {
        try (var generator = TextGenerator.builder()
                .model(ModelSources.phi3Mini())
                .maxLength(200)
                .temperature(0.7)
                .build()) {

            GenerationResult result = generator.generate("What is Java in one sentence?");

            System.out.println(result.text());
            System.out.printf("%d tokens in %,d ms%n",
                    result.generatedTokens(), result.duration().toMillis());
        }
    }
}
Screenshot from showcase app

Streaming

Pass a Consumer<String> to receive tokens as they are generated:

try (var generator = TextGenerator.builder()
        .model(ModelSources.deepSeekR1_1_5B())
        .maxLength(200)
        .build()) {
    generator.generate("Explain recursion in simple terms.", token -> System.out.print(token));
}

The final GenerationResult is still returned after generation completes, containing the full text and timing information.
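For instance, a caller can stream tokens for responsiveness and still use the returned result for a summary line. A minimal sketch, using only the builder options and accessors documented below:

try (var generator = TextGenerator.builder()
        .model(ModelSources.deepSeekR1_1_5B())
        .maxLength(200)
        .build()) {
    // Tokens print as they arrive; the result is available once generation finishes.
    GenerationResult result = generator.generate(
            "Explain recursion in simple terms.", System.out::print);
    System.out.printf("%n%d tokens in %,d ms%n",
            result.generatedTokens(), result.duration().toMillis());
}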

Builder options

| Method | Type | Default | Description |
| --- | --- | --- | --- |
| .model(GenerativeModel) | GenerativeModel | – | Preconfigured model from ModelSources |
| .modelSource(ModelSource) | ModelSource | – | Custom model source (requires .chatTemplate()) |
| .chatTemplate(ChatTemplate) | ChatTemplate | – | Prompt formatting for custom models |
| .maxLength(int) | int | 1024 | Maximum number of tokens to generate |
| .temperature(double) | double | 1.0 | Sampling temperature (higher = more random) |
| .topK(int) | int | 0 (disabled) | Top-K sampling: keep only the K most probable tokens |
| .topP(double) | double | 0.0 (disabled) | Nucleus sampling: keep the smallest set of tokens whose probabilities sum to at least P |
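The sampling options can be combined. Assuming the runtime applies every enabled filter, here is a sketch with illustrative values (not tuned recommendations):

try (var generator = TextGenerator.builder()
        .model(ModelSources.phi3Mini())
        .maxLength(100)
        .temperature(0.7) // slightly more focused than the 1.0 default
        .topK(50)         // keep only the 50 most probable tokens
        .topP(0.9)        // then keep tokens until probabilities sum to 0.9
        .build()) {
    System.out.println(generator.generate("Summarize Java in one line.").text());
}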

Result type

GenerationResult is a record with four components:

| Field | Type | Description |
| --- | --- | --- |
| text() | String | The generated text |
| promptTokens() | int | Number of tokens in the input prompt (0 if unknown) |
| generatedTokens() | int | Number of tokens generated |
| duration() | Duration | Wall-clock generation time |
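One derived metric worth logging is throughput. A minimal sketch, given a generator from any of the examples above:

// Tokens per second from the record's components; toNanos() keeps
// precision on short runs, and the guard avoids division by zero.
GenerationResult result = generator.generate("What is Java?");
double seconds = result.duration().toNanos() / 1_000_000_000.0;
double tokensPerSecond = seconds > 0 ? result.generatedTokens() / seconds : 0;
System.out.printf("%.1f tokens/s%n", tokensPerSecond);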

How it works

TextGenerator formats the prompt using the model's chat template, then delegates to onnxruntime-genai for tokenization, the autoregressive generation loop, and decoding.

flowchart LR
    A["User prompt"] --> B["ChatTemplate<br>format prompt"]
    B --> C["onnxruntime-genai<br>tokenize → generate → decode"]
    C --> D["GenerationResult"]
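The loop that onnxruntime-genai runs is conceptually small. The following is a toy illustration only, not inference4j or onnxruntime-genai code; nextToken stands in for a forward pass plus sampling:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Autoregressive decoding in miniature: one model call per emitted token,
// which is why output length dominates cost. The real loop also reuses a
// KV cache so earlier tokens are not recomputed on every step.
static List<Integer> generateLoop(List<Integer> promptTokens,
                                  Function<List<Integer>, Integer> nextToken,
                                  int eosId, int maxLength) {
    List<Integer> tokens = new ArrayList<>(promptTokens);
    while (tokens.size() < maxLength) {
        int next = nextToken.apply(tokens); // forward pass + sampling
        if (next == eosId) break;           // stop at end-of-sequence
        tokens.add(next);
    }
    return tokens;
}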

See the overview for a detailed explanation of the generation loop, KV cache, and why this architecture differs from single-pass wrappers.

Tips

  • Generation speed scales with model size. DeepSeek-R1 (1.5B) is noticeably faster than Phi-3 (3.8B) on the same hardware.
  • Lower maxLength for short answers — it bounds the generation loop.
  • temperature below 1.0 gives more focused output; above 1.0 gives more varied output.
  • Reuse TextGenerator instances across prompts rather than rebuilding one per call; each instance holds the model in memory (see the sketch below).
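A minimal sketch of that reuse: build once, generate many times, close when done:

try (var generator = TextGenerator.builder()
        .model(ModelSources.phi3Mini())
        .maxLength(200)
        .build()) {
    // The model loads once at build time; each generate() call reuses it.
    for (String prompt : List.of("What is Java?", "What is the JVM?")) {
        System.out.println(generator.generate(prompt).text());
    }
}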