
Phi-3.5 Vision

Ask questions about images and get text answers using Microsoft's Phi-3.5 Vision model.

See the overview for background on how autoregressive generation differs from single-pass inference.

Quick example

try (var vision = VisionLanguageModel.builder()
        .model(ModelSources.phi3Vision())
        .build()) {
    GenerationResult result = vision.generate(
            new VisionInput(Path.of("cat.jpg"), "Describe this image."));
    System.out.println(result.text());
}
The image shows a close-up of an orange tabby cat with a white collar.
The cat's eyes are green, and it has a neutral expression. The background
is blurred, with hints of a red object and a green plant.

Full example

import io.github.inference4j.generation.GenerationResult;
import io.github.inference4j.genai.ModelSources;
import io.github.inference4j.genai.vision.VisionInput;
import io.github.inference4j.genai.vision.VisionLanguageModel;
import java.nio.file.Path;

public class ImageDescription {
    public static void main(String[] args) {
        Path image = Path.of("cat.jpg");

        try (var vision = VisionLanguageModel.builder()
                .model(ModelSources.phi3Vision())
                .build()) {

            // Describe an image
            GenerationResult description = vision.generate(
                    new VisionInput(image, "Describe this image."));
            System.out.println(description.text());

            // Ask a question about it
            GenerationResult answer = vision.generate(
                    new VisionInput(image, "What colors are prominent in this image?"));
            System.out.println(answer.text());
        }
    }
}
Screenshot from showcase app

Streaming

Pass a Consumer<String> to receive tokens as they are generated:

try (var vision = VisionLanguageModel.builder()
        .model(ModelSources.phi3Vision())
        .build()) {
    vision.generate(
            new VisionInput(Path.of("cat.jpg"), "Describe this image."),
            token -> System.out.print(token));
}

Builder options

| Method | Type | Default | Description |
| --- | --- | --- | --- |
| .model(GenerativeModel) | GenerativeModel | (none) | Preconfigured model from ModelSources |
| .modelId(String) | String | (none) | HuggingFace model ID (requires .chatTemplate()) |
| .modelSource(ModelSource) | ModelSource | HuggingFaceModelSource | Model resolution strategy |
| .chatTemplate(ChatTemplate) | ChatTemplate | (none) | Prompt formatting (must include the <\|image_1\|> placeholder) |
| .maxLength(int) | int | 4096 | Maximum total sequence length (input image tokens + output tokens) |
| .temperature(double) | double | 0.0 | Sampling temperature (0 = greedy) |
| .topK(int) | int | 0 (disabled) | Top-K sampling |
| .topP(double) | double | 0.0 (disabled) | Nucleus sampling |
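
All of these options compose on the same builder. As a minimal sketch, the snippet below enables sampling and caps the sequence length; the specific values (2048, 0.7, 50, 0.9) are illustrative only, not recommendations.

// Sketch: builder configured for sampled (non-greedy) generation.
// The option values are placeholders chosen for illustration.
try (var vision = VisionLanguageModel.builder()
        .model(ModelSources.phi3Vision())
        .maxLength(2048)     // input image tokens + generated tokens
        .temperature(0.7)    // non-zero temperature: sample instead of greedy decoding
        .topK(50)            // restrict sampling to the 50 most likely tokens
        .topP(0.9)           // nucleus sampling over the top 90% of probability mass
        .build()) {
    GenerationResult result = vision.generate(
            new VisionInput(Path.of("cat.jpg"), "Write a short, playful caption."));
    System.out.println(result.text());
}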

Methods

| Method | Description |
| --- | --- |
| generate(VisionInput) | Generate text from an image and prompt |
| generate(VisionInput, Consumer<String>) | Generate with token streaming |

How it works

Phi-3.5 Vision is a decoder-only vision-language model. Unlike Whisper (encoder-decoder), it interleaves image tokens with text tokens in a single sequence:

  1. The image goes through a CLIP ViT encoder to produce visual embeddings
  2. An MLP projector maps visual embeddings into the language model's token space
  3. The Phi-3 Mini decoder processes interleaved image + text tokens autoregressively
flowchart LR
    A["Image file"] --> B["onnxruntime-genai<br>CLIP encoder → projector → Phi-3 decoder"]
    C["Text prompt"] --> B
    B --> D["GenerationResult"]

All heavy lifting — image preprocessing, vision encoding, embedding projection, autoregressive decoding, KV cache — is handled natively by onnxruntime-genai.

Custom models

For a custom vision-language model, provide a model source and chat template:

try (var vision = VisionLanguageModel.builder()
        .modelId("my-org/my-vision-model")
        .chatTemplate(msg ->
                "<|user|>\n<|image_1|>\n" + msg + "\n<|end|>\n<|assistant|>\n")
        .build()) {
    vision.generate(new VisionInput(Path.of("photo.jpg"), "Describe this image."));
}

The chat template must include <|image_1|> where the image embeddings should be inserted. The MultiModalProcessor handles the substitution internally.

Tips

  • temperature(0.0) (default) gives deterministic, greedy decoding — best for factual descriptions.
  • Use higher maxLength for detailed descriptions and lower values for short answers.
  • The model weights (~3.3 GB, INT4 quantized) are downloaded on first use.
  • Reuse VisionLanguageModel instances — each one holds three ONNX models in memory (vision encoder, embedding projector, text decoder).
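
To make the last tip concrete, here is a minimal sketch that builds one VisionLanguageModel and reuses it across several images; the file names are placeholders.

// Sketch of the "reuse the instance" tip: load the model once,
// then vary only the VisionInput per call.
try (var vision = VisionLanguageModel.builder()
        .model(ModelSources.phi3Vision())
        .build()) {
    for (Path image : java.util.List.of(Path.of("cat.jpg"), Path.of("dog.jpg"))) {
        GenerationResult result = vision.generate(
                new VisionInput(image, "Describe this image."));
        System.out.println(image + ": " + result.text());
    }
}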