# Phi-3.5 Vision
Ask questions about images and get text answers using Microsoft's Phi-3.5 Vision model.
See the overview for background on how autoregressive generation differs from single-pass inference.
## Quick example
```java
try (var vision = VisionLanguageModel.builder()
        .model(ModelSources.phi3Vision())
        .build()) {
    GenerationResult result = vision.generate(
        new VisionInput(Path.of("cat.jpg"), "Describe this image."));
    System.out.println(result.text());
}
```
```
The image shows a close-up of an orange tabby cat with a white collar.
The cat's eyes are green, and it has a neutral expression. The background
is blurred, with hints of a red object and a green plant.
```
## Full example
```java
import io.github.inference4j.generation.GenerationResult;
import io.github.inference4j.genai.ModelSources;
import io.github.inference4j.genai.vision.VisionInput;
import io.github.inference4j.genai.vision.VisionLanguageModel;

import java.nio.file.Path;

public class ImageDescription {

    public static void main(String[] args) {
        Path image = Path.of("cat.jpg");

        try (var vision = VisionLanguageModel.builder()
                .model(ModelSources.phi3Vision())
                .build()) {
            // Describe an image
            GenerationResult description = vision.generate(
                new VisionInput(image, "Describe this image."));
            System.out.println(description.text());

            // Ask a question about it
            GenerationResult answer = vision.generate(
                new VisionInput(image, "What colors are prominent in this image?"));
            System.out.println(answer.text());
        }
    }
}
```
## Streaming
Pass a `Consumer<String>` to receive tokens as they are generated:
```java
try (var vision = VisionLanguageModel.builder()
        .model(ModelSources.phi3Vision())
        .build()) {
    vision.generate(
        new VisionInput(Path.of("cat.jpg"), "Describe this image."),
        token -> System.out.print(token));
}
```
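The consumer receives each token as it is produced. If you also want the complete text once streaming finishes, one option is to accumulate tokens inside the consumer itself; a minimal sketch using only the overload shown above:

```java
try (var vision = VisionLanguageModel.builder()
        .model(ModelSources.phi3Vision())
        .build()) {
    StringBuilder transcript = new StringBuilder();
    vision.generate(
        new VisionInput(Path.of("cat.jpg"), "Describe this image."),
        token -> {
            System.out.print(token);   // print tokens as they arrive
            transcript.append(token);  // and keep a copy of the full response
        });
    System.out.println();
    // transcript now holds the complete generated text
}
```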
## Builder options
| Method | Type | Default | Description |
|---|---|---|---|
| `.model(GenerativeModel)` | `GenerativeModel` | — | Preconfigured model from `ModelSources` |
| `.modelId(String)` | `String` | — | HuggingFace model ID (requires `.chatTemplate()`) |
| `.modelSource(ModelSource)` | `ModelSource` | `HuggingFaceModelSource` | Model resolution strategy |
| `.chatTemplate(ChatTemplate)` | `ChatTemplate` | — | Prompt formatting (must include the `<\|image_1\|>` placeholder) |
| `.maxLength(int)` | `int` | `4096` | Maximum total sequence length (input image tokens + output tokens) |
| `.temperature(double)` | `double` | `0.0` | Sampling temperature (0 = greedy) |
| `.topK(int)` | `int` | `0` (disabled) | Top-K sampling |
| `.topP(double)` | `double` | `0.0` (disabled) | Nucleus sampling |
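The defaults give greedy decoding. To get more varied output, enable sampling via the options above; a sketch combining several of them (the values are illustrative, not tuned recommendations):

```java
try (var vision = VisionLanguageModel.builder()
        .model(ModelSources.phi3Vision())
        .maxLength(2048)       // cap total sequence length (image + output tokens)
        .temperature(0.7)      // non-zero temperature enables sampling
        .topP(0.9)             // nucleus sampling
        .build()) {
    GenerationResult result = vision.generate(
        new VisionInput(Path.of("cat.jpg"), "Write a creative caption for this image."));
    System.out.println(result.text());
}
```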
## Methods
| Method | Description |
|---|---|
| `generate(VisionInput)` | Generate text from an image and prompt |
| `generate(VisionInput, Consumer<String>)` | Generate with token streaming |
## How it works
Phi-3.5 Vision is a decoder-only vision-language model. Unlike Whisper (encoder-decoder), it interleaves image tokens with text tokens in a single sequence:
- The image goes through a CLIP ViT encoder to produce visual embeddings
- An MLP projector maps visual embeddings into the language model's token space
- The Phi-3 Mini decoder processes interleaved image + text tokens autoregressively
```mermaid
flowchart LR
    A["Image file"] --> B["onnxruntime-genai<br>CLIP encoder → projector → Phi-3 decoder"]
    C["Text prompt"] --> B
    B --> D["GenerationResult"]
```
All heavy lifting — image preprocessing, vision encoding, embedding projection, autoregressive decoding, KV cache — is handled natively by onnxruntime-genai.
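To make the interleaving concrete, here is a toy sketch of that decoding loop in plain Java. Everything in it is illustrative: the `Decoder` interface and embedding helpers are stand-ins, not library types, and onnxruntime-genai performs the real loop (with its KV cache) natively.

```java
import java.util.ArrayList;
import java.util.List;

public class DecodingSketch {

    /** Stand-in for the Phi-3 decoder: maps an embedding sequence to next-token logits. */
    interface Decoder {
        float[] nextTokenLogits(List<float[]> sequence);
    }

    /**
     * Greedy decoding over an interleaved sequence: the projected image
     * embeddings are spliced in where the image placeholder sits in the prompt.
     */
    static List<Integer> decode(Decoder decoder,
                                List<float[]> imageEmbeddings,
                                List<float[]> promptEmbeddings,
                                int eosTokenId, int maxNewTokens) {
        List<float[]> sequence = new ArrayList<>(imageEmbeddings);
        sequence.addAll(promptEmbeddings);

        List<Integer> generated = new ArrayList<>();
        for (int step = 0; step < maxNewTokens; step++) {
            int next = argmax(decoder.nextTokenLogits(sequence)); // temperature 0 = greedy
            if (next == eosTokenId) break;
            generated.add(next);
            sequence.add(tokenEmbedding(next)); // feed the new token back into the sequence
        }
        return generated;
    }

    static int argmax(float[] logits) {
        int best = 0;
        for (int i = 1; i < logits.length; i++) {
            if (logits[i] > logits[best]) best = i;
        }
        return best;
    }

    /** Toy token embedding; the real model uses a learned embedding table. */
    static float[] tokenEmbedding(int tokenId) {
        return new float[] { tokenId };
    }
}
```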
## Custom models
For a custom vision-language model, provide a model source and chat template:
```java
try (var vision = VisionLanguageModel.builder()
        .modelId("my-org/my-vision-model")
        .chatTemplate(msg ->
            "<|user|>\n<|image_1|>\n" + msg + "\n<|end|>\n<|assistant|>\n")
        .build()) {
    vision.generate(new VisionInput(Path.of("photo.jpg"), "Describe this image."));
}
```
The chat template must include `<|image_1|>` where the image embeddings should be
inserted. The `MultiModalProcessor` handles the substitution internally.
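If your model expects a system prompt, the same lambda extends naturally. A sketch assuming Phi-3-style special tokens (swap in whatever chat format your model was trained on):

```java
try (var vision = VisionLanguageModel.builder()
        .modelId("my-org/my-vision-model")   // hypothetical model ID
        .chatTemplate(msg ->
            "<|system|>\nYou are a concise image analyst.\n<|end|>\n"
            + "<|user|>\n<|image_1|>\n" + msg + "\n<|end|>\n"
            + "<|assistant|>\n")
        .build()) {
    vision.generate(new VisionInput(Path.of("photo.jpg"), "Describe this image."));
}
```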
## Tips
- `temperature(0.0)` (the default) gives deterministic, greedy decoding — best for factual descriptions.
- Use a higher `maxLength` for detailed descriptions and lower values for short answers.
- The model downloads ~3.3 GB on first use (INT4 quantized).
- Reuse `VisionLanguageModel` instances — each one holds three ONNX models in memory (vision encoder, embedding projector, text decoder). A sketch of that reuse pattern follows below.
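One instance can serve many requests; this sketch reuses the model across several image files (the file names are illustrative):

```java
// Build once; the three ONNX models stay loaded for the lifetime of the instance.
try (var vision = VisionLanguageModel.builder()
        .model(ModelSources.phi3Vision())
        .build()) {
    for (String file : new String[] { "cat.jpg", "dog.jpg", "bird.jpg" }) {
        GenerationResult result = vision.generate(
            new VisionInput(Path.of(file), "Describe this image in one sentence."));
        System.out.println(file + ": " + result.text());
    }
}
```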