
Visual Search

Classify images using arbitrary text labels — no training required — powered by CLIP.

Quick start

try (ClipClassifier classifier = ClipClassifier.builder().build()) {
    List<Classification> results = classifier.classify(
            Path.of("photo.jpg"), List.of("cat", "dog", "bird", "car", "airplane"));
    System.out.println(results.get(0).label());      // "cat"
    System.out.println(results.get(0).confidence());  // 0.92
}
Screenshot from showcase app

Zero-shot classification

Unlike traditional image classifiers that are trained on a fixed set of labels, CLIP classifies images against any labels you provide at each call. Just pass the labels you need — no retraining, no fine-tuning, and no need to rebuild the classifier for different label sets:

try (ClipClassifier classifier = ClipClassifier.builder().build()) {
    // Emotion detection
    classifier.classify(image, List.of(
            "a photo of a happy person", "a photo of a sad person",
            "a photo of an angry person", "a photo of a surprised person"));

    // Product categorization
    classifier.classify(image, List.of(
            "a product photo of electronics", "a product photo of clothing",
            "a product photo of furniture", "a product photo of food"));

    // Scene classification
    classifier.classify(image, List.of(
            "a landscape photo of a beach", "a landscape photo of a mountain",
            "a landscape photo of a city", "a landscape photo of a forest"));
}

How it works

CLIP uses two separate encoders — one for images, one for text — trained so that matching image-text pairs produce similar embeddings. ClipClassifier wraps both encoders:

flowchart TD
    Image["Image"]
    Labels["Candidate labels<br><i>'a photo of a cat', 'a photo of a dog', ...</i>"]

    Image --> IE["ClipImageEncoder"]
    Labels --> TE["ClipTextEncoder"]

    IE --> IEmb["Image embedding<br>[512-dim]"]
    TE --> LEmb["Label embeddings<br>[512-dim] x N"]

    IEmb --> Sim["Dot-product similarity"]
    LEmb --> Sim

    Sim --> SM["Softmax"]
    SM --> Result["List&lt;Classification&gt;"]
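
Concretely, once the image and each candidate label have been embedded, the scoring step is just cosine similarity followed by a softmax. Below is a minimal sketch of that math, assuming the embeddings are plain float arrays produced by the same model; the ~100 logit scale used before the softmax is a typical CLIP value and an assumption here, not something this page specifies.

// Sketch of the scoring step: cosine similarity between the image embedding and each
// label embedding, scaled and passed through a softmax.
static float[] labelConfidences(float[] imageEmbedding, List<float[]> labelEmbeddings) {
    double[] logits = new double[labelEmbeddings.size()];
    for (int i = 0; i < labelEmbeddings.size(); i++) {
        float[] label = labelEmbeddings.get(i);
        double dot = 0, imgNorm = 0, lblNorm = 0;
        for (int j = 0; j < imageEmbedding.length; j++) {
            dot += imageEmbedding[j] * label[j];
            imgNorm += imageEmbedding[j] * imageEmbedding[j];
            lblNorm += label[j] * label[j];
        }
        // ~100 is a typical CLIP logit scale; the actual value is learned by the model.
        logits[i] = 100.0 * dot / (Math.sqrt(imgNorm) * Math.sqrt(lblNorm));
    }
    // Softmax turns the similarity scores into confidences that sum to 1.
    double max = Double.NEGATIVE_INFINITY;
    for (double logit : logits) max = Math.max(max, logit);
    double sum = 0;
    double[] exp = new double[logits.length];
    for (int i = 0; i < logits.length; i++) {
        exp[i] = Math.exp(logits[i] - max);
        sum += exp[i];
    }
    float[] confidences = new float[logits.length];
    for (int i = 0; i < logits.length; i++) {
        confidences[i] = (float) (exp[i] / sum);
    }
    return confidences;
}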

Builder options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| modelId(String) | String | inference4j/clip-vit-base-patch32 | HuggingFace model ID |
| modelSource(ModelSource) | ModelSource | HuggingFaceModelSource | Where to load the model from |
| sessionOptions(SessionConfigurer) | SessionConfigurer | Default (CPU) | ONNX Runtime session options |
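
For example, overriding the defaults might look roughly like this. The option names come from the table above; the lambda passed to sessionOptions() assumes the SessionConfigurer hands you ONNX Runtime session options to mutate, which is an assumption rather than something this page documents.

// Sketch: overriding builder defaults. The sessionOptions callback shape is an assumption.
try (ClipClassifier classifier = ClipClassifier.builder()
        .modelId("inference4j/clip-vit-base-patch32")                // HuggingFace model ID
        .sessionOptions(options -> options.setIntraOpNumThreads(4))  // hypothetical configurer usage
        .build()) {
    List<Classification> results =
            classifier.classify(Path.of("photo.jpg"), List.of("cat", "dog", "bird"));
}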

Prompt tips

CLIP was trained on natural-language captions, so full caption-style prompts usually work better as labels than bare nouns. The label text is passed directly to the text encoder — format it however works best for your use case:

| Use case | Label examples | Why |
| --- | --- | --- |
| General objects | "a photo of a cat", "a photo of a dog" | Matches CLIP training data format |
| Fine-grained | "a photo of a tabby cat, a type of pet" | Adds context for disambiguation |
| Scenes | "a beach landscape", "a mountain landscape" | Descriptive captions |
| Actions | "a photo of a person running" | Activity descriptions |
| Styles | "an impressionist style painting" | Art style descriptions |
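
Because labels are plain strings, a convenient pattern is to keep bare class names in your code and wrap them in a prompt template just before calling classify. A small sketch, with classifier and image as in the examples above:

// Sketch: wrap bare class names in a CLIP-style prompt template before classifying.
List<String> classNames = List.of("cat", "dog", "bird");
List<String> labels = classNames.stream()
        .map(name -> "a photo of a " + name)
        .toList();

List<Classification> results = classifier.classify(image, labels);
// label() returns the candidate label as passed, i.e. the full prompt string here.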

API methods

// Primary API — classify with candidate labels
List<Classification> classify(BufferedImage image, List<String> candidateLabels);
List<Classification> classify(BufferedImage image, List<String> candidateLabels, int topK);

// Path convenience overloads
List<Classification> classify(Path imagePath, List<String> candidateLabels);
List<Classification> classify(Path imagePath, List<String> candidateLabels, int topK);

// InferenceTask compatibility
List<Classification> run(ZeroShotInput<BufferedImage> input);
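
The topK overloads cap how many results come back; as the quick start example suggests, the list is ordered with the best match first. For example:

// Keep only the three best matches.
List<Classification> top3 = classifier.classify(
        Path.of("photo.jpg"), List.of("cat", "dog", "bird", "car", "airplane"), 3);
for (Classification c : top3) {
    System.out.printf("%s -> %.2f%n", c.label(), c.confidence());
}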

Advanced: direct encoder access

For use cases beyond classification — image search, image-text similarity, or custom pipelines — use ClipImageEncoder and ClipTextEncoder directly. See the CLIP Encoders reference.
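
For illustration, a text-to-image search loop could look roughly like the sketch below. The encoder builders and encode(...) signatures are assumptions made for the sake of the example; the CLIP Encoders reference documents the actual API.

// Sketch only: encoder construction and encode(...) signatures are assumed, not documented here.
try (ClipImageEncoder imageEncoder = ClipImageEncoder.builder().build();
     ClipTextEncoder textEncoder = ClipTextEncoder.builder().build()) {

    // Embed every image once and keep the vectors around.
    Map<Path, float[]> index = new HashMap<>();
    for (Path path : imagePaths) {
        index.put(path, imageEncoder.encode(path));
    }

    // Embed the query text, then rank images by cosine similarity
    // (see the scoring sketch in "How it works").
    float[] query = textEncoder.encode("a photo of a red bicycle");
}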

Alternative models

The default model is inference4j/clip-vit-base-patch32 (ViT-B/32) — the smallest and fastest variant. You can use other CLIP-compatible models by exporting them to ONNX with the same input/output layout and pointing to them via .modelId() or .modelSource().

Possible variants (not yet tested with inference4j):

| Model | Source | Embedding dim | Notes |
| --- | --- | --- | --- |
| openai/clip-vit-base-patch16 | OpenAI | 512 | 16×16 patches — better quality, ~2× slower |
| openai/clip-vit-large-patch14 | OpenAI | 768 | Best quality from OpenAI, significantly larger |
| laion/CLIP-ViT-B-32-laion2B-s34B-b79K | OpenCLIP | 512 | Trained on LAION-2B, often outperforms OpenAI's original |
| laion/CLIP-ViT-L-14-laion2B-s32B-b82K | OpenCLIP | 768 | Large variant trained on LAION-2B |
| google/siglip-base-patch16-224 | Google | 768 | SigLIP — improved training objective, strong zero-shot performance |

Note

Models with different embedding dimensions (e.g., 768 instead of 512) will work — the wrappers don't assume a fixed size. However, you must use the same model for both image and text encoding since embeddings are only comparable within the same model's vector space.

Model details

| Property | Value |
| --- | --- |
| Architecture | ViT-B/32 (vision) + Transformer (text) |
| Embedding dimensions | 512 |
| Image input | 224×224 RGB, CLIP-normalized |
| Text input | BPE tokenized, max 77 tokens |
| Default model | inference4j/clip-vit-base-patch32 |
| Model size | ~340 MB (vision) + ~255 MB (text) |