Semantic Search

Build a semantic search pipeline using text embeddings and cross-encoder reranking. inference4j provides two complementary classes: SentenceTransformerEmbedder for fast candidate retrieval and MiniLMSearchReranker for precision reranking.

Quick example — embeddings

try (var embedder = SentenceTransformerEmbedder.builder()
        .modelId("inference4j/all-MiniLM-L6-v2").build()) {
    float[] embedding = embedder.encode("Hello, world!");
}

Quick example — reranking

try (var reranker = MiniLMSearchReranker.builder().build()) {
    float score = reranker.score("What is Java?", "Java is a programming language.");
}

Full search pipeline

A typical semantic search pipeline uses embeddings for fast candidate retrieval, then a cross-encoder reranker for precision scoring of the top results.

import io.github.inference4j.nlp.SentenceTransformerEmbedder;
import io.github.inference4j.nlp.MiniLMSearchReranker;

import java.util.List;

public class SemanticSearch {
    public static void main(String[] args) {
        String query = "How do I handle errors in Java?";
        List<String> documents = List.of(
            "Java uses try-catch blocks for exception handling.",
            "Python decorators are a powerful feature.",
            "Error handling in Java includes checked and unchecked exceptions.",
            "The Java Stream API provides functional operations on collections."
        );

        // Stage 1: Embed and retrieve candidates by cosine similarity
        try (var embedder = SentenceTransformerEmbedder.builder()
                .modelId("inference4j/all-MiniLM-L6-v2").build()) {

            float[] queryEmbedding = embedder.encode(query);
            List<float[]> docEmbeddings = embedder.encodeBatch(documents);

            // Rank by cosine similarity (your own similarity function)
            // ...
        }

        // Stage 2: Rerank candidates with the cross-encoder
        // (all documents here for brevity; in practice, only the top-K from Stage 1)
        try (var reranker = MiniLMSearchReranker.builder().build()) {
            float[] scores = reranker.scoreBatch(query, documents);

            for (int i = 0; i < documents.size(); i++) {
                System.out.printf("%.4f  %s%n", scores[i], documents.get(i));
            }
        }
    }
}
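The "rank by cosine similarity" step left as a comment in Stage 1 can be sketched as a plain-Java helper. The `topK` method below is illustrative, not part of the inference4j API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopKRetrieval {
    /** Cosine similarity between two dense vectors of equal length. */
    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Indices of the k document embeddings most similar to the query embedding. */
    static List<Integer> topK(float[] query, List<float[]> docEmbeddings, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < docEmbeddings.size(); i++) indices.add(i);
        // Sort document indices by descending similarity to the query
        indices.sort(Comparator.comparingDouble(
                (Integer i) -> cosineSimilarity(query, docEmbeddings.get(i))).reversed());
        return indices.subList(0, Math.min(k, indices.size()));
    }
}
```

The returned indices can then be used to select the subset of `documents` passed on to the reranker in Stage 2.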

Embedder builder options

| Method | Type | Default | Description |
|---|---|---|---|
| .modelId(String) | String | — (required) | HuggingFace model ID |
| .modelSource(ModelSource) | ModelSource | HuggingFaceModelSource | Model resolution strategy |
| .sessionOptions(SessionConfigurer) | SessionConfigurer | default | ONNX Runtime session config |
| .tokenizer(Tokenizer) | Tokenizer | auto-loaded WordPieceTokenizer | Custom tokenizer |
| .poolingStrategy(PoolingStrategy) | PoolingStrategy | MEAN | Pooling method: CLS, MEAN, or MAX |
| .normalize() | — | disabled | Enables L2 normalization of output embeddings |
| .textPrefix(String) | String | null | Text prefix to prepend before encoding |
| .maxLength(int) | int | 512 | Maximum token sequence length |

Reranker builder options

| Method | Type | Default | Description |
|---|---|---|---|
| .modelId(String) | String | inference4j/ms-marco-MiniLM-L-6-v2 | HuggingFace model ID |
| .modelSource(ModelSource) | ModelSource | HuggingFaceModelSource | Model resolution strategy |
| .sessionOptions(SessionConfigurer) | SessionConfigurer | default | ONNX Runtime session config |
| .tokenizer(Tokenizer) | Tokenizer | auto-loaded WordPieceTokenizer | Custom tokenizer |
| .maxLength(int) | int | 512 | Maximum token sequence length |

Result types

Embeddings

encode() returns a float[] — a dense vector representation of the input text. Use cosine similarity to compare embeddings:

static double cosineSimilarity(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

Reranking scores

score() returns a float in the range [0, 1] — the sigmoid-normalized relevance score for a query-document pair. Higher means more relevant.

scoreBatch() returns a float[] — one score per document, scored against the same query.
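Sigmoid normalization maps the cross-encoder's raw logit into (0, 1). As a minimal sketch of that mapping (illustrative only, not the library's internals):

```java
public class SigmoidScore {
    /** Map a raw relevance logit into the (0, 1) range via the logistic function. */
    static float sigmoid(float logit) {
        return (float) (1.0 / (1.0 + Math.exp(-logit)));
    }
}
```

A logit of 0 maps to 0.5; large positive logits approach 1 (relevant) and large negative logits approach 0 (irrelevant).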

Pooling strategies

The embedder supports three pooling strategies for converting token-level representations into a single sentence embedding:

| Strategy | Description |
|---|---|
| MEAN | Average of all token embeddings (default, best for most tasks) |
| CLS | Uses only the [CLS] token embedding |
| MAX | Element-wise maximum across all token embeddings |
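As a rough sketch of what MEAN pooling computes: average the token-level vectors into one sentence vector, skipping padding positions. The `meanPool` helper and mask convention below are illustrative, not inference4j internals:

```java
public class MeanPooling {
    /**
     * Average token embeddings into a single sentence embedding, counting
     * only positions where the attention mask is 1 (i.e. real tokens,
     * not padding).
     */
    static float[] meanPool(float[][] tokenEmbeddings, int[] attentionMask) {
        int dim = tokenEmbeddings[0].length;
        float[] pooled = new float[dim];
        int count = 0;
        for (int t = 0; t < tokenEmbeddings.length; t++) {
            if (attentionMask[t] == 0) continue; // skip padding tokens
            for (int d = 0; d < dim; d++) pooled[d] += tokenEmbeddings[t][d];
            count++;
        }
        for (int d = 0; d < dim; d++) pooled[d] /= count;
        return pooled;
    }
}
```

CLS pooling would instead return `tokenEmbeddings[0]`, and MAX pooling would take the element-wise maximum over the unmasked rows.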

Available embedding models

| Model | Dim | Pooling | Normalize | Text Prefix | MTEB Avg |
|---|---|---|---|---|---|
| inference4j/all-MiniLM-L6-v2 | 384 | MEAN | Optional | — | 56.26 |
| inference4j/all-mpnet-base-v2 | 768 | MEAN | Optional | — | 57.78 |
| inference4j/bge-base-en-v1.5 | 768 | CLS | Recommended | — | 63.55 |
| inference4j/gte-base | 768 | CLS | Recommended | — | 61.36 |

BGE example

try (var embedder = SentenceTransformerEmbedder.builder()
        .modelId("inference4j/bge-base-en-v1.5")
        .poolingStrategy(PoolingStrategy.CLS)
        .normalize()
        .build()) {
    float[] embedding = embedder.encode("What is machine learning?");
}

E5 example (with text prefix)

E5 models require a text prefix: "query: " for queries, "passage: " for documents.

try (var queryEncoder = SentenceTransformerEmbedder.builder()
        .modelId("inference4j/e5-base-v2")
        .textPrefix("query: ")
        .normalize()
        .build()) {
    float[] queryEmbedding = queryEncoder.encode("What is Java?");
}

Tips

  • Two-stage pipeline: Use embeddings for fast top-K retrieval (cheap cosine similarity), then rerank the top candidates with the cross-encoder (expensive but more accurate).
  • Batch encoding: Use encodeBatch() when encoding multiple texts — more efficient than calling encode() in a loop.
  • L2 normalization: Enable .normalize() when comparing embeddings with cosine similarity. BGE, GTE, and E5 models all recommend normalization.
  • Text prefix: Some models (E5, Nomic) require a text prefix. Check the model card for the correct prefix.
  • Embedding dimension: depends on the model. all-MiniLM-L6-v2 produces 384-dimensional vectors, while BGE, GTE, and mpnet models produce 768-dimensional vectors.
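Conceptually, what `.normalize()` enables is L2 normalization: scaling each embedding to unit length so that a plain dot product equals cosine similarity. A minimal standalone sketch (not the library's implementation):

```java
public class L2Normalize {
    /** Scale a vector to unit Euclidean length. */
    static float[] normalize(float[] v) {
        double norm = 0;
        for (float x : v) norm += x * x;
        norm = Math.sqrt(norm);
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) out[i] = (float) (v[i] / norm);
        return out;
    }
}
```

With normalized embeddings, ranking by dot product and ranking by cosine similarity give identical results, which can simplify and speed up the retrieval stage.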