Semantic Search¶
Build a semantic search pipeline using text embeddings and cross-encoder reranking. inference4j provides two complementary models: SentenceTransformerEmbedder for fast retrieval and MiniLMSearchReranker for precision reranking.
Quick example — embeddings¶
try (var embedder = SentenceTransformerEmbedder.builder()
.modelId("inference4j/all-MiniLM-L6-v2").build()) {
float[] embedding = embedder.encode("Hello, world!");
}
Quick example — reranking¶
try (var reranker = MiniLMSearchReranker.builder().build()) {
float score = reranker.score("What is Java?", "Java is a programming language.");
}
Full search pipeline¶
A typical semantic search pipeline uses embeddings for fast candidate retrieval, then a cross-encoder reranker for precision scoring of the top results.
import io.github.inference4j.nlp.SentenceTransformerEmbedder;
import io.github.inference4j.nlp.MiniLMSearchReranker;
public class SemanticSearch {
public static void main(String[] args) {
String query = "How do I handle errors in Java?";
List<String> documents = List.of(
"Java uses try-catch blocks for exception handling.",
"Python decorators are a powerful feature.",
"Error handling in Java includes checked and unchecked exceptions.",
"The Java Stream API provides functional operations on collections."
);
// Stage 1: Embed and retrieve candidates by cosine similarity
try (var embedder = SentenceTransformerEmbedder.builder()
.modelId("inference4j/all-MiniLM-L6-v2").build()) {
float[] queryEmbedding = embedder.encode(query);
List<float[]> docEmbeddings = embedder.encodeBatch(documents);
// Rank by cosine similarity (your own similarity function)
// ...
}
// Stage 2: Rerank top candidates with cross-encoder
try (var reranker = MiniLMSearchReranker.builder().build()) {
float[] scores = reranker.scoreBatch(query, documents);
for (int i = 0; i < documents.size(); i++) {
System.out.printf("%.4f %s%n", scores[i], documents.get(i));
}
}
}
}
Embedder builder options¶
| Method | Type | Default | Description |
|---|---|---|---|
.modelId(String) |
String |
inference4j/all-MiniLM-L6-v2 |
HuggingFace model ID |
.modelSource(ModelSource) |
ModelSource |
HuggingFaceModelSource |
Model resolution strategy |
.sessionOptions(SessionConfigurer) |
SessionConfigurer |
default | ONNX Runtime session config |
.tokenizer(Tokenizer) |
Tokenizer |
auto-loaded WordPieceTokenizer |
Custom tokenizer |
.poolingStrategy(PoolingStrategy) |
PoolingStrategy |
MEAN |
Pooling method: CLS, MEAN, or MAX |
.maxLength(int) |
int |
512 |
Maximum token sequence length |
Reranker builder options¶
| Method | Type | Default | Description |
|---|---|---|---|
.modelId(String) |
String |
inference4j/ms-marco-MiniLM-L-6-v2 |
HuggingFace model ID |
.modelSource(ModelSource) |
ModelSource |
HuggingFaceModelSource |
Model resolution strategy |
.sessionOptions(SessionConfigurer) |
SessionConfigurer |
default | ONNX Runtime session config |
.tokenizer(Tokenizer) |
Tokenizer |
auto-loaded WordPieceTokenizer |
Custom tokenizer |
.maxLength(int) |
int |
512 |
Maximum token sequence length |
Result types¶
Embeddings¶
encode() returns a float[] — a dense vector representation of the input text. Use cosine similarity to compare embeddings:
static double cosineSimilarity(float[] a, float[] b) {
double dot = 0, normA = 0, normB = 0;
for (int i = 0; i < a.length; i++) {
dot += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
Reranking scores¶
score() returns a float in the range [0, 1] — the sigmoid-normalized relevance score for a query-document pair. Higher means more relevant.
scoreBatch() returns a float[] — one score per document, scored against the same query.
Pooling strategies¶
The embedder supports three pooling strategies for converting token-level representations into a single sentence embedding:
| Strategy | Description |
|---|---|
MEAN |
Average of all token embeddings (default, best for most tasks) |
CLS |
Uses only the [CLS] token embedding |
MAX |
Element-wise maximum across all token embeddings |
Tips¶
- Two-stage pipeline: Use embeddings for fast top-K retrieval (cheap cosine similarity), then rerank the top candidates with the cross-encoder (expensive but more accurate).
- Batch encoding: Use
encodeBatch()when encoding multiple texts — more efficient than callingencode()in a loop. - Embedding dimension depends on the model: all-MiniLM-L6-v2 produces 384-dimensional vectors.