Tokenizers¶
Transformer models don't work with raw text — they expect sequences of integer token IDs mapped from a fixed vocabulary. A tokenizer bridges this gap: it splits text into tokens, maps each token to its vocabulary index, and produces the metadata tensors (attention_mask, token_type_ids) that the model expects as input.
inference4j handles tokenization automatically. When you call classifier.classify("some text"), the wrapper tokenizes the input, runs inference, and decodes the output — you never touch a token ID. But if you need to customize tokenization or use tokenizers directly, this guide covers how.
Built-in tokenizers¶
inference4j ships five tokenizer implementations, covering the most common algorithms in production transformer models:
| Tokenizer | Algorithm | Models | Vocabulary format |
|---|---|---|---|
| `WordPieceTokenizer` | WordPiece (greedy longest-match subword splitting) | BERT, DistilBERT, MiniLM, SentenceTransformer | `vocab.txt` (one token per line) |
| `BpeTokenizer` | Byte-level BPE (iterative pair merging) | CLIP | `vocab.json` + `merges.txt` |
| `DecodingBpeTokenizer` | Byte-level BPE with decode support | GPT-2, SmolLM2, Qwen2.5, BART | `vocab.json` + `merges.txt` |
| `SentencePieceBpeTokenizer` | SentencePiece BPE (Unicode-native, `▁` space prefix) | Gemma, LLaMA, TinyLlama, MarianMT | `tokenizer.json` |
| `UnigramTokenizer` | SentencePiece Unigram (Viterbi optimal segmentation) | Flan-T5, CoEdIT, T5SqlGenerator | `tokenizer.json` |
WordPiece¶
WordPiece breaks unknown words into known subword units using a `##` continuation prefix. For example, "unbelievable" becomes `["un", "##believ", "##able"]` — preserving meaning even for out-of-vocabulary words.
The encoding pipeline:
- Lowercase and split on whitespace/punctuation
- Split each word into subwords via greedy longest-match against the vocabulary
- Wrap with `[CLS]` and `[SEP]` special tokens
- Truncate to `maxLength` if needed
```java
Tokenizer tokenizer = WordPieceTokenizer.fromVocabFile(Path.of("vocab.txt"));
EncodedInput encoded = tokenizer.encode("Hello world!", 128);
// encoded.inputIds() → [101, 7592, 2088, 999, 102]
// encoded.attentionMask() → [1, 1, 1, 1, 1]
// encoded.tokenTypeIds() → [0, 0, 0, 0, 0]
```
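The greedy longest-match step can be sketched as follows. This is an illustrative standalone sketch, not inference4j's internal code — the class name and the toy three-entry vocabulary are invented for the example; real models load roughly 30k entries from `vocab.txt`:

```java
import java.util.*;

class WordPieceSketch {
    // Greedily take the longest vocabulary entry at each position;
    // non-initial pieces carry the ## continuation prefix.
    static List<String> wordpiece(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            while (end > start) {
                String sub = (start > 0 ? "##" : "") + word.substring(start, end);
                if (vocab.contains(sub)) { match = sub; break; }
                end--;
            }
            if (match == null) return List.of("[UNK]"); // no piece matched at all
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##believ", "##able");
        System.out.println(wordpiece("unbelievable", vocab)); // [un, ##believ, ##able]
    }
}
```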
WordPiece also supports sentence pair encoding for cross-encoder models (e.g., rerankers):
```java
EncodedInput encoded = tokenizer.encode("What is Java?", "Java is a programming language.", 128);
// Format: [CLS] textA [SEP] textB [SEP]
// tokenTypeIds: 0 for textA tokens, 1 for textB tokens
```
Note
The built-in `WordPieceTokenizer` applies unconditional lowercasing, matching `bert-base-uncased` and `distilbert-base-uncased`. It is not suitable for cased models.
Byte-Pair Encoding (BPE)¶
BPE starts from individual characters and iteratively merges the most frequent pairs into subwords. CLIP's variant adds byte-level encoding (handling any UTF-8 input) and `</w>` end-of-word markers.
The encoding pipeline:
- Lowercase and normalize whitespace
- Split via regex into words, contractions, digits, and punctuation
- Encode each byte via GPT-2's byte-to-unicode table
- Apply BPE merges according to the priority table
- Wrap with `<|startoftext|>` and `<|endoftext|>` special tokens
- Pad to `maxLength` (default: 77 for CLIP)
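The merge step can be sketched as follows — class and method names here are illustrative, not inference4j API. `ranks` stands in for the priority table read from `merges.txt`, mapping a `"left right"` pair to its merge rank (lower rank = merged first):

```java
import java.util.*;

class BpeMergeSketch {
    static List<String> applyMerges(List<String> symbols, Map<String, Integer> ranks) {
        List<String> word = new ArrayList<>(symbols);
        while (word.size() > 1) {
            int bestRank = Integer.MAX_VALUE, bestIdx = -1;
            // Find the adjacent pair with the highest-priority (lowest-rank) merge
            for (int i = 0; i < word.size() - 1; i++) {
                Integer r = ranks.get(word.get(i) + " " + word.get(i + 1));
                if (r != null && r < bestRank) { bestRank = r; bestIdx = i; }
            }
            if (bestIdx < 0) break; // no applicable merge left
            word.set(bestIdx, word.get(bestIdx) + word.get(bestIdx + 1));
            word.remove(bestIdx + 1);
        }
        return word;
    }

    public static void main(String[] args) {
        Map<String, Integer> ranks = Map.of("l o", 0, "lo w", 1);
        System.out.println(applyMerges(List.of("l", "o", "w"), ranks)); // [low]
    }
}
```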
```java
Tokenizer tokenizer = BpeTokenizer.fromFiles(
    Path.of("vocab.json"), Path.of("merges.txt"));
EncodedInput encoded = tokenizer.encode("a photo of a cat");
// encoded.inputIds() → [49406, 320, 1125, 539, 320, 2368, 49407, 0, ...]
// encoded.attentionMask() → [1, 1, 1, 1, 1, 1, 1, 0, ...]
```
Decoding BPE¶
DecodingBpeTokenizer extends BpeTokenizer with the ability to decode token IDs back to text. This is required by generative models that produce token IDs as output — autoregressive (GPT-2, SmolLM2, Qwen2.5) and encoder-decoder (BART).
The encoding pipeline is the same as BPE. The decoding pipeline:
- Reverse vocabulary lookup (ID → token string)
- Concatenate tokens
- Decode GPT-2 byte-to-unicode mapping back to raw bytes
- Interpret bytes as UTF-8
```java
DecodingBpeTokenizer tokenizer = DecodingBpeTokenizer.fromFiles(
    Path.of("vocab.json"), Path.of("merges.txt"));

// Encode
EncodedInput encoded = tokenizer.encode("Hello world");

// Decode
String text = tokenizer.decode(new int[]{15496, 995}); // "Hello world"

// Single token (for streaming)
String fragment = tokenizer.decode(15496); // "Hello"
```
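The byte-to-unicode table that step 3 reverses can be sketched as follows. This is the standard GPT-2 algorithm (printable bytes map to themselves; the rest are shifted into unused code points starting at U+0100), written as a standalone illustration rather than inference4j's actual internals:

```java
import java.util.*;

class ByteUnicodeSketch {
    // Build the reversible byte -> unicode mapping used by byte-level BPE.
    static Map<Integer, Character> bytesToUnicode() {
        Map<Integer, Character> map = new LinkedHashMap<>();
        int n = 0;
        for (int b = 0; b < 256; b++) {
            boolean printable = (b >= '!' && b <= '~')
                    || (b >= 0xA1 && b <= 0xAC)
                    || (b >= 0xAE && b <= 0xFF);
            map.put(b, printable ? (char) b : (char) (256 + n++));
        }
        return map;
    }

    public static void main(String[] args) {
        // A space (byte 0x20) is stored in the vocabulary as 'Ġ' (U+0120),
        // which is why GPT-2 vocab entries often start with Ġ.
        System.out.println(bytesToUnicode().get(0x20)); // Ġ
    }
}
```

Decoding inverts this table to recover raw bytes, then interprets the byte sequence as UTF-8.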
SentencePiece BPE¶
SentencePiece BPE operates directly on Unicode text without a pre-tokenization regex. Word boundaries are encoded using the `▁` (U+2581) space prefix, and characters not in the vocabulary fall back to `<0xNN>` byte tokens.
The encoding pipeline:
- Prepend `▁` and replace all spaces with `▁`
- Split on special tokens (added tokens preserved atomically)
- Split into characters and apply BPE merges
- Characters not in vocab → UTF-8 bytes → `<0xNN>` token IDs
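Steps 1 and 4 can be sketched as follows — the helper names are illustrative, not inference4j API:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

class SpmSketch {
    // Step 1: prepend ▁ (U+2581) and replace every space with ▁
    static String toSpmForm(String text) {
        return "\u2581" + text.replace(' ', '\u2581');
    }

    // Step 4: a character missing from the vocab falls back to its UTF-8 bytes,
    // each emitted as a <0xNN> token
    static List<String> byteFallback(String piece) {
        List<String> tokens = new ArrayList<>();
        for (byte b : piece.getBytes(StandardCharsets.UTF_8)) {
            tokens.add(String.format("<0x%02X>", b & 0xFF));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(toSpmForm("Hello world")); // ▁Hello▁world
        System.out.println(byteFallback("é"));        // [<0xC3>, <0xA9>]
    }
}
```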
SentencePieceBpeTokenizer is used automatically by OnnxTextGenerator.tinyLlama(), OnnxTextGenerator.gemma2(), and MarianTranslator. For custom SentencePiece models, use the TokenizerProvider:
```java
try (var gen = OnnxTextGenerator.builder()
        .modelId("my-org/my-sentencepiece-model")
        .tokenizerProvider(SentencePieceBpeTokenizer.provider())
        .chatTemplate(msg -> "<start>" + msg + "<end>")
        .build()) {
    gen.generate("Hello", token -> System.out.print(token));
}
```
Unigram¶
The Unigram algorithm assigns a log-probability score to every token in the vocabulary and uses dynamic programming (Viterbi) to find the segmentation that maximizes the total score. This is used by T5-family models (Flan-T5, CoEdIT, T5SqlGenerator).
The encoding pipeline:
- Prepend `▁` and replace all spaces with `▁`
- Split on added tokens (special tokens preserved atomically)
- Run Viterbi to find the optimal segmentation
- Unmapped characters → UTF-8 bytes → `<0xNN>` token IDs
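The Viterbi step can be sketched as follows. This is a standalone illustration of the algorithm, not inference4j's internal code; the toy score table stands in for the log-probabilities read from `tokenizer.json`, and the sketch assumes the input is fully coverable by the vocabulary:

```java
import java.util.*;

class UnigramSketch {
    // Dynamic programming over prefix lengths: best[i] is the best total
    // log-probability for the first i characters, back[i] the start of the
    // last token in that segmentation.
    static List<String> viterbi(String text, Map<String, Double> scores) {
        int n = text.length();
        double[] best = new double[n + 1];
        int[] back = new int[n + 1];
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                Double s = scores.get(text.substring(start, end));
                if (s != null && best[start] + s > best[end]) {
                    best[end] = best[start] + s;
                    back[end] = start;
                }
            }
        }
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(0, text.substring(back[end], end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // One whole-word token at -3.0 beats "▁" + "hello" at -1.0 + -5.0
        Map<String, Double> scores = Map.of(
            "\u2581hello", -3.0, "\u2581", -1.0, "hello", -5.0);
        System.out.println(viterbi("\u2581hello", scores)); // [▁hello]
    }
}
```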
UnigramTokenizer is used automatically by FlanT5TextGenerator, CoeditGrammarCorrector, and T5SqlGenerator. It reads vocabulary and scores from tokenizer.json.
Default behavior¶
You don't need to configure tokenizers for standard use. Every NLP wrapper auto-loads the correct tokenizer from the model directory during .build():
```java
// WordPiece loaded automatically from vocab.txt
try (var classifier = DistilBertTextClassifier.builder()
        .modelId("inference4j/distilbert-base-uncased-finetuned-sst-2-english")
        .build()) {
    classifier.classify("This movie was fantastic!");
}
```

```java
// BPE loaded automatically from vocab.json + merges.txt
try (var classifier = ClipClassifier.builder().build()) {
    classifier.classify(Path.of("photo.jpg"), List.of("cat", "dog", "bird"));
}
```
The wrapper knows which tokenizer algorithm its model expects and which vocabulary files to look for.
Supplying a custom tokenizer¶
All NLP builders expose a .tokenizer() method that lets you override the default:
```java
Tokenizer myTokenizer = WordPieceTokenizer.fromVocabFile(Path.of("/path/to/my/vocab.txt"));

try (var embedder = SentenceTransformerEmbedder.builder()
        .modelId("my-custom-model")
        .tokenizer(myTokenizer)
        .build()) {
    float[] embedding = embedder.encode("Hello, world!");
}
```
When you provide a tokenizer, the wrapper skips auto-loading and uses yours directly.
When to use a custom tokenizer¶
Shared instances — if you're running multiple wrappers against the same vocabulary, share a single tokenizer instance to avoid loading the vocabulary file multiple times:
```java
Tokenizer shared = WordPieceTokenizer.fromVocabFile(Path.of("vocab.txt"));

try (var embedder = SentenceTransformerEmbedder.builder()
        .modelId("my-model").tokenizer(shared).build();
     var classifier = DistilBertTextClassifier.builder()
        .modelId("my-model").tokenizer(shared).build()) {
    // Both use the same tokenizer instance
}
```
Custom vocabulary — if you've fine-tuned a model with a modified vocabulary, point the tokenizer at your custom vocab.txt:
```java
Tokenizer tokenizer = WordPieceTokenizer.fromVocabFile(Path.of("my-custom-vocab.txt"));

try (var classifier = DistilBertTextClassifier.builder()
        .tokenizer(tokenizer)
        .modelSource(LocalModelSource.of(Path.of("my-finetuned-model")))
        .build()) {
    classifier.classify("custom domain text");
}
```
Testing — supply a mock or stub tokenizer in unit tests to isolate inference logic from tokenization:
```java
Tokenizer stub = text -> new EncodedInput(
    new long[]{101, 7592, 102},
    new long[]{1, 1, 1},
    new long[]{0, 0, 0}
);
```
The EncodedInput record¶
All tokenizers return an EncodedInput containing the three standard tensors that transformer models expect:
```java
public record EncodedInput(
    long[] inputIds,       // token IDs from the vocabulary
    long[] attentionMask,  // 1 for real tokens, 0 for padding
    long[] tokenTypeIds    // segment IDs (0 for first sentence, 1 for second)
) {}
```
| Field | Purpose | Example |
|---|---|---|
| `inputIds` | Maps each token to its vocabulary index | `[101, 7592, 2088, 102]` |
| `attentionMask` | Tells the model which positions are real tokens vs padding | `[1, 1, 1, 1]` |
| `tokenTypeIds` | Distinguishes sentence A from sentence B in pair tasks | `[0, 0, 0, 0]` |
Tips¶
- You almost never need to touch tokenizers. The default auto-loading handles standard HuggingFace models out of the box.
- Tokenizer and model must match. A tokenizer trained on one vocabulary will produce wrong token IDs for a model trained on a different vocabulary. Always use the tokenizer that shipped with your model.
- `maxLength` matters. Most BERT models cap at 512 tokens; CLIP uses 77. Exceeding the model's trained length produces undefined results. The wrappers set this automatically.
- WordPiece does not pad, BPE does. WordPiece returns only the actual tokens (variable length); BPE pads to `maxLength` with zeros. Both behaviors match what their respective model families expect.