Tokenizers

Transformer models don't work with raw text — they expect sequences of integer token IDs mapped from a fixed vocabulary. A tokenizer bridges this gap: it splits text into tokens, maps each token to its vocabulary index, and produces the metadata tensors (attention_mask, token_type_ids) that the model expects as input.

inference4j handles tokenization automatically. When you call classifier.classify("some text"), the wrapper tokenizes the input, runs inference, and decodes the output — you never touch a token ID. But if you need to customize tokenization or use tokenizers directly, this guide shows you how.

Built-in tokenizers

inference4j ships five tokenizer implementations, covering the most common algorithms in production transformer models:

| Tokenizer | Algorithm | Models | Vocabulary format |
|---|---|---|---|
| WordPieceTokenizer | WordPiece (greedy longest-match subword splitting) | BERT, DistilBERT, MiniLM, SentenceTransformer | vocab.txt (one token per line) |
| BpeTokenizer | Byte-level BPE (iterative pair merging) | CLIP | vocab.json + merges.txt |
| DecodingBpeTokenizer | Byte-level BPE with decode support | GPT-2, SmolLM2, Qwen2.5, BART | vocab.json + merges.txt |
| SentencePieceBpeTokenizer | SentencePiece BPE (Unicode-native, ▁ space prefix) | Gemma, LLaMA, TinyLlama, MarianMT | tokenizer.json |
| UnigramTokenizer | SentencePiece Unigram (Viterbi optimal segmentation) | Flan-T5, CoEdIT, T5SqlGenerator | tokenizer.json |

WordPiece

WordPiece breaks unknown words into known subword units using a ## continuation prefix. For example, "unbelievable" becomes ["un", "##believ", "##able"] — preserving meaning even for out-of-vocabulary words.
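
The greedy longest-match itself is easy to picture. Here is a toy sketch of it (illustration only, not the library's internal code):

import java.util.*;

// Repeatedly take the longest vocabulary entry that prefixes the rest of
// the word; continuation pieces carry the "##" prefix.
static List<String> wordPiece(String word, Set<String> vocab) {
    List<String> tokens = new ArrayList<>();
    int start = 0;
    while (start < word.length()) {
        int end = word.length();
        String match = null;
        while (end > start) {
            String piece = (start > 0 ? "##" : "") + word.substring(start, end);
            if (vocab.contains(piece)) { match = piece; break; }
            end--;
        }
        if (match == null) return List.of("[UNK]"); // no subword fits
        tokens.add(match);
        start = end;
    }
    return tokens;
}

// wordPiece("unbelievable", Set.of("un", "##believ", "##able"))
//   → [un, ##believ, ##able]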

The encoding pipeline:

  1. Lowercase and split on whitespace/punctuation
  2. Split each word into subwords via greedy longest-match against the vocabulary
  3. Wrap with [CLS] and [SEP] special tokens
  4. Truncate to maxLength if needed

Tokenizer tokenizer = WordPieceTokenizer.fromVocabFile(Path.of("vocab.txt"));
EncodedInput encoded = tokenizer.encode("Hello world!", 128);
// encoded.inputIds()      → [101, 7592, 2088, 999, 102]
// encoded.attentionMask() → [1, 1, 1, 1, 1]
// encoded.tokenTypeIds()  → [0, 0, 0, 0, 0]

WordPiece also supports sentence pair encoding for cross-encoder models (e.g., rerankers):

EncodedInput encoded = tokenizer.encode("What is Java?", "Java is a programming language.", 128);
// Format: [CLS] textA [SEP] textB [SEP]
// tokenTypeIds: 0 for textA tokens, 1 for textB tokens

Note

The built-in WordPieceTokenizer applies unconditional lowercasing, matching bert-base-uncased and distilbert-base-uncased. It is not suitable for cased models.

Byte-Pair Encoding (BPE)

BPE starts from individual characters and iteratively merges the most frequent pairs into subwords. CLIP's variant adds byte-level encoding (handling any UTF-8 input) and </w> end-of-word markers.
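
The merge loop is easy to sketch. A toy version (not the library's internals), with a made-up rank table where a lower rank means an earlier, higher-priority merge:

import java.util.*;

// Split into single characters, then repeatedly apply the
// highest-priority (lowest-rank) merge present in the sequence.
static List<String> bpe(String word, Map<String, Integer> ranks) {
    List<String> parts = new ArrayList<>();
    for (char c : word.toCharArray()) parts.add(String.valueOf(c));
    while (parts.size() > 1) {
        int bestRank = Integer.MAX_VALUE, bestIdx = -1;
        for (int i = 0; i < parts.size() - 1; i++) {
            Integer r = ranks.get(parts.get(i) + " " + parts.get(i + 1));
            if (r != null && r < bestRank) { bestRank = r; bestIdx = i; }
        }
        if (bestIdx < 0) break; // no merge applies
        parts.set(bestIdx, parts.get(bestIdx) + parts.get(bestIdx + 1));
        parts.remove(bestIdx + 1);
    }
    return parts;
}

// bpe("lower", Map.of("l o", 0, "lo w", 1, "e r", 2, "low er", 3))
//   → [lower]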

The encoding pipeline:

  1. Lowercase and normalize whitespace
  2. Split via regex into words, contractions, digits, and punctuation
  3. Encode each byte via GPT-2's byte-to-unicode table
  4. Apply BPE merges according to the priority table
  5. Wrap with <|startoftext|> and <|endoftext|> special tokens
  6. Pad to maxLength (default: 77 for CLIP)

Tokenizer tokenizer = BpeTokenizer.fromFiles(
        Path.of("vocab.json"), Path.of("merges.txt"));
EncodedInput encoded = tokenizer.encode("a photo of a cat");
// encoded.inputIds()      → [49406, 320, 1125, 539, 320, 2368, 49407, 0, ...]
// encoded.attentionMask() → [1, 1, 1, 1, 1, 1, 1, 0, ...]

Decoding BPE

DecodingBpeTokenizer extends BpeTokenizer with the ability to decode token IDs back to text. This is required by generative models that produce token IDs as output — autoregressive (GPT-2, SmolLM2, Qwen2.5) and encoder-decoder (BART).

The encoding pipeline is the same as BPE. The decoding pipeline:

  1. Reverse vocabulary lookup (ID → token string)
  2. Concatenate tokens
  3. Decode GPT-2 byte-to-unicode mapping back to raw bytes
  4. Interpret bytes as UTF-8

DecodingBpeTokenizer tokenizer = DecodingBpeTokenizer.fromFiles(
        Path.of("vocab.json"), Path.of("merges.txt"));

// Encode
EncodedInput encoded = tokenizer.encode("Hello world");

// Decode
String text = tokenizer.decode(new int[]{15496, 995}); // "Hello world"

// Single token (for streaming)
String fragment = tokenizer.decode(15496); // "Hello"
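
Step 3 is the non-obvious part of decoding. GPT-2's table gives every byte a visible, reversible character: printable bytes map to themselves and the rest are shifted above U+0100. A minimal sketch of how such a table is built (the library's actual code may differ in detail):

import java.util.*;

// Decoding inverts this map: each character of the token string is turned
// back into a byte, and the byte sequence is then read as UTF-8.
static Map<Integer, Character> byteToUnicode() {
    Map<Integer, Character> table = new LinkedHashMap<>();
    int shift = 0;
    for (int b = 0; b < 256; b++) {
        boolean printable = (b >= '!' && b <= '~')
                || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
        table.put(b, printable ? (char) b : (char) (256 + shift++));
    }
    return table;
}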

SentencePiece BPE

SentencePiece BPE operates directly on Unicode text without a pre-tokenization regex. Word boundaries are encoded using the ▁ (U+2581) space prefix, and characters not in the vocabulary fall back to <0xNN> byte tokens.

The encoding pipeline:

  1. Prepend ▁ and replace all spaces with ▁
  2. Split on special tokens (added tokens preserved atomically)
  3. Split into characters and apply BPE merges
  4. Characters not in vocab → UTF-8 bytes → <0xNN> token IDs
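
Steps 1 and 4 in miniature; the tokenizer does this internally, so the snippet is purely illustrative:

import java.nio.charset.StandardCharsets;

// Step 1: mark word boundaries with ▁ (U+2581)
String prepared = "\u2581" + "Hello world".replace(" ", "\u2581");
// → "▁Hello▁world"

// Step 4: a character missing from the vocab falls back to its UTF-8 bytes
for (byte b : "🙂".getBytes(StandardCharsets.UTF_8)) {
    System.out.printf("<0x%02X>%n", b & 0xFF); // <0xF0> <0x9F> <0x99> <0x82>
}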

SentencePieceBpeTokenizer is used automatically by OnnxTextGenerator.tinyLlama(), OnnxTextGenerator.gemma2(), and MarianTranslator. For custom SentencePiece models, use the TokenizerProvider:

try (var gen = OnnxTextGenerator.builder()
        .modelId("my-org/my-sentencepiece-model")
        .tokenizerProvider(SentencePieceBpeTokenizer.provider())
        .chatTemplate(msg -> "<start>" + msg + "<end>")
        .build()) {
    gen.generate("Hello", token -> System.out.print(token));
}

Unigram

The Unigram algorithm assigns a log-probability score to every token in the vocabulary and uses dynamic programming (Viterbi) to find the segmentation that maximizes the total score. This is used by T5-family models (Flan-T5, CoEdIT, T5SqlGenerator).

The encoding pipeline:

  1. Prepend ▁ and replace all spaces with ▁
  2. Split on added tokens (special tokens preserved atomically)
  3. Run Viterbi to find the optimal segmentation
  4. Unmapped characters → UTF-8 bytes → <0xNN> token IDs
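
Step 3 in miniature: a toy Viterbi pass over a hypothetical vocabulary with made-up log-probability scores (not the library's internals):

import java.util.*;

// best[i] is the best total score over segmentations of the first i
// characters; backPtr[i] records where the winning final token starts.
Map<String, Double> vocab = Map.of(
        "un", -3.0, "believ", -5.0, "able", -4.0, "unbeliev", -9.5);
String text = "unbelievable";
int n = text.length();
double[] best = new double[n + 1];
int[] backPtr = new int[n + 1];
Arrays.fill(best, Double.NEGATIVE_INFINITY);
best[0] = 0;
for (int end = 1; end <= n; end++) {
    for (int start = 0; start < end; start++) {
        Double score = vocab.get(text.substring(start, end));
        if (score != null && best[start] + score > best[end]) {
            best[end] = best[start] + score;
            backPtr[end] = start;
        }
    }
}
// Walk the back pointers to recover the winning segmentation
Deque<String> tokens = new ArrayDeque<>();
for (int i = n; i > 0; i = backPtr[i]) {
    tokens.addFirst(text.substring(backPtr[i], i));
}
// tokens → [un, believ, able] (score -12.0); the alternative
// [unbeliev, able] scores -13.5 and loses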

UnigramTokenizer is used automatically by FlanT5TextGenerator, CoeditGrammarCorrector, and T5SqlGenerator. It reads vocabulary and scores from tokenizer.json.

Default behavior

You don't need to configure tokenizers for standard use. Every NLP wrapper auto-loads the correct tokenizer from the model directory during .build():

// WordPiece loaded automatically from vocab.txt
try (var classifier = DistilBertTextClassifier.builder()
        .modelId("inference4j/distilbert-base-uncased-finetuned-sst-2-english")
        .build()) {
    classifier.classify("This movie was fantastic!");
}

// BPE loaded automatically from vocab.json + merges.txt
try (var classifier = ClipClassifier.builder().build()) {
    classifier.classify(Path.of("photo.jpg"), List.of("cat", "dog", "bird"));
}

The wrapper knows which tokenizer algorithm its model expects and which vocabulary files to look for.

Supplying a custom tokenizer

All NLP builders expose a .tokenizer() method that lets you override the default:

Tokenizer myTokenizer = WordPieceTokenizer.fromVocabFile(Path.of("/path/to/my/vocab.txt"));

try (var embedder = SentenceTransformerEmbedder.builder()
        .modelId("my-custom-model")
        .tokenizer(myTokenizer)
        .build()) {
    float[] embedding = embedder.encode("Hello, world!");
}

When you provide a tokenizer, the wrapper skips auto-loading and uses yours directly.

When to use a custom tokenizer

Shared instances — if you're running multiple wrappers against the same vocabulary, share a single tokenizer instance to avoid loading the vocabulary file multiple times:

Tokenizer shared = WordPieceTokenizer.fromVocabFile(Path.of("vocab.txt"));

try (var embedder = SentenceTransformerEmbedder.builder()
            .modelId("my-model").tokenizer(shared).build();
     var classifier = DistilBertTextClassifier.builder()
            .modelId("my-model").tokenizer(shared).build()) {
    // Both use the same tokenizer instance
}

Custom vocabulary — if you've fine-tuned a model with a modified vocabulary, point the tokenizer at your custom vocab.txt:

Tokenizer tokenizer = WordPieceTokenizer.fromVocabFile(Path.of("my-custom-vocab.txt"));
try (var classifier = DistilBertTextClassifier.builder()
        .tokenizer(tokenizer)
        .modelSource(LocalModelSource.of(Path.of("my-finetuned-model")))
        .build()) {
    classifier.classify("custom domain text");
}

Testing — supply a mock or stub tokenizer in unit tests to isolate inference logic from tokenization:

Tokenizer stub = text -> new EncodedInput(
    new long[]{101, 7592, 102},
    new long[]{1, 1, 1},
    new long[]{0, 0, 0}
);

The EncodedInput record

All tokenizers return an EncodedInput containing the three standard tensors that transformer models expect:

public record EncodedInput(
    long[] inputIds,       // token IDs from the vocabulary
    long[] attentionMask,  // 1 for real tokens, 0 for padding
    long[] tokenTypeIds    // segment IDs (0 for first sentence, 1 for second)
) {}

| Field | Purpose | Example |
|---|---|---|
| inputIds | Maps each token to its vocabulary index | [101, 7592, 2088, 102] |
| attentionMask | Tells the model which positions are real tokens vs padding | [1, 1, 1, 1] |
| tokenTypeIds | Distinguishes sentence A from sentence B in pair tasks | [0, 0, 0, 0] |

Tips

  • You almost never need to touch tokenizers. The default auto-loading handles standard HuggingFace models out of the box.
  • Tokenizer and model must match. A tokenizer trained on one vocabulary will produce wrong token IDs for a model trained on a different vocabulary. Always use the tokenizer that shipped with your model.
  • maxLength matters. Most BERT models use 512 tokens max. CLIP uses 77. Exceeding the model's trained length produces undefined results. The wrappers set this automatically.
  • WordPiece does not pad, BPE does. WordPiece returns only the actual tokens (variable length). BPE pads to maxLength with zeros. Both behaviors match what their respective model families expect.
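
To see the last point concretely, using the factories shown earlier (the lengths match the examples above):

Tokenizer wp = WordPieceTokenizer.fromVocabFile(Path.of("vocab.txt"));
int wpLen = wp.encode("Hello world!", 128).inputIds().length;
// → 5: [CLS] hello world ! [SEP], no padding

Tokenizer bpe = BpeTokenizer.fromFiles(
        Path.of("vocab.json"), Path.of("merges.txt"));
int bpeLen = bpe.encode("a photo of a cat").inputIds().length;
// → 77: real tokens plus zero padding up to CLIP's maxLength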