Machine Translation¶
Translate text between languages using MarianMT (fixed language pairs) or Flan-T5 (flexible, any-to-any).
Full example¶
```java
import io.github.inference4j.generation.GenerationResult;
import io.github.inference4j.nlp.MarianTranslator;

public class Translation {
    public static void main(String[] args) {
        try (var translator = MarianTranslator.builder()
                .modelId("inference4j/opus-mt-en-de")
                .maxNewTokens(200)
                .build()) {

            GenerationResult result = translator.translate(
                "Machine learning is transforming how we build software.",
                token -> System.out.print(token));

            System.out.println();
            System.out.printf("%d tokens in %,d ms%n",
                result.generatedTokens(), result.duration().toMillis());
        }
    }
}
```
Flexible translation with Flan-T5¶
`FlanT5TextGenerator` implements the `Translator` interface and can translate between any pair of languages using a single model:
```java
import io.github.inference4j.nlp.FlanT5TextGenerator;
import io.github.inference4j.nlp.Language;

try (var translator = FlanT5TextGenerator.flanT5Base()
        .maxNewTokens(200)
        .build()) {

    // English to French
    String french = translator.translate("Hello, how are you?",
        Language.EN, Language.FR);

    // English to German
    String german = translator.translate("Hello, how are you?",
        Language.EN, Language.DE);

    // French to Spanish
    String spanish = translator.translate("Bonjour, comment allez-vous?",
        Language.FR, Language.ES);
}
```
Supported languages¶
The `Language` enum provides constants for the most widely spoken languages. More languages will be added in future releases.

| Constant | Language |
|---|---|
| `EN` | English |
| `FR` | French |
| `DE` | German |
| `ES` | Spanish |
| `PT` | Portuguese |
| `PT_BR` | Brazilian Portuguese |
| `IT` | Italian |
| `NL` | Dutch |
| `CA` | Catalan |
| `SV` | Swedish |
| `DA` | Danish |
| `NO` | Norwegian |
| `FI` | Finnish |
| `PL` | Polish |
| `CS` | Czech |
| `HR` | Croatian |
| `RO` | Romanian |
| `RU` | Russian |
| `UK` | Ukrainian |
| `TR` | Turkish |
| `JA` | Japanese |
| `KO` | Korean |
| `AR` | Arabic |
| `ZH_CN` | Chinese Simplified |
| `ZH_TW` | Chinese Traditional |
| `HI` | Hindi |
Each constant provides `displayName()` (e.g., `"Brazilian Portuguese"`) and `isoCode()` (e.g., `"pt-br"`).
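To illustrate the naming and accessor pattern, here is a minimal stand-in enum (illustrative only, not the library's `io.github.inference4j.nlp.Language` source; the constant/value pairs are taken from the table above):

```java
// Illustrative stand-in mirroring the documented displayName()/isoCode() accessors.
enum Lang {
    EN("English", "en"),
    PT_BR("Brazilian Portuguese", "pt-br"),
    ZH_CN("Chinese Simplified", "zh-cn");

    private final String displayName;
    private final String isoCode;

    Lang(String displayName, String isoCode) {
        this.displayName = displayName;
        this.isoCode = isoCode;
    }

    String displayName() { return displayName; }
    String isoCode() { return isoCode; }
}

public class Main {
    public static void main(String[] args) {
        // Underscored constants map to hyphenated ISO codes.
        System.out.println(Lang.PT_BR.displayName() + " / " + Lang.PT_BR.isoCode());
    }
}
```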
Builder options¶
| Method | Type | Default | Description |
|---|---|---|---|
| `.modelId(String)` | `String` | — (required for MarianMT) | HuggingFace model ID (e.g., `inference4j/opus-mt-en-fr`) |
| `.modelSource(ModelSource)` | `ModelSource` | `HuggingFaceModelSource` | Model resolution strategy |
| `.sessionOptions(SessionConfigurer)` | `SessionConfigurer` | default | ONNX Runtime session config |
| `.tokenizerProvider(TokenizerProvider)` | `TokenizerProvider` | `SentencePieceBpeTokenizer` | Tokenizer construction strategy |
| `.maxNewTokens(int)` | `int` | `256` | Maximum tokens to generate |
| `.temperature(float)` | `float` | `0.0` | Sampling temperature |
| `.topK(int)` | `int` | `0` (disabled) | Top-K sampling |
| `.topP(float)` | `float` | `0.0` (disabled) | Nucleus sampling |
| `.eosTokenId(int)` | `int` | Auto-detected | End-of-sequence token ID |
| `.addedToken(String)` | `String` | — | Register a special token for atomic encoding |
Result type¶
`GenerationResult` is a record with:
| Field | Type | Description |
|---|---|---|
| `text()` | `String` | The translated text |
| `promptTokens()` | `int` | Number of tokens in the input |
| `generatedTokens()` | `int` | Number of tokens generated |
| `duration()` | `Duration` | Wall-clock generation time |
The convenience method `translate(text)` returns the translation as a plain `String`.
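The record's shape can be sketched with a stand-in (illustrative, not the library's source; the component names come from the table above):

```java
import java.time.Duration;

// Illustrative stand-in with the same components as the documented record.
record GenerationResult(String text, int promptTokens, int generatedTokens, Duration duration) {}

public class Demo {
    public static void main(String[] args) {
        var result = new GenerationResult("Guten Morgen!", 4, 5, Duration.ofMillis(120));
        // Record accessors use the component names directly (no get- prefix).
        System.out.printf("%s (%d tokens in %,d ms)%n",
            result.text(), result.generatedTokens(), result.duration().toMillis());
    }
}
```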
Using your own MarianMT model¶
The pre-exported models under `inference4j/opus-mt-*` work out of the box. If you want to use a different MarianMT language pair (e.g., `Helsinki-NLP/opus-mt-en-ja`), you'll need to export it yourself.
`MarianTranslator` expects the model directory to contain:
| File | Description |
|---|---|
| `encoder_model.onnx` | Encoder ONNX model |
| `decoder_model.onnx` | Decoder ONNX model |
| `decoder_with_past_model.onnx` | Decoder with KV cache |
| `config.json` | Model configuration |
| `tokenizer.json` | HuggingFace fast tokenizer format |
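A simple pre-flight check for such a directory can be sketched as follows (the file names come from the table above; the `ModelDirCheck` helper itself is hypothetical, not part of the library):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ModelDirCheck {
    // Files MarianTranslator expects, per the table above.
    static final List<String> REQUIRED = List.of(
        "encoder_model.onnx",
        "decoder_model.onnx",
        "decoder_with_past_model.onnx",
        "config.json",
        "tokenizer.json");

    // Returns the names of required files that are absent from modelDir.
    static List<String> missingFiles(Path modelDir) {
        return REQUIRED.stream()
            .filter(name -> !Files.exists(modelDir.resolve(name)))
            .toList();
    }

    public static void main(String[] args) {
        List<String> missing = missingFiles(Path.of("my-model"));
        if (!missing.isEmpty()) {
            System.err.println("Missing model files: " + missing);
        }
    }
}
```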
**MarianMT models require tokenizer conversion**

MarianMT models on HuggingFace ship with SentencePiece files (`source.spm`, `target.spm`) instead of `tokenizer.json`. You must build `tokenizer.json` using the model's `vocab.json` for vocabulary IDs and `source.spm` for BPE merges.
This is important because MarianMT merges the source and target SentencePiece vocabularies into a shared `vocab.json` with ~65K entries. The raw `SentencePieceExtractor` produces SPM-internal IDs (0–31999), which differ from the model's actual IDs, so you must take the vocabulary mapping from `vocab.json` and extract only the BPE merges from the SPM model.
```python
import json

from optimum.exporters.onnx import main_export
from transformers.convert_slow_tokenizer import SentencePieceExtractor
from tokenizers import Tokenizer
from tokenizers.models import BPE

model_id = "Helsinki-NLP/opus-mt-en-ja"

# 1. Export ONNX models
main_export(
    model_name_or_path=model_id,
    output="my-model/",
    task="text2text-generation-with-past",
)

# 2. Build tokenizer.json from vocab.json + source.spm merges
with open("my-model/vocab.json") as f:
    model_vocab = json.load(f)

extractor = SentencePieceExtractor("my-model/source.spm")
_, merges = extractor.extract(None)

tokenizer = Tokenizer(BPE(model_vocab, merges, unk_token="<unk>"))
tokenizer.save("my-model/tokenizer.json")
```
Only standard `opus-mt-*` models are supported. The newer `opus-mt-tc-big-*` variants require target-language prefixes (e.g., `>>por<<`), which `MarianTranslator` does not handle.
Tips¶
- MarianMT models are specialized for a single language pair (e.g., `opus-mt-en-fr` for English→French). They produce higher-quality translations for their specific pair but require a separate model per direction.
- Flan-T5 handles any language pair with a single model, making it more flexible but generally lower quality than a dedicated pair-specific model.
- For bidirectional translation, you need two MarianMT models (e.g., `opus-mt-en-fr` and `opus-mt-fr-en`), or use Flan-T5, which handles both directions.
- Use greedy decoding (the default, `temperature=0`) for translation; sampling adds noise without improving quality.
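For bidirectional setups, one common pattern is to key translators by direction. A minimal routing sketch (the model IDs follow the documented `opus-mt-{src}-{tgt}` naming; the `ModelRouter` class itself is hypothetical, not a library API):

```java
import java.util.Map;

public class ModelRouter {
    // Hypothetical direction-to-model map; IDs follow the opus-mt naming scheme.
    static final Map<String, String> MODELS = Map.of(
        "en->fr", "inference4j/opus-mt-en-fr",
        "fr->en", "inference4j/opus-mt-fr-en");

    // Resolves the model ID for a source/target pair, or fails loudly.
    static String modelFor(String src, String tgt) {
        String id = MODELS.get(src + "->" + tgt);
        if (id == null) {
            throw new IllegalArgumentException("No model for " + src + "->" + tgt);
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(modelFor("en", "fr")); // inference4j/opus-mt-en-fr
    }
}
```

Each resolved ID would then be passed to `MarianTranslator.builder().modelId(...)` as in the example at the top of this page.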