Whisper Speech-to-Text

Work in Progress

WhisperSpeechModel is implemented but the pre-exported model artifact (inference4j/whisper-small-genai) is not yet available on HuggingFace. The onnxruntime Whisper converter has compatibility issues with current Python package versions. This page documents the target API.

Transcribe and translate speech using OpenAI's Whisper models, with automatic chunking for long audio.

See the overview for background on how autoregressive generation differs from single-pass inference.

Quick example

try (var whisper = WhisperSpeechModel.builder()
        .modelId("inference4j/whisper-small-genai")
        .build()) {
    System.out.println(whisper.transcribe(Path.of("meeting.wav")).text());
}

Full example

import io.github.inference4j.genai.audio.WhisperSpeechModel;
import io.github.inference4j.audio.Transcription;
import java.nio.file.Path;

public class WhisperTranscription {
    public static void main(String[] args) throws Exception {
        try (var whisper = WhisperSpeechModel.builder()
                .modelId("inference4j/whisper-small-genai")
                .build()) {

            Transcription result = whisper.transcribe(Path.of("meeting.wav"));
            System.out.println(result.text());
        }
    }
}

Translation

Whisper can translate speech from any supported language into English:

try (var whisper = WhisperSpeechModel.builder()
        .modelId("inference4j/whisper-small-genai")
        .language("fr")
        .task(WhisperTask.TRANSLATE)
        .build()) {
    Transcription result = whisper.transcribe(Path.of("french-audio.wav"));
    System.out.println(result.text());  // English translation
}
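To keep the output in the source language instead of translating it, set the language and leave the task at its TRANSCRIBE default. A minimal sketch using the builder options documented below (the file name is illustrative):

try (var whisper = WhisperSpeechModel.builder()
        .modelId("inference4j/whisper-small-genai")
        .language("fr")                     // source language is also the output language
        .task(WhisperTask.TRANSCRIBE)       // default task, shown explicitly for contrast with TRANSLATE
        .build()) {
    Transcription result = whisper.transcribe(Path.of("french-audio.wav"));
    System.out.println(result.text());  // French transcription
}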

From raw audio data

If you already have audio samples as a float array:

try (var whisper = WhisperSpeechModel.builder()
        .modelId("inference4j/whisper-small-genai")
        .build()) {
    float[] samples = loadAudioSamples();
    Transcription result = whisper.transcribe(samples, 16000);
    System.out.println(result.text());
}
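loadAudioSamples() above is a placeholder. One possible implementation, assuming a 16 kHz, 16-bit little-endian PCM mono WAV file and using only the JDK's javax.sound.sampled API, might look like this:

import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Path;

// Assumes a 16 kHz, 16-bit little-endian PCM mono WAV; resample separately if the rate differs.
static float[] loadAudioSamples(Path wavFile) throws Exception {
    try (AudioInputStream in = AudioSystem.getAudioInputStream(wavFile.toFile())) {
        byte[] bytes = in.readAllBytes();
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] samples = new float[bytes.length / 2];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = buf.getShort(i * 2) / 32768f;  // scale 16-bit PCM to [-1, 1]
        }
        return samples;
    }
}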

Builder options

| Method | Type | Default | Description |
|---|---|---|---|
| .modelId(String) | String | (required) | HuggingFace model ID |
| .modelSource(ModelSource) | ModelSource | HuggingFaceModelSource | Model resolution strategy |
| .language(String) | String | "en" | Source language code (e.g., "fr", "de", "ja") |
| .task(WhisperTask) | WhisperTask | TRANSCRIBE | TRANSCRIBE or TRANSLATE (to English) |
| .maxLength(int) | int | 448 | Maximum number of tokens to generate per chunk |
| .temperature(double) | double | 0.0 | Sampling temperature (0 = greedy) |
| .topK(int) | int | 0 (disabled) | Top-K sampling |
| .topP(double) | double | 0.0 (disabled) | Nucleus sampling |
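For example, a configuration that caps per-chunk output and enables mild sampling (rarely needed for plain transcription) could look like the sketch below; every option shown comes from the table above, and the file name is illustrative:

try (var whisper = WhisperSpeechModel.builder()
        .modelId("inference4j/whisper-small-genai")
        .language("de")          // German source audio
        .maxLength(448)          // cap on generated tokens per 30-second chunk
        .temperature(0.2)        // small amount of sampling randomness
        .topK(5)                 // restrict sampling to the 5 most likely tokens
        .build()) {
    System.out.println(whisper.transcribe(Path.of("interview.wav")).text());
}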

Result type

Transcription is a record with:

| Field | Type | Description |
|---|---|---|
| text() | String | The transcribed or translated text |
| segments() | List<Segment> | Timed segments (when available) |

Each Segment contains:

| Field | Type | Description |
|---|---|---|
| text() | String | Segment text |
| startTime() | float | Start time in seconds |
| endTime() | float | End time in seconds |
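For instance, given an open WhisperSpeechModel named whisper, a subtitle-style printout of the segments (when the model returns them) might look like this; the formatting is illustrative:

Transcription result = whisper.transcribe(Path.of("meeting.wav"));
for (var segment : result.segments()) {
    // Print each timed segment as "[start - end] text"
    System.out.printf("[%.1fs - %.1fs] %s%n",
            segment.startTime(), segment.endTime(), segment.text());
}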

Auto-chunking

Whisper processes audio in 30-second windows. For audio longer than 30 seconds, WhisperSpeechModel automatically:

  1. Splits the audio into 30-second chunks
  2. Transcribes each chunk independently
  3. Concatenates the results

This happens transparently: just call transcribe() with audio of any length.
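Conceptually, the chunking loop looks like the following sketch. The helper names (transcribeChunk) and variables are hypothetical; the real logic lives inside WhisperSpeechModel and is not part of the public API:

// Hypothetical sketch of the auto-chunking behaviour, not the library's actual code.
int chunkSamples = 30 * sampleRate;                        // 30-second window
StringBuilder text = new StringBuilder();
for (int start = 0; start < samples.length; start += chunkSamples) {
    float[] chunk = java.util.Arrays.copyOfRange(
            samples, start, Math.min(start + chunkSamples, samples.length));
    text.append(transcribeChunk(chunk, sampleRate));       // hypothetical per-chunk decode
    text.append(' ');                                      // concatenate chunk results
}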

How it works

Unlike Wav2Vec2 (single-pass CTC), Whisper is an autoregressive encoder-decoder model. The audio is converted to a mel spectrogram and passed through the encoder; the decoder then generates text tokens one at a time.

flowchart LR
    A["Audio file"] --> B["onnxruntime-genai<br>mel spectrogram → encoder → decoder loop"]
    B --> C["Transcription"]

All heavy lifting — mel spectrogram computation, encoder forward pass, autoregressive decoding, KV cache, beam search — is handled natively by onnxruntime-genai's C++ layer.

Whisper vs Wav2Vec2

| | Wav2Vec2 | Whisper |
|---|---|---|
| Architecture | CTC (single-pass) | Encoder-decoder (autoregressive) |
| Speed | Fast | Slower (token-by-token) |
| Accuracy | Good on clean speech | Better on diverse audio |
| Languages | English (default model) | 99 languages |
| Translation | No | Yes (to English) |
| Module | inference4j-core | inference4j-genai |

Tips

  • Use WhisperTask.TRANSLATE to translate any language to English in a single step.
  • Smaller models (tiny, base) are faster but less accurate. The small model is a good balance.
  • temperature(0.0) (default) gives deterministic, greedy decoding — best for transcription accuracy.
  • Reuse WhisperSpeechModel instances — each one holds the model in memory.
  • For short, clean English audio where speed matters, Wav2Vec2 may be a better fit.