Speech-to-Text¶

Transcribe audio files using Wav2Vec2, a non-autoregressive CTC model that converts speech to text in a single forward pass.

Quick example¶

try (var recognizer = Wav2Vec2Recognizer.builder().build()) {
    System.out.println(recognizer.transcribe(Path.of("audio.wav")).text());
}

Full example¶

import io.github.inference4j.audio.Wav2Vec2Recognizer;
import io.github.inference4j.audio.Transcription;
import java.nio.file.Path;

public class SpeechToText {
    public static void main(String[] args) {
        try (var recognizer = Wav2Vec2Recognizer.builder().build()) {
            Transcription result = recognizer.transcribe(Path.of("speech.wav"));
            System.out.println(result.text());
        }
    }
}

From raw audio data¶

If you already have audio samples as a float array:

try (var recognizer = Wav2Vec2Recognizer.builder().build()) {
    float[] audioData = loadAudioSamples(); // your audio loading logic
    Transcription result = recognizer.transcribe(audioData, 16000);
    System.out.println(result.text());
}

Builder options¶

Method	Type	Default	Description
`.modelId(String)`	`String`	`inference4j/wav2vec2-base-960h`	HuggingFace model ID
`.modelSource(ModelSource)`	`ModelSource`	`HuggingFaceModelSource`	Model resolution strategy
`.sessionOptions(SessionConfigurer)`	`SessionConfigurer`	default	ONNX Runtime session config
`.vocabulary(Vocabulary)`	`Vocabulary`	auto-loaded from `vocab.json`	CTC vocabulary
`.inputName(String)`	`String`	auto-detected	Input tensor name
`.sampleRate(int)`	`int`	`16000`	Target sample rate (Hz)
`.blankIndex(int)`	`int`	`0`	CTC blank token index
`.wordDelimiter(String)`	`String`	`"\\|"`	Word separator token in vocabulary

Result type¶

Transcription is a record with:

Field	Type	Description
`text()`	`String`	The transcribed text
`segments()`	`List<Segment>`	Timed segments (empty for Wav2Vec2)

Audio requirements¶

Format: WAV files (loaded automatically from Path)
Sample rate: Audio is automatically resampled to the model's target rate (16kHz by default)
Channels: Mono (stereo is downmixed automatically)
Duration: No hard limit, but very long files will use more memory

How it works¶

Wav2Vec2 is a non-autoregressive model — it processes the entire audio waveform in a single forward pass and produces character-level predictions using CTC (Connectionist Temporal Classification) decoding.

The pipeline:

Load and normalize audio from WAV file
Resample to 16kHz if needed
Run a single forward pass through the model
Apply CTC greedy decoding to convert logits to characters
Join characters into words using the word delimiter

Tips¶

The default model (wav2vec2-base-960h) is trained on English LibriSpeech data. For other languages, use an appropriate fine-tuned model.
Wav2Vec2 works best with clean speech. For noisy audio, consider preprocessing with VAD to extract speech segments first — see Voice Activity Detection.
This is a CTC model (single-pass), not an autoregressive model. It's fast but may be less accurate on complex audio. For multilingual support or translation, see Whisper.