Speech-to-Text

AI Models

Transcribe audio locally with no API costs

What it can do

Local transcription: converts speech to text entirely offline, with no API key required

Multiple model sizes: tiny (fastest) → base → small → medium → large (most accurate)

Output formats: plain text, SRT subtitles, VTT captions, or JSON with timestamps

Translation mode: translates speech in any supported language directly into English text

Wide format support: WAV, MP3, M4A, FLAC, OGG, and more

Automatic model caching: downloads each model on first use and runs fully offline afterward
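The listing does not say which engine does the transcription. The model names match those of OpenAI's open-source Whisper, so the sketch below assumes the `openai-whisper` Python package (with ffmpeg installed) purely for illustration. The helper names `build_options` and `transcribe_file` are hypothetical, and this is not necessarily how the skill itself is implemented.

```python
from typing import Any, Dict

# Model names taken from the feature list above.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large")


def build_options(translate: bool = False, word_timestamps: bool = False) -> Dict[str, Any]:
    """Map the listed features onto keyword arguments for openai-whisper's transcribe()."""
    return {
        # "translate" yields English text; "transcribe" keeps the spoken language.
        "task": "translate" if translate else "transcribe",
        # When True, each segment also carries a per-word timing list.
        "word_timestamps": word_timestamps,
    }


def transcribe_file(path: str, model_size: str = "small", **features: bool) -> Dict[str, Any]:
    """Transcribe one audio file locally (assumption: openai-whisper is installed)."""
    if model_size not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {model_size!r}")

    # Imported lazily so the helpers above can be used without the package.
    import whisper

    # Downloads the weights on first use, then loads them from the local cache.
    model = whisper.load_model(model_size)
    return model.transcribe(path, **build_options(**features))
```

Under these assumptions, `transcribe_file("podcast.mp3", "medium")["text"]` would return the plain-text transcript, and passing `translate=True` would request English output instead.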

Example prompts

"Transcribe this podcast.mp3 using the medium model"

"Convert this interview to SRT subtitles"

"Transcribe my voice memo and translate it to English"

"Generate VTT captions for this video's audio track"

"Use the large model for this important lecture recording"

"Get JSON output with word-level timestamps"

Tips

tiny = fast but rough, small = good balance, medium = professional quality, large = maximum accuracy

The first run downloads the model (40 MB to 3 GB, depending on size); after that, everything runs offline

SRT/VTT formats include timestamps for subtitle syncing

Translation mode outputs English regardless of input language

JSON output includes segment-level and word-level timing data

Works completely offline after the initial model download, so your audio stays on your machine
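To make the difference between the timed output formats concrete, here is a minimal sketch that renders the same segments as SRT, VTT, and JSON. The segment shape (`start` and `end` in seconds, `text`, and an optional `words` list) and the function names are assumptions for illustration; the skill's actual JSON schema is not documented here.

```python
import json
from typing import Any, Dict, List

# Hypothetical segment shape: {"start": float, "end": float, "text": str, "words": [...]}
Segment = Dict[str, Any]


def timestamp(seconds: float, sep: str) -> str:
    """Format seconds as HH:MM:SS<sep>mmm (SRT uses ',', VTT uses '.')."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def to_srt(segments: List[Segment]) -> str:
    """SRT: numbered cues separated by blank lines, comma before the milliseconds."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = timestamp(seg["start"], ",")
        end = timestamp(seg["end"], ",")
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def to_vtt(segments: List[Segment]) -> str:
    """VTT: a WEBVTT header, unnumbered cues, dot before the milliseconds."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        start = timestamp(seg["start"], ".")
        end = timestamp(seg["end"], ".")
        lines += [f"{start} --> {end}", seg["text"].strip(), ""]
    return "\n".join(lines)


def to_json(segments: List[Segment]) -> str:
    """JSON: keeps the raw timing data, including any per-word entries."""
    return json.dumps({"segments": segments}, ensure_ascii=False, indent=2)
```

SRT and VTT carry only cue-level timing for subtitle syncing, while the JSON form preserves whatever timing detail the segments hold, which makes it the better fit when word-level data is needed.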