Video to Text

Extracts the audio and runs OpenAI Whisper in your browser to transcribe the speech, with timestamps.

What this does

Video to Text extracts the audio from your video, then runs OpenAI's Whisper (base) model on it through Transformers.js — all in your browser, with no upload.

The first time you click Transcribe, the model (~80MB) downloads from the Hugging Face CDN and is cached, so later runs skip the download. Transcription runs in a background worker, so the page stays responsive. Links to the model and libraries are below.
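The download-once-then-cache behaviour described above can be sketched as a cache-first loader. This is only an illustration of the pattern, not how Transformers.js or this page implements it: `ByteStore`, `MemoryStore`, `Downloader`, and `loadCached` are hypothetical names, and the in-memory map stands in for whatever persistent browser storage is actually used.

```typescript
// Minimal async key-value store (hypothetical). In a browser this role would be
// played by persistent storage; an in-memory Map keeps the sketch self-contained.
interface ByteStore {
  get(key: string): Promise<Uint8Array | undefined>;
  set(key: string, bytes: Uint8Array): Promise<void>;
}

class MemoryStore implements ByteStore {
  private readonly entries = new Map<string, Uint8Array>();

  async get(key: string): Promise<Uint8Array | undefined> {
    return this.entries.get(key);
  }

  async set(key: string, bytes: Uint8Array): Promise<void> {
    this.entries.set(key, bytes);
  }
}

// Any function that fetches the bytes behind a URL (hypothetical type).
type Downloader = (url: string) => Promise<Uint8Array>;

// Return the stored bytes if present; otherwise download once and store them.
async function loadCached(
  url: string,
  store: ByteStore,
  download: Downloader,
): Promise<Uint8Array> {
  const cached = await store.get(url);
  if (cached !== undefined) {
    return cached; // later runs skip the download
  }
  const bytes = await download(url);
  await store.set(url, bytes);
  return bytes;
}
```

Passing the store and the downloader in as parameters keeps the sketch independent of any particular storage or network API, so the first-run and later-run paths can be exercised with stand-ins.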

How it works

  1. Drop a video (MP4, MOV, WEBM, MKV…).
  2. Click Transcribe. The audio is extracted and, on first run, the Whisper model downloads once (~80MB) and caches.
  3. Read the transcript, then copy it or download it as .txt or .srt subtitles.
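The .srt export in the last step can be sketched as follows, assuming the transcript is available as a list of segments with start and end times in seconds. The `Segment` shape and the `toSrtTime` and `toSrt` helpers are hypothetical names for illustration, not the page's actual code.

```typescript
// One timestamped segment: start and end in seconds, plus the spoken text.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to at least `width` digits.
function pad(value: number, width: number): string {
  let out = String(value);
  while (out.length < width) {
    out = "0" + out;
  }
  return out;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm (comma before milliseconds).
function toSrtTime(seconds: number): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Build SRT text: a 1-based cue number, a time range line, the cue text,
// and a blank line between cues.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => {
      const range = `${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}`;
      return `${i + 1}\n${range}\n${seg.text.trim()}\n`;
    })
    .join("\n");
}
```

For example, `toSrt([{ start: 0, end: 2.5, text: "Hello there." }])` produces a single cue numbered 1 with the range `00:00:00,000 --> 00:00:02,500`.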

Built with open source

  • Transformers.js: Hugging Face's library for running ONNX machine-learning models in the browser, on WebGPU or WebAssembly. The model weights download from the Hugging Face CDN on first use and are cached. · Apache-2.0
  • Whisper base (OpenAI): OpenAI's multilingual speech-recognition model, transcribing audio to text. · MIT
  • Mediabunny: Converts and edits video and audio in the browser via WebCodecs. Add-on encoders cover MP3, AAC, and FLAC. · MPL-2.0

Frequently asked questions