WhisperX is the open-source captioning stack we recommend most. It builds on OpenAI's Whisper large-v3 and adds forced alignment (for precise word-level timestamps), VAD-based chunking, and speaker diarization, which together produce broadcast-grade caption tracks.
Self-hostable on a single GPU. Free.
Word error rate is at the frontier for English and very strong across the top 30 languages. The output, a transcript plus timing data, drops directly into VideoCue's caption renderer.
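To make "ASR + timing data" concrete, here is a minimal sketch of turning timed segments into a caption file. It assumes each segment is a dict with `start` and `end` in seconds, a `text` string, and an optional `speaker` label, which is roughly the shape WhisperX returns after alignment and diarization. The helper names `format_timestamp` and `segments_to_srt` are our own, and SRT is used only as a familiar stand-in; we are not describing the input format VideoCue's renderer actually expects.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render timed segments as SRT text.

    Assumed segment shape (hypothetical, modeled on aligned ASR output):
    {"start": float, "end": float, "text": str, "speaker": str (optional)}
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        speaker = seg.get("speaker")
        if speaker:
            # Prefix the diarization label when one is present.
            text = f"[{speaker}] {text}"
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive caption blocks.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 1.25, "text": " Hello there.", "speaker": "SPEAKER_00"},
        {"start": 1.5, "end": 3.0, "text": " Hi, welcome back."},
    ]
    print(segments_to_srt(demo))
```

Word-level timestamps, where available, would allow finer-grained cues such as karaoke-style highlighting; this sketch stays at segment level to keep the mapping easy to follow.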
Verdict: the open ASR champion. 9.0/10.