ElevenLabs Alignment isn't transcription. It's the timing data ElevenLabs can return alongside generated audio at no extra cost: character-level timestamps that roll up directly into per-word timing. If you're already generating speech through ElevenLabs, you get accurate word-level caption timing as a side effect of generation.
For TTS-driven workflows (which describes most VideoCue use cases) this is the right answer: no additional cost, no separate inference call, and the timing is precise because the synthesis engine knows exactly when each phoneme fires.
The catch: it doesn't apply to audio ElevenLabs didn't generate, such as recordings or output from another TTS engine.
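To make the timing data concrete, here is a minimal sketch of turning an alignment payload into word-level caption cues. It assumes the alignment arrives as parallel arrays named `characters`, `character_start_times_seconds`, and `character_end_times_seconds`. That is how the timestamped TTS response appears to be shaped, but check the field names against the current API reference. `WordCue`, `words_from_alignment`, and the sample values are hypothetical, made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WordCue:
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def words_from_alignment(alignment: Dict[str, list]) -> List[WordCue]:
    """Group per-character timings into per-word cues, splitting on whitespace.

    The key names below are an assumption about the response shape.
    """
    chars = alignment["characters"]
    starts = alignment["character_start_times_seconds"]
    ends = alignment["character_end_times_seconds"]

    cues: List[WordCue] = []
    current = ""
    word_start = 0.0
    word_end = 0.0
    for ch, start, end in zip(chars, starts, ends):
        if ch.isspace():
            # Whitespace closes the word in progress, if there is one.
            if current:
                cues.append(WordCue(current, word_start, word_end))
                current = ""
            continue
        if not current:
            word_start = start  # first character of a new word
        current += ch
        word_end = end  # extended with every character in the word
    if current:
        cues.append(WordCue(current, word_start, word_end))
    return cues


# Invented sample data, not real API output.
sample = {
    "characters": list("Hi there"),
    "character_start_times_seconds": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    "character_end_times_seconds": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
}

cues = words_from_alignment(sample)
for cue in cues:
    print(f"{cue.start:.1f}-{cue.end:.1f} {cue.text}")
```

Punctuation stays attached to the word it follows, which is usually what captions want. A real pipeline would likely also merge cues into caption lines by length or duration.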
Verdict: best for TTS-driven captioning. 8.6/10.