
Thanks for creating this. I can imagine a not-so-distant future where thousands of random video-watchers could annotate tiny parts of videos via some free-form box, and aeneas could clean up and formalize this into an official transcription. Seems like a minor feature, until one realizes how much the public just lost due to missing transcriptions: https://www.washingtonpost.com/local/education/why-uc-berkel...


Thank you.

Indeed, while aeneas was created for ebook-audiobook synchronization, several of its current users are producing closed captions --- because, in most cases, they already have a clean transcript (e.g., speakers provide transcripts to the captioner) or they clean up an automated transcript derived from an automatic speech recognition (ASR) system.


To elaborate a bit further, since closed captioning applications are indeed very important, serving users from hearing-impaired people to dyslexic readers to second-language learners.

Let's think about how a human operator would create captions for a video.

If the transcript is not available, the human will roughly transcribe the video (speech to text / speech recognition), and, if experienced, will also segment it into closed captions at the same time (segmentation). Note that the segmentation usually needs to follow certain constraints, like a maximum number of characters per second (otherwise the CCs are too long/fast to read), and it might also condense the words actually spoken into less verbose text. On top of this, there are special cases, like marking dramatic pauses or laughter, or describing on-stage events. A human using a CC tool would also get the time alignment basically for free, since they would write the CCs while watching the video, pausing it to write each CC text, and so on.
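To make those constraints concrete, here is a minimal sketch of a caption checker. The 42-characters-per-line and 17-characters-per-second limits are assumed guideline values commonly seen in subtitling style guides, not something prescribed by aeneas or any particular tool:

```python
# Sketch of the segmentation constraints mentioned above: a caption is
# flagged if any line is too long or if it would scroll by too fast to read.
MAX_CHARS_PER_LINE = 42      # assumed guideline value
MAX_CHARS_PER_SECOND = 17.0  # assumed guideline value

def check_caption(lines, start, end):
    """Return a list of constraint violations for one caption.

    lines -- caption text, one string per on-screen line
    start, end -- display interval in seconds
    """
    problems = []
    for line in lines:
        if len(line) > MAX_CHARS_PER_LINE:
            problems.append("line too long: %r" % line)
    duration = end - start
    chars = sum(len(line) for line in lines)
    if duration <= 0:
        problems.append("non-positive duration")
    elif chars / duration > MAX_CHARS_PER_SECOND:
        problems.append("reading speed %.1f cps exceeds %.1f"
                        % (chars / duration, MAX_CHARS_PER_SECOND))
    return problems
```

A caption like ["Hello there"] shown for two seconds passes; a 60-character line crammed into one second is flagged both as too long and as too fast to read.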

If the transcript is available, it needs to be segmented into CCs (with the same issues described above), but once that is done, a forced aligner like aeneas can be used to get the timing automatically. This is the typical scenario for aeneas users interested in CC production.
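The aligner's output typically ends up as timed subtitle cues, e.g., in SRT form. As a minimal, self-contained sketch of what one timed cue looks like once the timings are known (plain Python, independent of any particular aligner):

```python
# Format one SRT cue from a caption's index, time interval, and text.
# SRT timestamps use the form HH:MM:SS,mmm (comma before the milliseconds).
def srt_cue(index, start, end, text):
    def ts(seconds):
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3600000)
        m, ms = divmod(ms, 60000)
        s, ms = divmod(ms, 1000)
        return "%02d:%02d:%02d,%03d" % (h, m, s, ms)
    return "%d\n%s --> %s\n%s\n" % (index, ts(start), ts(end), text)
```

For example, srt_cue(1, 0.0, 2.5, "Hello") produces the cue "1", "00:00:00,000 --> 00:00:02,500", "Hello" on successive lines.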

Now, let's think about how machines can produce CCs.

If you use speech recognition --- like the auto CC on YouTube --- you can get the transcript automatically (usually with transcription errors, especially for languages with less training data), with the timings as well. Segmentation is also performed automatically, in a greedy fashion driven by the audio signal, but it is usually far inferior to the segmentation produced by an expert captioner. The advantage is that the entire workflow is automated.

However, if some manual labor can be applied, perhaps the best flow is the following: use an ASR system to get a rough transcript (e.g., download the auto-CC from YouTube or run your ASR of choice), manually clean it, segment it into CCs [1], and then use a forced aligner like aeneas to get the timings. This flow is available, e.g., in the aeneas Web application at [2], and users say it is faster than writing the CCs from scratch. I would say it strongly depends on whether the ASR phase produces a decent transcript or not.
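As a sketch of the final alignment step in that flow: assuming aeneas is installed, and using placeholder file names (`audio.mp3` for the media, `captions.txt` for the pre-segmented caption text, one caption per blank-line-separated group), the command-line tool can be invoked roughly like this:

```shell
# Align a pre-segmented caption file against the audio and emit SRT.
# File names are placeholders; task_language uses ISO 639-3 codes.
python -m aeneas.tools.execute_task \
    audio.mp3 \
    captions.txt \
    "task_language=eng|is_text_type=subtitles|os_task_file_format=srt" \
    captions.srt
```

The resulting captions.srt keeps the caption segmentation you chose, with start/end timings filled in by the aligner.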

[1] Actually, I am working on an ML-based NLP library to automate the segmentation (i.e., going from a raw transcript to a sequence of CCs respecting the constraints described above).

[2] https://aeneasweb.org
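A naive baseline for the segmentation problem mentioned in [1] is a greedy word-packing pass. A minimal sketch follows; the 42-character limit is an assumed guideline value, and a real segmenter (such as the ML-based library mentioned above) would also weigh punctuation, syntax, and reading speed:

```python
# Naive greedy baseline: pack words into caption lines without exceeding
# a maximum line length. Illustration only, not the ML-based approach.
def greedy_segment(transcript, max_chars=42):
    lines, current = [], ""
    for word in transcript.split():
        candidate = (current + " " + word).strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                lines.append(current)
            # note: a single word longer than max_chars gets its own line
            current = word
    if current:
        lines.append(current)
    return lines
```

Every output line respects the limit (except for single over-long words), and joining the lines back with spaces reproduces the original transcript.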




