Many Korean videos on YouTube have hard-coded subtitles (e.g. https://youtu.be/Zyd6hAvxTnc).
The desired end result would be the OCR'd subtitles in text format.
I have a semi-manual process: download the video with yt-dlp, extract frames with ffmpeg (e.g. one per second), bulk-crop them to fixed dimensions with ImageMagick (hoping the subtitles don't go multi-line...), OCR them with Tesseract (with mixed results - PowerToys' Text Extractor seems much better, but it's very manual), then remove duplicates.
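For context, the final "removing duplicates" step is roughly this: consecutive frames of the same subtitle produce near-identical OCR output, so I collapse runs of similar lines. A minimal sketch using only the stdlib (the 0.9 similarity threshold is a guess to tune, since Tesseract's output jitters between frames):

```python
from difflib import SequenceMatcher

def dedupe_ocr_lines(lines, threshold=0.9):
    """Drop consecutive OCR results that are near-identical.

    `threshold` (0..1) is how similar two readings of the same
    on-screen subtitle are assumed to be; tune it for your material.
    """
    kept = []
    for line in lines:
        if kept and SequenceMatcher(None, kept[-1], line).ratio() >= threshold:
            continue  # same subtitle still on screen, skip it
        kept.append(line)
    return kept

# Example: three frames of one subtitle (with an OCR glitch), then a new one
frames = ["안녕하세요", "안녕하세요", "안녕하세요.", "다음 자막"]
print(dedupe_ocr_lines(frames))  # → ['안녕하세요', '다음 자막']
```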
It's not a great solution.
I've tried using OpenCV but without success.
Does anyone know of either:
a) a tool that does this automatically
b) a better way to automate this process (ideally as a single Python script, ideally with automatic detection of the subtitle region rather than a fixed crop)?
Thanks!
A possible workflow might be:

- ffmpeg to convert the MP4 video to static images, e.g. `ffmpeg -i myvideo.mp4 -start_number 1 -vf fps=1 frame-%04d.png`
- easyocr to recognize the text in each frame. On Apple silicon Macs you might want to take advantage of Apple's hardware-accelerated AI chips for Live Text detection via hooks such as macOCR.
- assembling the recognized text into subtitles (the `srt` module might help with this).

There is subtitles_extract, which attempts to put it all together (a mixture of bash and Python), but it's apparently a one-off feat and needs extra tweaking in order to work; still, it gives a rough idea of how this can be done.
Note: you might also want to crop the video with ffmpeg first, so that you only work with the area that contains subtitles and avoid recognizing other text in the video such as street signs, credits etc.
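The cropping can be folded into the frame extraction in one pass with ffmpeg's `crop=w:h:x:y` filter. A sketch that builds the command from Python (the crop geometry here is a made-up example; measure the subtitle band in your own video):

```python
import subprocess

def frame_extract_cmd(video, w, h, x, y, fps=1):
    """Build an ffmpeg command that crops to the subtitle band
    (crop=w:h:x:y, in pixels from the top-left corner) and then
    emits `fps` frames per second as numbered PNGs."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"crop={w}:{h}:{x}:{y},fps={fps}",
        "-start_number", "1",
        "frame-%04d.png",
    ]

# Hypothetical geometry: a 1280x160 strip near the bottom of a 720p video
cmd = frame_extract_cmd("myvideo.mp4", 1280, 160, 0, 560)
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually run ffmpeg
```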