How to convert messed-up .vtt sub file from youtube-dl --write-auto-sub download?


My goal is to download a YouTube video with its auto-generated subtitles in a separate file (.vtt, .srt, etc.).

I am currently trying to achieve this with youtube-dl but I am open to other solutions if needed.

When I run the following command, it downloads the video as .mp4 (which is fine) and a separate .vtt file, but the .vtt seems to be messed up somehow: it displays all the text for the whole clip at once instead of at the specified times.

Command I am running:

youtube-dl --write-auto-sub https://www.youtube.com/watch?v=Roc89oOZOF4&list=PLJBo3iyb1U0eNNN4Dij3N-d0rCJpMyAKQ&index=45

Downloads this .vtt:

WEBVTT
Kind: captions
Language: en

00:00:05.960 --> 00:00:08.290 align:start position:0%
 
thank <00:00:06.003><c>you  </c><00:00:06.046><c>ah </c><00:00:06.089><c>crap  </c><00:00:06.132><c>well </c><00:00:06.175><c>looks </c><00:00:06.218><c>like </c><00:00:06.261><c>the </c><00:00:06.304><c>good </c><00:00:06.347><c>Lord </c><00:00:06.390><c>just </c><00:00:06.433><c>sent  </c><00:00:06.476><c>me </c><00:00:06.519><c>a </c><00:00:06.562><c>conversation </c><00:00:06.605><c>starter </c><00:00:06.648><c>come </c><00:00:06.691><c>here  </c><00:00:06.734><c>Jesse </c><00:00:06.777><c>come </c><00:00:06.820><c>get </c><00:00:06.863><c>the </c><00:00:06.906><c>ball  </c><00:00:06.949><c>hmm</c>

00:00:08.290 --> 00:00:10.549 align:start position:0%
thank you  ah crap  well looks like the good Lord just sent  me a conversation starter come here  Jesse come get the ball  hmm
 

00:00:10.549 --> 00:00:13.070 align:start position:0%
 
 

00:00:13.070 --> 00:00:15.470 align:start position:0%
 
 

00:00:15.470 --> 00:00:23.750 align:start position:0%
 
 

00:00:23.750 --> 00:00:23.760 align:start position:0%
 
 

00:00:23.760 --> 00:00:26.480 align:start position:0%
 



I have read that this may be done on purpose by YouTube.

Even if this is true, is there any way to convert this .vtt to a usable format or simply download correctly-formatted auto-generated subtitles from YouTube?

Python, FFmpeg, or command-line solutions are preferred, but anything is helpful!

Thanks! Any and all assistance is greatly appreciated!


There are 2 best solutions below

0
On BEST ANSWER

To convert the .vtt files to .srt, I use this script:

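#!/bin/bash
# Convert every .vtt file in the current directory to .srt with ffmpeg.
for i in *.vtt; do
  [ -e "$i" ] || continue   # skip the literal pattern when no .vtt files exist
  name="${i%.vtt}"          # strip only the extension, keep other dots in the name
  echo "$name"
  ffmpeg -i "$i" "${name}.srt"
done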
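Since the question lists Python as a preference, the same batch conversion could be sketched in Python. This assumes ffmpeg is on the PATH; the function names srt_path_for and convert_all are made up for this example.

```python
import subprocess
from pathlib import Path


def srt_path_for(vtt_path):
    """Return the .srt path for a .vtt file, replacing only the final suffix."""
    return Path(vtt_path).with_suffix(".srt")


def convert_all(directory="."):
    """Convert every .vtt file in `directory` to .srt by calling ffmpeg."""
    for vtt in sorted(Path(directory).glob("*.vtt")):
        # ffmpeg picks the output subtitle format from the file extension.
        subprocess.run(["ffmpeg", "-i", str(vtt), str(srt_path_for(vtt))], check=True)
```

Calling convert_all() in the download folder converts each file in turn. Because only the last suffix is replaced, a name such as clip.en.vtt becomes clip.en.srt.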

To remove the duplicated lines, I use the script below. It still leaves some problems, so check the output:

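#!/bin/bash
# Remove repeated lines from every .srt file in the current directory
# and write the cleaned copies to ./out.
mkdir -p out
find . -maxdepth 1 -type f -iname "*.srt" -print0 |
while IFS= read -r -d '' f; do
  base=$(basename "$f")
  # awk prints a line only the first time it is seen in the file
  awk '!visited[$0]++' "$f" > "out/${base%.*}.srt"
done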

This removes the first row of a cue when it duplicates the second row of the previous cue (see the first row of cue 11 in the original excerpt below).

Lines are still repeated, but each repeat is now an identical single row (see cues 11 and 12).

10
00:00:19,670 --> 00:00:19,680
there is a free wireless internet signal

11
00:00:19,680 --> 00:00:21,769
there is a free wireless internet signal
all across North America and nobody has

12
00:00:21,769 --> 00:00:21,779
all across North America and nobody has

becomes

10
00:00:19,670 --> 00:00:19,680
there is a free wireless internet signal
11
00:00:19,680 --> 00:00:21,769
all across North America and nobody has
12
00:00:21,769 --> 00:00:21,779
all across North America and nobody has
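The leftover repeats can also be handled by working on whole cues instead of single lines. The following is a rough Python sketch, not a tested tool: it drops any text row that already appeared in the previous cue and discards cues that end up empty. The function names and the SAMPLE string (copied from the excerpt above) are only for illustration.

```python
import re

SAMPLE = """\
10
00:00:19,670 --> 00:00:19,680
there is a free wireless internet signal

11
00:00:19,680 --> 00:00:21,769
there is a free wireless internet signal
all across North America and nobody has

12
00:00:21,769 --> 00:00:21,779
all across North America and nobody has
"""


def parse_srt(text):
    """Split SRT text into (timing, [text rows]) tuples."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Expected layout: index line, timing line, then the text rows.
        if len(lines) >= 2 and "-->" in lines[1]:
            cues.append((lines[1], lines[2:]))
    return cues


def drop_rolling_duplicates(cues):
    """Remove rows already shown by the previous cue; skip cues left empty."""
    cleaned = []
    previous = []  # text rows of the previous cue, before cleaning
    for timing, lines in cues:
        fresh = [ln for ln in lines if ln not in previous]
        if fresh:
            cleaned.append((timing, fresh))
        previous = lines
    return cleaned


def format_srt(cues):
    """Renumber the cues from 1 and join them back into SRT text."""
    blocks = []
    for number, (timing, lines) in enumerate(cues, start=1):
        blocks.append("\n".join([str(number), timing, *lines]))
    return "\n\n".join(blocks) + "\n"


print(format_srt(drop_rolling_duplicates(parse_srt(SAMPLE))))
```

The original timings are kept as they are, so a cue that only lasted 10 ms still flashes by. A speaker who really says the same row twice in a row would also lose the second one, so check the result.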

Edit: someone has already made a tool that fixes the duplicated lines; find it here: https://github.com/bindestriche/srt_fix

2
On

In TTML format, the subtitles work correctly. Try:

yt-dlp --write-auto-subs --sub-format ttml --no-playlist "https://www.youtube.com/watch?v=Roc89oOZOF4&list=PLJBo3iyb1U0eNNN4Dij3N-d0rCJpMyAKQ&index=45"
  • yt-dlp is a youtube-dl fork
  • To convert ttml to srt or vtt, you have to add --convert-subs srt or --convert-subs vtt
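
If you want to run this from Python, one option is to call the same command through subprocess. This is a minimal sketch that assumes yt-dlp is installed and on the PATH; it uses only the flags shown above, and the function names are invented for the example.

```python
import subprocess


def build_ytdlp_command(url, sub_format="ttml", convert_to=None):
    """Build the yt-dlp argument list from the command shown above."""
    command = [
        "yt-dlp",
        "--write-auto-subs",
        "--sub-format", sub_format,
        "--no-playlist",
    ]
    if convert_to is not None:
        # Optional conversion step, e.g. "srt" or "vtt".
        command += ["--convert-subs", convert_to]
    command.append(url)
    return command


def download_auto_subs(url, convert_to="srt"):
    """Run yt-dlp; raises CalledProcessError if it exits with an error."""
    # The arguments are passed as a list with no shell involved, so the
    # '&' characters in the URL are not interpreted by a shell.
    subprocess.run(build_ytdlp_command(url, convert_to=convert_to), check=True)
```

For example, download_auto_subs("https://www.youtube.com/watch?v=Roc89oOZOF4") should fetch the video together with converted .srt subtitles.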