I want to segment a video transcript into chapters based on the content of each line of speech. The transcript would be used to generate a series of start and end timestamps for each chapter. This is similar to how YouTube now "auto-chapters" videos.
Example .srt transcript:
...
70
00:02:53,640 --> 00:02:54,760
All right, coming in at number five,
71
00:02:54,760 --> 00:02:57,640
we have another habit that saves me around 15 minutes a day
...
I have had minimal luck doing this with ChatGPT, as it struggles both to segment by topic and to recall start and end timestamps accurately. I am now exploring whether there are other options for doing this.
I know topic modeling based on time series is possible with some Python libraries. I have also read about TextTiling as another option. What options are there for achieving an outcome like this?
Note: The format above (.srt) is not necessary. It's just the idea that the input is a list of text-content with start and end timestamps.
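To make the text-tiling idea concrete, here is a rough sketch that works on a plain list of (start, end, text) tuples. It is a simplified, TextTiling-inspired heuristic and not the published algorithm or any library's implementation: it compares the vocabulary on either side of each gap between lines and cuts where the overlap bottoms out. The function names, the window size, the threshold, and the crude "words longer than 3 letters" filter are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def _bag(texts):
    """Bag of words for a group of lines; short words are dropped as a crude stopword filter."""
    words = re.findall(r"[a-z']+", " ".join(texts).lower())
    return Counter(w for w in words if len(w) > 3)


def _cosine(a, b):
    """Cosine similarity between two word-count bags (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def find_boundaries(cues, window=2, threshold=0.2):
    """Return indices i where a new chapter starts at cues[i].

    A gap becomes a boundary when the similarity between the `window` lines
    before it and the `window` lines after it is below `threshold` and is a
    local minimum compared with the neighbouring gaps.
    """
    gaps = range(1, len(cues))
    sims = {
        i: _cosine(
            _bag(c[2] for c in cues[max(0, i - window):i]),
            _bag(c[2] for c in cues[i:i + window]),
        )
        for i in gaps
    }
    inf = float("inf")
    return [
        i for i in gaps
        if sims[i] < threshold
        and sims[i] < sims.get(i - 1, inf)
        and sims[i] <= sims.get(i + 1, inf)
    ]


def chapters(cues, **kwargs):
    """Return (start, end) timestamp pairs, one per detected chapter."""
    cuts = [0, *find_boundaries(cues, **kwargs), len(cues)]
    return [(cues[a][0], cues[b - 1][1]) for a, b in zip(cuts, cuts[1:]) if a < b]
```

Calling chapters(cues) on the parsed transcript returns one (start, end) pair per chapter, with the timestamps copied straight from the input lines rather than regenerated. On real transcripts the window and threshold would need tuning, and sentence embeddings could replace the word-count bags.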
First you might need to install these
Pull a YouTube video down and extract the audio
And if you would like to play it to make sure the audio is correct in Jupyter:
Transcribe the audio to text
Note: The output of ASR is definitely not perfect, but it is very much usable as a first draft that requires some manual post-edits. Please DO NOT submit the subs directly to your favorite fansub forum; they will surely fail the moderators' QC.
But how about the timestamps?
Then we need a little more sophistication than crudely pulling the audio and then doing Automatic Speech Recognition (ASR).
Here's a good article: https://blog.searce.com/generate-srt-file-subtitles-using-google-clouds-speech-to-text-api-402b2f1da3bd
And also we are blessed in today's age of AI... Tada: https://github.com/linto-ai/whisper-timestamped
Then in code:
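A minimal sketch, following the usage shown in the whisper-timestamped README (load_audio, load_model, transcribe). The wrapper function names, the default model size, and the device are assumptions chosen for illustration, and the package has to be installed for the call to work.

```python
import json


def transcribe_with_timestamps(audio_path, model_name="tiny", device="cpu", language=None):
    """Return the whisper-timestamped result dict for one audio file."""
    # Imported lazily so this snippet can be loaded even where the package is missing.
    import whisper_timestamped as whisper

    audio = whisper.load_audio(audio_path)
    model = whisper.load_model(model_name, device=device)
    # language=None leaves language detection to the model.
    return whisper.transcribe(model, audio, language=language)


def dump_result(result):
    """Pretty-print the result as JSON, keeping non-ASCII text readable."""
    return json.dumps(result, indent=2, ensure_ascii=False)
```

For example, print(dump_result(transcribe_with_timestamps("audio.wav"))), where "audio.wav" is a placeholder path. The result should contain a "segments" list whose entries carry "start", "end" (in seconds) and "text", along with per-word timings.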
You've gone this far and got some fine-grained JSON with timestamps. How about some code to convert the JSON to an .srt file?
Wait a minute! You've hard-coded the hours and minutes... What do I do for longer videos?
I'm sure that with the JSON output from whisper-timestamped you can easily figure out the conversion. Hint: from datetime import timedelta; str(timedelta(seconds=float(start)))
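If you'd rather not work it out yourself, the hint can be stretched into something like the sketch below. It assumes each segment is a dict with "start" and "end" in seconds plus a "text" field, which is how the whisper-timestamped output is laid out; the function names are made up for this example. Note that str(timedelta(...)) alone gives something like 0:02:53.640000, so the hours, minutes, seconds and milliseconds are split out explicitly to get the HH:MM:SS,mmm shape that .srt uses.

```python
from datetime import timedelta


def srt_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS,mmm with no hard-coded hours or minutes."""
    # Whole milliseconds, via timedelta as in the hint above.
    total_ms = timedelta(seconds=float(seconds)) // timedelta(milliseconds=1)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Turn a list of {'start', 'end', 'text'} dicts into .srt text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)
```

With the transcription result in hand, segments_to_srt(result["segments"]) gives a string you can write straight to a .srt file.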
Have fun munging the data to the desired format you need!