Let's say I have a video of a VTuber singing songs and talking between them, for example: https://www.youtube.com/watch?v=GjYSd1ykmFw
I would like to mark the start and end time of each song, but doing it manually is time-consuming, and there don't seem to be any existing tools for this. I am considering writing my own tool, but I have no experience with processing audio programmatically. How should I approach this problem, at a high level?
Slice the audio into short segments, then use a BPM measurement tool to estimate the beats per minute of each segment. Adjacent segments with similar BPM are likely from the same song. For a spoken-word segment, we’d expect no consistent beat to be detected.
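As a rough sketch of the grouping step only: suppose the BPM tool has already produced one estimate per fixed-length segment, with `None` wherever it found no steady beat. The function name `group_songs` and its parameters (`segment_seconds`, `tolerance`, `min_segments`) are made up for illustration, and the default values are guesses that would need tuning on real recordings.

```python
from typing import List, Optional, Tuple


def group_songs(
    bpms: List[Optional[float]],
    segment_seconds: float = 10.0,  # assumed length of each audio slice
    tolerance: float = 0.05,        # assumed: BPM within 5% counts as "similar"
    min_segments: int = 3,          # assumed: shorter runs are treated as noise
) -> List[Tuple[float, float]]:
    """Merge adjacent segments with similar BPM into (start, end) times in seconds."""
    songs: List[Tuple[float, float]] = []
    run_start = None  # index of the first segment in the current run
    prev = None       # BPM of the previous segment, or None if it had no beat

    def close(end_index: int) -> None:
        # Keep the current run only if it spans enough segments.
        if run_start is not None and end_index - run_start >= min_segments:
            songs.append((run_start * segment_seconds, end_index * segment_seconds))

    for i, bpm in enumerate(bpms):
        similar = (
            bpm is not None
            and prev is not None
            and abs(bpm - prev) <= tolerance * prev
        )
        if not similar:
            # The run ends here; a new one starts if this segment has a beat.
            close(i)
            run_start = i if bpm is not None else None
        prev = bpm

    close(len(bpms))
    return songs


# Hypothetical per-segment estimates: talk, a ~120 BPM song, talk, a ~90 BPM song.
estimates = [None, None, 120.0, 121.0, 119.5, 120.5, None, None, 90.0, 91.0, 90.5, None]
print(group_songs(estimates))  # [(20.0, 60.0), (80.0, 110.0)]
```

In practice a BPM tool may still report some tempo for speech instead of nothing, and may jump between half and double tempo within one song, so the raw estimates would probably need some filtering before this step.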