Background: I have video clips and audio tracks that I want to sync with said videos.
From the video clips, I'll extract a reference audio track. I also have another track that I want to synchronize with the reference track. The desync comes from editing, which altered the intervals for each cutscene.
I need to manipulate the target track to look like (sound like, in this case) the ref track. This amounts to adding or removing silence at the correct locations. This could be done manually, but it'd be extremely tedious. So I want to be able to determine these locations programatically.
Example:
0 1 2
012345678901234567890123
ref: --part1------part2------
syn: -----part1----part2-----
# (let `-` denote silence)
Output:
[(2,6), (5,9) # part1
(13, 17), (14, 18)] # part2
My idea is, starting from the beginning:
Fingerprint 2 large chunks* of audio and see if they match:
If yes: move on to the next chunk
If not:
Go down both tracks looking for the first non-silent portion of each
Offset the target to match the original
Go back to the beginning of the loop
# * chunk size determined by heuristics and modifiable
The main problem here is sound matching and fingerprinting are fuzzy and relatively expensive operations.
Ideally I want to them as few times as possible. Ideas?
If you can reliably distinguish silence from non-silence as you suggest and if the only differences are insertions of silence, then it seems the only non-trivial case is where silence is inserted where there was none before:
If you can make your chunk size adaptive to the silence, your algorithm should be fine. That is, if your chunk size is equivalent to two characters in the above example, your algorithm would recognize "pa" matches "pa" and "rt" matches "rt" but for the third chunk it must recognize the silence in
synand adapt the chunk size to compare "1" to "1" instead of "1p" to "1-".For more complicated edits, you might be able to adapt a weighted Shortest Edit Distance algorithm with removing silence have 0 cost.