How to combine multiple videos with AVMutableComposition without losing audio sync?


I'm trying to write a video exporter that can combine N videos and place them side by side. The example below uses 3 videos. The videos are recorded WebRTC streams of speakers; each has a slightly different frame rate, and they may also have been recorded with a variable frame rate, since RTC video can drop its rate at times.

The problem is that when the videos are merged side by side (hstack), the audio drifts out of sync with its corresponding video: after some time the audio no longer follows the lips of the speaker.

I also tried pre-exporting the individual videos to a constant frame rate with ffmpeg, but the de-sync still occurs. A similar audio de-sync happened when I tried ffmpeg's hstack filter, so in the end I gave up on that approach... but to my 'horror' the same de-sync happens when combining with AVFoundation.

Any advice on how to keep the audio in sync with the combined video?

func hstackVideos() {
    let videoPaths: [String] = [
        "path/to/video1.mp4",
        "path/to/video2.mp4",
        "path/to/video3.mp4",
    ]

    let composition = AVMutableComposition()

    let assetInfos: [(AVURLAsset, AVAssetTrack, AVMutableCompositionTrack, AVAssetTrack, AVMutableCompositionTrack)] = videoPaths.map {
        let asset = AVURLAsset(url: URL(fileURLWithPath: $0))
        let track = composition.addMutableTrack(withMediaType: AVMediaType.video, preferredTrackID: kCMPersistentTrackID_Invalid)!
        let videoAssetTrack = asset.tracks(withMediaType: .video)[0]
        try! track.insertTimeRange(videoAssetTrack.timeRange, of: videoAssetTrack, at: CMTime.zero)
        let audioTrack = composition.addMutableTrack(withMediaType: .audio, preferredTrackID: kCMPersistentTrackID_Invalid)!
        let audioAssetTrack = asset.tracks(withMediaType: .audio)[0]
        try! audioTrack.insertTimeRange(audioAssetTrack.timeRange, of: audioAssetTrack, at: CMTime.zero)
        return (asset, videoAssetTrack, track, audioAssetTrack, audioTrack)
    }

    let stackComposition = AVMutableVideoComposition()

    stackComposition.renderSize = CGSize(width: 512, height: 288)
    stackComposition.frameDuration = CMTime(seconds: 1/30, preferredTimescale: 600)
    // stackComposition.frameDuration = assetInfos[0].1.minFrameDuration

    var i = 0
    let instructions: [AVMutableVideoCompositionLayerInstruction] = assetInfos.map { (asset, assetTrack, compTrack, _, _) in
        let lInst = AVMutableVideoCompositionLayerInstruction(assetTrack: compTrack)
        let w: CGFloat = 512/CGFloat(assetInfos.count)
        let inRatio = assetTrack.naturalSize.width / assetTrack.naturalSize.height
        let cropRatio = w / 288
        let scale: CGFloat
        if inRatio < cropRatio {
            scale = w / assetTrack.naturalSize.width
        } else {
            scale = 288 / assetTrack.naturalSize.height
        }
        lInst.setCropRectangle(CGRect(x: w/scale, y: 0, width: w/scale, height: 288/scale), at: CMTime.zero)
        let transform = CGAffineTransform(scaleX: scale, y: scale)
        let t2 = transform.concatenating(CGAffineTransform(translationX: -w + CGFloat(i)*w, y: 0))
        lInst.setTransform(t2, at: CMTime.zero)
        i += 1
        return lInst
    }

    let inst = AVMutableVideoCompositionInstruction()
    inst.timeRange = CMTimeRange(start: CMTime.zero, duration: assetInfos[0].0.duration)
    inst.layerInstructions = instructions

    stackComposition.instructions = [inst]

    let exporter = AVAssetExportSession(asset: composition, presetName: AVAssetExportPresetHighestQuality)!
    let outPath = "path/to/finalVideo.mp4"
    let outUrl = URL(fileURLWithPath: outPath)
    try? FileManager.default.removeItem(at: outUrl)
    exporter.outputURL = outUrl
    exporter.videoComposition = stackComposition
    exporter.outputFileType = .mp4
    exporter.shouldOptimizeForNetworkUse = true

    let group = DispatchGroup()
    group.enter()
    exporter.exportAsynchronously(completionHandler: {
        switch exporter.status {
            case .completed:
                print("SUCCESS!")
                if exporter.error != nil {
                    print("Error: \(String(describing: exporter.error))")
                    print("Description: \(exporter.description)")
                }
                group.leave()
            case .exporting:
                let progress = exporter.progress
                print("Progress: \(progress)")
                
            case .failed:
                print("Error: \(String(describing: exporter.error))")
                print("Description: \(exporter.description)")
                group.leave()
            default:
                break
        }
    })

    group.wait()
}
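For reference, here is the tiling math from the layer instructions above extracted into plain Swift, worked through for a hypothetical 640×360 (16:9) input and the three-video case, so the expected scale and tile values can be sanity-checked independently of AVFoundation:

```swift
// Tiling math from the layer instructions, extracted into plain Swift.
// The 640x360 input size is a hypothetical example; the render size is
// 512x288 as in the composition above.
let renderWidth = 512.0
let renderHeight = 288.0
let videoCount = 3.0
let tileWidth = renderWidth / videoCount        // 512/3 ≈ 170.667

let naturalWidth = 640.0                        // hypothetical source size
let naturalHeight = 360.0
let inRatio = naturalWidth / naturalHeight      // 16:9 ≈ 1.778
let cropRatio = tileWidth / renderHeight        // ≈ 0.593

// Scale so the frame fully covers the tile (aspect-fill): a wide input
// (inRatio >= cropRatio) is scaled to the tile height and cropped on
// the sides; a narrow one is scaled to the tile width.
let scale = inRatio < cropRatio
    ? tileWidth / naturalWidth
    : renderHeight / naturalHeight

print(tileWidth, scale)                         // ≈ 170.667, 0.8
```

With these numbers the branch taken is the height-fit one, so each 640×360 input is scaled by 288/360 = 0.8 and then cropped horizontally to the tile width.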

[Update 29/07/2021]

I've checked the input and output durations of the audio and video tracks. Here are the results, in seconds.

Input videos:

  • video 1: (video track: 1086.586, audio track: 1086.483)
  • video 2: (video track: 1086.534, audio track: 1086.473)
  • video 3: (video track: 1086.5, audio track: 1086.483)

The output video has three audio tracks with noticeably altered durations: (video track: 1086.5855, a1 track: 1079.208, a2 track: 1083.8826666666666, a3 track: 1086.5855)

I'm also noticing a small difference between the nominalFrameRate of the source and destination audio tracks (source rates: 46.786236, 46.561222, 46.762463; destination rates: 46.874996, 46.875, 46.875). This could explain the duration difference, although I don't know what a frame rate means for audio or why the exporter changes it.
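One observation about those destination rates (an aside based on how AAC works, not on anything AVFoundation documents here): for compressed audio, nominalFrameRate counts audio packets per second, and an AAC packet holds 1024 PCM samples, so a 48 kHz AAC track has exactly 48000/1024 = 46.875 packets per second — which matches the destination rates above. The source rates sitting below 46.875 suggest the recorded tracks carry slightly fewer packets per second of timeline (plausibly gaps in the WebRTC capture), and if the exporter re-times those packets at a nominal 46.875/s, the track shortens in the same direction as (though not exactly by) the measured drift:

```swift
// An audio "frame" in an AAC track is one packet of 1024 PCM samples,
// so the nominal frame rate of a 48 kHz AAC track is 48000/1024.
let sampleRate = 48_000.0
let samplesPerPacket = 1_024.0
let aacPacketRate = sampleRate / samplesPerPacket   // 46.875 exactly

// If the source track carries packets at 46.786236/s (video 1 above)
// and they are re-timed at 46.875/s, the track gets shorter:
let sourceRate = 46.786236
let sourceDuration = 1086.483                       // video 1 audio track, seconds
let retimedDuration = sourceDuration * sourceRate / aacPacketRate

print(aacPacketRate)                                // 46.875
print(retimedDuration)                              // ≈ 1084.4, shorter than source
```

This doesn't reproduce the measured 1079.208 exactly, so re-timing alone may not be the whole story, but the direction of the drift is consistent.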

I also tried using AVMutableAudioMix, but the sync issue was still there.

It seems the input videos use some kind of duration scaling for the audio tracks that gets lost when the tracks are placed in the composition. Any suggestions on how to preserve it?
