I'm trying to write the output of Phenaki's make_video to an mp4 file. I'm using this Phenaki implementation from GitHub: https://github.com/lucidrains/phenaki-pytorch/search?q=make_video
phenaki = Phenaki(
    cvivit = cvivit,
    maskgit = maskgit
)

entire_video, scenes = make_video(phenaki, texts = [
    'blah blah',
], num_frames = (17, 14, 14), prime_lengths = (5, 5))

entire_video.shape # (1, 3, 17 + 14 + 14 = 45, 256, 256)

torchvision.io.write_video(filename= "test.mp4", video_array= entire_video, fps=24)
The error I'm getting is
  File "/.../GitHub/phenaki-pytorch/run.py", line 49, in <module>
    torchvision.io.write_video(filename= "test.mp4", video_array= entire_video, fps=24)
  File "/opt/homebrew/lib/python3.10/site-packages/torchvision/io/video.py", line 132, in write_video
    frame = av.VideoFrame.from_ndarray(img, format="rgb24")
  File "av/video/frame.pyx", line 408, in av.video.frame.VideoFrame.from_ndarray
  File "av/utils.pyx", line 72, in av.utils.check_ndarray
ValueError: Expected numpy array with ndim `3` but got `4`
What am I doing wrong? Why is the numpy array expected to be 3 dimensions for av.VideoFrame.from_ndarray?
According to the write_video documentation, the video_array argument is a "tensor containing the individual frames, as a uint8 tensor in [T, H, W, C] format".

The shape of entire_video is (1, 3, 45, 256, 128), so it has 5 dimensions instead of 4.
The exception says "ndim 3 but got 4" (not 4 and 5) because the dimension mismatch is detected inside an internal loop over the frames, where each item has one dimension fewer than the whole array.
The order of the dimensions is also wrong (3 is the number of color channels and should be the last dimension).
The type of entire_video is also wrong: it is float32 instead of uint8.
Assuming entire_video resides in GPU memory, we also have to copy the tensor to CPU memory before using write_video.

Before using write_video, we may apply the following stages:

Copy the video from GPU memory to CPU memory (and remove the redundant axis).
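A minimal sketch of this first stage. The random tensor is only a stand-in for the make_video output, with the shape quoted above; calling .detach() is an extra precaution that assumes the tensor might still be attached to the autograd graph.

```python
import torch

# Stand-in for the make_video output (random data, shape as quoted in the answer).
entire_video = torch.rand(1, 3, 45, 256, 128)

# .cpu() copies from GPU memory (it is a no-op for a CPU tensor);
# .squeeze(0) drops the redundant leading batch axis of size 1.
video = entire_video.detach().cpu().squeeze(0)

print(video.shape)  # torch.Size([3, 45, 256, 128])
```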
Convert from float32 to uint8 by applying an offset and a scale. A simple choice is to use the global minimum and maximum of the video (the conversion is not optimal; it is used as an example).
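One way the min/max conversion could look, as a sketch. The stand-in tensor and the small clamp that protects against a constant video (maximum equal to minimum) are assumptions added for illustration.

```python
import torch

# Stand-in float32 video in [C, T, H, W] layout with values roughly in [-1, 1).
video = torch.rand(3, 45, 256, 128) * 2 - 1

lo, hi = video.min(), video.max()

# Offset by the global minimum and scale so that [lo, hi] maps onto [0, 255].
# The clamp avoids a division by zero when every value is identical.
scale = 255.0 / (hi - lo).clamp(min=1e-8)
video_u8 = ((video - lo) * scale).round().to(torch.uint8)

print(video_u8.dtype)  # torch.uint8
```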
Reorder the axes to [T, H, W, C]:
The first axis is the frame index (size 45 when there are 45 video frames).
The second axis is the row index (size 256 when there are 256 rows in each frame).
The third axis is the column index (size 128 when there are 128 columns in each frame).
The fourth axis is the color channel (size 3, because there are 3 color channels: red, green and blue).
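The reordering can be sketched with permute, assuming the tensor is in [C, T, H, W] layout once the batch axis has been removed. The zero tensor is a stand-in.

```python
import torch

# Stand-in uint8 video in [C, T, H, W] layout.
video_u8 = torch.zeros(3, 45, 256, 128, dtype=torch.uint8)

# Move the channel axis (index 0) to the end: [C, T, H, W] -> [T, H, W, C].
# .contiguous() materializes the new memory layout instead of keeping a strided view.
video_thwc = video_u8.permute(1, 2, 3, 0).contiguous()

print(video_thwc.shape)  # torch.Size([45, 256, 128, 3])
```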
Complete code sample:
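A minimal sketch that combines the three stages. It assumes PyAV is installed and that the installed torchvision version still provides torchvision.io.write_video. The helper names to_uint8_thwc and save_video are hypothetical and not part of phenaki-pytorch or torchvision.

```python
import torch


def to_uint8_thwc(video: torch.Tensor) -> torch.Tensor:
    """Convert a float [1, C, T, H, W] tensor to a uint8 [T, H, W, C] tensor."""
    # Stage 1: copy to CPU memory and drop the batch axis -> [C, T, H, W].
    video = video.detach().cpu().squeeze(0)

    # Stage 2: map the global [min, max] range onto [0, 255] (simple, not optimal).
    lo, hi = video.min(), video.max()
    scale = 255.0 / (hi - lo).clamp(min=1e-8)  # guard against max == min
    video = ((video - lo) * scale).round().to(torch.uint8)

    # Stage 3: reorder the axes -> [T, H, W, C].
    return video.permute(1, 2, 3, 0).contiguous()


def save_video(video: torch.Tensor, filename: str, fps: int = 24) -> None:
    """Hypothetical wrapper: convert the tensor, then hand it to write_video."""
    import torchvision  # imported lazily so the conversion helper works without it

    frames = to_uint8_thwc(video)
    torchvision.io.write_video(filename=filename, video_array=frames, fps=fps)
```

With the tensor from the question, the call would be save_video(entire_video, "test.mp4").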
After all that, the created video file looks like random noise...
It looks like this is the output of make_video and is not related to the subject of the post.