I am attempting to extend OpenAI's CLIP to semantic video search. Essentially, my objective is to input a text query and get back the video segments/clips that match the semantic content of that query. Here's what I've come up with so far (rough sketch below):
- Extract frames from the video at regular intervals.
- Use CLIP to create embeddings of these frames and the text query.
- Compare the text query embeddings with the video frame embeddings to find matches.
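Here's a rough sketch of what I mean, assuming the openai/CLIP pip package, OpenCV for frame grabbing, the ViT-B/32 checkpoint, and one sampled frame per second (all arbitrary choices on my part):

```python
import cv2
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_frames(video_path, every_n_seconds=1.0):
    """Sample one frame every `every_n_seconds` seconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(1, int(round(fps * every_n_seconds)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # OpenCV gives BGR; CLIP's preprocess expects an RGB PIL image
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    return frames

@torch.no_grad()
def search(video_path, query, top_k=5):
    frames = extract_frames(video_path)
    images = torch.stack([preprocess(f) for f in frames]).to(device)
    image_emb = model.encode_image(images)
    text_emb = model.encode_text(clip.tokenize([query]).to(device))
    # Cosine similarity = dot product of L2-normalised vectors
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.T).squeeze(1)
    best = sims.topk(min(top_k, len(frames)))
    # Each frame index is also a timestamp in seconds, since I sample one frame per second
    return [(int(i), float(s)) for i, s in zip(best.indices, best.values)]
```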
However, this approach seems quite naive, and I suspect it won't effectively capture the context of the videos, since the temporal information is lost when each frame is treated in isolation.
Can anyone share advice on improving this approach? Is there a more efficient or effective way to implement semantic video search with OpenAI's CLIP? Also, I'm wondering about any preprocessing steps, possible optimization strategies, or libraries that could be beneficial for this task.
Any help or guidance would be greatly appreciated. Thanks!
Here's a simplified step-by-step:
Chunk the Video into 1-second Intervals
To divide the video into 1-second chunks, you would typically use a library like `moviepy` or `opencv`.
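For example, a minimal sketch with OpenCV (`moviepy` would work just as well); the 1-second chunk length is just a parameter:

```python
import cv2

def chunk_video(video_path, chunk_seconds=1.0):
    """Split a video into consecutive chunks of roughly `chunk_seconds` each.

    Returns a list of chunks, where each chunk is a list of RGB frames (numpy arrays).
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    frames_per_chunk = max(1, int(round(fps * chunk_seconds)))
    chunks, current = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        current.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if len(current) == frames_per_chunk:
            chunks.append(current)
            current = []
    if current:  # keep the trailing partial chunk
        chunks.append(current)
    cap.release()
    return chunks
```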
Generating the Embeddings
For each 1-second chunk, a series of frames is extracted, and their embeddings are calculated using the OpenAI CLIP model.
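A minimal sketch with the openai/CLIP pip package (ViT-B/32 is an arbitrary choice). Mean-pooling the frame embeddings into a single vector per chunk is an assumption on my part; you could just as well keep one embedding per frame:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def embed_chunks(chunks):
    """chunks: output of chunk_video above (a list of RGB frames per chunk)."""
    chunk_embeddings = []
    for frames in chunks:
        images = torch.stack([preprocess(Image.fromarray(f)) for f in frames]).to(device)
        emb = model.encode_image(images)            # (n_frames, 512)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalise each frame embedding
        chunk_embeddings.append(emb.mean(dim=0))    # mean-pool over the chunk
    return torch.stack(chunk_embeddings)            # (n_chunks, 512)
```

If compute is a concern, you can subsample a few frames per chunk instead of embedding every frame.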
Performing the Search
You can use cosine similarity between the text query embedding and each chunk embedding.
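For example, a sketch that ranks the chunk embeddings from the previous step against the text query (`model` and `device` are reused from the embedding sketch above):

```python
import torch
import clip

@torch.no_grad()
def search_chunks(query, chunk_embeddings, top_k=5):
    text_emb = model.encode_text(clip.tokenize([query]).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    chunk_norm = chunk_embeddings / chunk_embeddings.norm(dim=-1, keepdim=True)
    sims = (chunk_norm @ text_emb.T).squeeze(1)     # one cosine score per chunk
    best = sims.topk(min(top_k, sims.shape[0]))
    # With 1-second chunks, chunk index i corresponds to roughly seconds [i, i+1) of the video
    return [(int(i), float(s)) for i, s in zip(best.indices, best.values)]
```

For instance, `search_chunks("people experiencing joy", chunk_embeddings)` returns the indices (and therefore approximate timestamps) of the best-matching 1-second chunks.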
The challenge with this approach, however, is that treating 1-second intervals as a series of still frames does not capture the temporal context of the video; the chunks really need to be treated as moving images.
Mixpeek offers a managed search API that does this:
GET: https://api.mixpeek.com/v1/search?q=people+experiencing+joy
Further reading and demo: https://learn.mixpeek.com/what-is-semantic-video-search/