I am experimenting with creating a time-lapse-style animation using a text-to-image model, specifically Stable Diffusion, but the generated images come out essentially random and the results are poor. The goal is to generate a sequence of images depicting a cow in a pasture over the course of a day, with changing sunlight and cloud positions, while keeping the cow and the field consistent across frames.
Here's a brief overview of my approach:
I use specific text prompts for different times of the day (morning, midday, afternoon, etc.) with detailed descriptions of the cow and the field. For each prompt, I generate an image with the Stable Diffusion model, and the resulting images are then compiled into a GIF to create the time-lapse effect.
However, I am running into a problem: the generated images lack uniformity, especially in the position and appearance of the cow and the field. Each frame renders the cow and the field independently of the others, and this lack of continuity breaks the time-lapse effect.
Here's an example of the type of prompts I am using:
"A consistent image of a brown cow eating green grass in the same field, with the sun and clouds moving in the sky, time of day: [specific time]"
I would greatly appreciate any insights or suggestions on:
- How to improve the uniformity of the generated images, so that the cow and field remain consistent across frames.
- Are there specific techniques or adjustments when using text-to-image models that can help achieve this? (For instance, would fixing the random seed for every frame help? A rough sketch of what I mean is just after this list.)
- Any alternative approaches or models that are more suited for this kind of task?
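To illustrate the kind of adjustment I mean in the second point, here is a minimal sketch (untested on my side; the helper name and seed value are made up) of re-seeding a torch.Generator before every frame so that each call starts from the same initial latent. pipe refers to the StableDiffusionPipeline loaded in the script further down.

# Hypothetical sketch only: fix the initial latent by re-seeding before every frame.
# `pipe` is the StableDiffusionPipeline from the script below; the seed value is arbitrary.
def generate_frame_with_fixed_seed(prompt, seed=1234):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    return pipe(prompt, guidance_scale=15.0, generator=generator).images[0]

Would something along these lines be enough, or does this kind of consistency need a fundamentally different setup?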
This is my whole script, running on Google Colab:
# Install required libraries
!pip install diffusers transformers torch imageio
# Import necessary libraries
from diffusers import StableDiffusionPipeline
from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image
from io import BytesIO
from IPython.display import display
# Initialize the CLIP model (loaded here, but note it is never actually used further down)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# Function to create a series of prompts
def create_day_cycle_prompts(num_frames):
    # Cycle through five times of day; with num_frames > 5 the same times repeat.
    times_of_day = ["morning", "midday", "afternoon", "evening", "sunset"]
    prompts = []
    for i in range(num_frames):
        time = times_of_day[i % len(times_of_day)]
        prompt = (f"A consistent black and white pencil sketch of a brown cow eating green grass in the same field, "
                  f"with the sun and moving clouds in the sky, time of day: {time}")
        prompts.append(prompt)
    return prompts
num_frames = 50
prompts = create_day_cycle_prompts(num_frames)
# Load the Stable Diffusion model
device = "cuda"
model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_id, revision="fp16", torch_dtype=torch.float16)
pipe.to(device)
# Generate images from prompts
def generate_image_from_text(text):
    with torch.autocast(device):
        image = pipe(text, guidance_scale=15.0).images[0]  # Increased guidance scale
    # The pipeline already returns a PIL image; this just round-trips it through an in-memory PNG
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    buffer.seek(0)
    pil_image = Image.open(buffer)
    return pil_image
# Generate, save, and display images
for idx, text in enumerate(prompts):
    image = generate_image_from_text(text)
    image.save(f"frame_{idx}.png")
    print(f"Frame {idx}:")
    display(image)
print("All images generated, saved, and displayed.")