How to Achieve More Uniform Image Generation in Time-Lapse Style with Text-to-Image Models?


I am currently experimenting to see whether I can create a time-lapse-style animation using a text-to-image model, specifically Stable Diffusion, but the images come out essentially at random and the result is poor. The goal is to generate a sequence of images depicting a cow in a pasture over the course of a day, with changes in sunlight and cloud positions, while keeping the cow and the field consistent across frames.

Here's a brief overview of my approach:

I use specific text prompts for different times of the day (morning, midday, afternoon, etc.) with detailed descriptions of the cow and the field. For each prompt, I generate an image using the Stable Diffusion model. The resulting images are then compiled into a GIF to create a time-lapse effect. However, I am facing a challenge: the generated images lack uniformity, especially in the positioning and appearance of the cow and the field. Each image seems to treat the cow and the field independently, leading to a lack of continuity that disrupts the time-lapse effect.

Here's an example of the type of prompts I am using:

"A consistent image of a brown cow eating green grass in the same field, with the sun and clouds moving in the sky, time of day: [specific time]"

I would greatly appreciate any insights or suggestions on:

  • How can I improve the uniformity of the generated images so that the cow and field remain consistent across frames?
  • Are there specific techniques or adjustments in the use of text-to-image models that can help achieve this?
  • Are there alternative approaches or models better suited to this kind of task?

Here is my full script, which runs on Google Colab:

# Install required libraries
!pip install diffusers transformers torch imageio

# Import necessary libraries
from diffusers import StableDiffusionPipeline
from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image
from IPython.display import display

# Initialize the CLIP model and processor (loaded here but not used anywhere else in this script)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Function to create a series of prompts
def create_day_cycle_prompts(num_frames):
    times_of_day = ["morning", "midday", "afternoon", "evening", "sunset"]
    prompts = []
    for i in range(num_frames):
        time = times_of_day[i % len(times_of_day)]
        prompt = (f"A consistent black and white pencil sketch of a brown cow eating green grass in the same field, "
                  f"with the sun and moving clouds in the sky, time of day: {time}")
        prompts.append(prompt)
    return prompts

num_frames = 50
prompts = create_day_cycle_prompts(num_frames)

# Load the Stable Diffusion model
device = "cuda"
model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)  # half precision to fit the Colab GPU
pipe.to(device)

# Generate images from prompts
def generate_image_from_text(text):
    with torch.autocast(device):
        # The pipeline already returns a PIL image, so no extra encode/decode round-trip is needed
        image = pipe(text, guidance_scale=15.0).images[0]  # Increased guidance scale
    return image

# Generate, save, and display images
for idx, text in enumerate(prompts):
    image = generate_image_from_text(text)
    image.save(f"frame_{idx}.png")
    print(f"Frame {idx}:")
    display(image)

print("All images generated, saved, and displayed.")