How to change the text_encoder for a Stable Diffusion model for fine-tuning?


I want to use Stable Diffusion model weights to generate class-conditional images. However, I don't want these images to be conditioned on a text prompt, but rather on rows of binary class attributes.

To do this, I was thinking of using the HuggingFace Diffusers library, as it seemed the most straightforward option. My idea was to replace the CLIP text encoder/tokenizer with a custom encoder that maps the attribute rows into the latent space. However, I can't find any resources on this online, and I was wondering whether it is possible/feasible within the Diffusers library.
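To make the idea concrete, here is a minimal sketch of such an encoder in plain PyTorch. The class name `AttributeEncoder`, the two-embeddings-per-attribute design, and the choice of 40 attributes are all hypothetical. The output width of 768 is an assumption that matches the CLIP hidden size used for cross-attention in Stable Diffusion 1.x (2.x checkpoints use 1024). The output is shaped like a text-encoder output, so it could presumably be passed as `encoder_hidden_states` to a `UNet2DConditionModel`.

```python
import torch
import torch.nn as nn


class AttributeEncoder(nn.Module):
    """Hypothetical stand-in for the CLIP text encoder.

    Maps binary attribute rows of shape (batch, num_attributes) to a
    sequence of conditioning tokens (batch, num_attributes, embed_dim).
    """

    def __init__(self, num_attributes: int, embed_dim: int = 768):
        super().__init__()
        # Two learned vectors per attribute: one for "off", one for "on".
        self.value_embed = nn.Embedding(2 * num_attributes, embed_dim)
        # Learned per-attribute offset, playing the role of a position embedding.
        self.pos_embed = nn.Parameter(torch.zeros(num_attributes, embed_dim))
        self.norm = nn.LayerNorm(embed_dim)
        # Attribute i with value v is looked up at index 2 * i + v.
        self.register_buffer("offsets", 2 * torch.arange(num_attributes))

    def forward(self, attributes: torch.Tensor) -> torch.Tensor:
        idx = attributes.long() + self.offsets          # (batch, num_attributes)
        tokens = self.value_embed(idx) + self.pos_embed  # (batch, num_attributes, embed_dim)
        return self.norm(tokens)


# Usage: 4 samples, each described by 40 binary attributes (assumed count).
encoder = AttributeEncoder(num_attributes=40)
rows = torch.randint(0, 2, (4, 40))
cond = encoder(rows)  # shape: (4, 40, 768)
```

Emitting one token per attribute lets the UNet's cross-attention attend to individual attributes. A single pooled vector repeated along the sequence axis would be a simpler alternative.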

I understand that the StableDiffusionPipeline is likely too strict for this. How would I define a model that uses these attribute rows as the conditioner for generation, and how could this model be trained/fine-tuned?
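For the training part, one possible shape of a single fine-tuning step is sketched below. It bypasses the pipeline and calls the components directly. The function name `training_step` and its argument names are hypothetical. It assumes the UNet and scheduler follow the Diffusers interfaces as I understand them (`unet(sample, timestep, encoder_hidden_states=...).sample`, `noise_scheduler.add_noise(...)`, `noise_scheduler.config.num_train_timesteps`), and it assumes an epsilon-prediction objective, which is the Stable Diffusion 1.x default.

```python
import torch
import torch.nn.functional as F


def training_step(unet, attr_encoder, noise_scheduler, latents, attributes):
    """One noise-prediction step conditioned on attribute rows (sketch).

    latents:    VAE-encoded images, assumed already scaled by the VAE
                scaling factor, shape (batch, channels, height, width).
    attributes: binary attribute rows, shape (batch, num_attributes).
    """
    # Sample the noise and a random timestep for every item in the batch.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0,
        noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],),
        device=latents.device,
    )
    # Forward diffusion: corrupt the clean latents at the sampled timesteps.
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # The attribute embedding takes the place of the CLIP text embedding.
    cond = attr_encoder(attributes)
    pred = unet(noisy_latents, timesteps, encoder_hidden_states=cond).sample

    # Epsilon-prediction objective: the UNet should recover the added noise.
    return F.mse_loss(pred, noise)
```

In this setup the VAE would stay frozen, the new encoder would be trained from scratch, and the UNet (or only its cross-attention layers) could be fine-tuned along with it. Which of these to unfreeze is a design choice, not something the library prescribes.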
