Usage tutorial, step by step
Add the image you want to animate by dragging and dropping it or by uploading a local file; the model will process it quickly.
1. Insert the image file you want to use. Multiple upload modes are available.
2. For example, I will upload from my computer (using local file upload).
3. Once the image file has been uploaded successfully, click the Generate button below to start generating. Because it runs on an A100 GPU, generation takes a while; the image-to-video result can appear in as little as 177 seconds.
4. After clicking the Generate button, wait for the generation to finish.
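If you prefer to run image-to-video generation locally rather than through the web tool, a minimal sketch using the Hugging Face diffusers pipeline for SVD might look like the following (the model ID, input file name, and library version are assumptions; the repo's own sampling scripts, described later, are the official route):

```python
# A minimal sketch, assuming diffusers >= 0.24 and a CUDA GPU; the input image
# path is a placeholder.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

image = load_image("my_input.png").resize((1024, 576))  # SVD expects 576x1024 frames
frames = pipe(image, decode_chunk_size=8).frames[0]     # list of PIL frames
export_to_video(frames, "generated.mp4", fps=7)
```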
What is the latest Stable Video Diffusion?
Stable Video Diffusion is a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models for 2D image synthesis have been turned into generative video models by inserting temporal layers and fine-tuning them on small, high-quality video datasets. However, training methods in the literature vary greatly, and the field has yet to agree on a unified strategy for video data. The accompanying paper identifies and evaluates three distinct stages for successfully training a video LDM (latent diffusion model): text-to-image pre-training, video pre-training, and high-quality video fine-tuning.
Stable Video Diffusion provides a new method for producing high-quality videos. It takes text or images as input and generates realistic videos related to the input content, with a wide range of potential applications in film production, virtual reality, game development, and other fields.
Research on Stable Video Diffusion is crucial for advancing video generation technology: it gives researchers a framework for further exploring training methods and data strategies for video generation models, and for improving the quality and fidelity of generated videos.
Stability AI, the well-known AI image-generation company, has finally entered the AI video generation space.
On Tuesday this week, Stable Video Diffusion, a video generation model based on Stable Diffusion, was released, and the AI community immediately started discussing it.
Stable Video Diffusion Image-to-Video Model Card
Stable Video Diffusion (SVD) Image-to-Video is a diffusion model that takes in a still image as a conditioning frame, and generates a video from it.
Model Details
Model Description
(SVD) Image-to-Video is a latent diffusion model trained to generate short video clips from an image conditioning. This model was trained to generate 25 frames at resolution 576×1024 given a context frame of the same size, finetuned from SVD Image-to-Video [14 frames]. We also finetune the widely used f8-decoder for temporal consistency. For convenience, we additionally provide the model with the standard frame-wise decoder here.
- Developed by: Stability AI
- Funded by: Stability AI
- Model type: Generative image-to-video model
- Finetuned from model: SVD Image-to-Video [14 frames]
Model Sources
For research purposes, we recommend our generative-models GitHub repository (https://github.com/Stability-AI/generative-models), which implements the most popular diffusion frameworks (both training and inference).
- Repository: https://github.com/Stability-AI/generative-models
- Paper: https://stability.ai/research/stable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets
Evaluation
A user study compared SVD-Image-to-Video with GEN-2 and PikaLabs; human voters preferred SVD-Image-to-Video in terms of video quality. For details on the user study, we refer to the research paper.
Uses
Direct Use
The model is intended for research purposes only. Possible research areas and tasks include
- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
Excluded uses are described below.
Out-of-Scope Use
The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model. The model should not be used in any way that violates Stability AI’s Acceptable Use Policy.
Limitations and Bias
Limitations
- The generated videos are rather short (<= 4sec), and the model does not achieve perfect photorealism.
- The model may generate videos without motion, or very slow camera pans.
- The model cannot be controlled through text.
- The model cannot render legible text.
- Faces and people in general may not be generated properly.
- The autoencoding part of the model is lossy.
Recommendations
The model is intended for research purposes only.
How to Get Started with the Model
Check out https://github.com/Stability-AI/generative-models
Generative Models by Stability AI
News
November 21, 2023
- We are releasing Stable Video Diffusion, an image-to-video model, for research purposes:
  - SVD: This model was trained to generate 14 frames at resolution 576×1024 given a context frame of the same size. We use the standard image encoder from SD 2.1, but replace the decoder with a temporally-aware deflickering decoder.
  - SVD-XT: Same architecture as SVD but finetuned for 25 frame generation.
  - We provide a streamlit demo scripts/demo/video_sampling.py and a standalone python script scripts/sampling/simple_video_sample.py for inference of both models.
  - Alongside the model, we release a technical report.
July 26, 2023
- We are releasing two new open models with a permissive CreativeML Open RAIL++-M license (see Inference for file hashes):
  - SDXL-base-1.0: An improved version over SDXL-base-0.9.
  - SDXL-refiner-1.0: An improved version over SDXL-refiner-0.9.
July 4, 2023
- A technical report on SDXL is now available here.
June 22, 2023
- We are releasing two new diffusion models for research purposes:
  - SDXL-base-0.9: The base model was trained on a variety of aspect ratios on images with resolution 1024^2. The base model uses OpenCLIP-ViT/G and CLIP-ViT/L for text encoding, whereas the refiner model only uses the OpenCLIP model.
  - SDXL-refiner-0.9: The refiner has been trained to denoise small noise levels of high quality data and as such is not expected to work as a text-to-image model; instead, it should only be used as an image-to-image model.
If you would like to access these models for your research, please apply using one of the following links: SDXL-0.9-Base model, and SDXL-0.9-Refiner. This means that you can apply for any of the two links – and if you are granted – you can access both. Please log in to your Hugging Face Account with your organization email to request access. We plan to do a full release soon (July).
The codebase
General Philosophy
Modularity is king. This repo implements a config-driven approach where we build and combine submodules by calling instantiate_from_config() on objects defined in yaml configs. See configs/ for many examples.
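To make the config-driven pattern concrete, here is a minimal re-implementation of what such a helper does (for illustration only; the repo's own instantiate_from_config is the authoritative version):

```python
# Illustrative re-implementation of the config-driven instantiation pattern.
import importlib

def instantiate_from_config(config: dict):
    # "target" is a dotted import path, "params" are the constructor kwargs.
    module_name, cls_name = config["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), cls_name)
    return cls(**config.get("params", {}))

# A yaml block such as
#   target: torch.nn.Linear
#   params: {in_features: 16, out_features: 4}
# is instantiated like this:
layer = instantiate_from_config(
    {"target": "torch.nn.Linear", "params": {"in_features": 16, "out_features": 4}}
)
```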
Changelog from the old ldm codebase
For training, we use PyTorch Lightning, but it should be easy to use other training wrappers around the base modules. The core diffusion model class (formerly LatentDiffusion, now DiffusionEngine) has been cleaned up:
- No more extensive subclassing! We now handle all types of conditioning inputs (vectors, sequences and spatial conditionings, and all combinations thereof) in a single class: GeneralConditioner, see sgm/modules/encoders/modules.py.
- We separate guiders (such as classifier-free guidance, see sgm/modules/diffusionmodules/guiders.py) from the samplers (sgm/modules/diffusionmodules/sampling.py), and the samplers are independent of the model.
- We adopt the “denoiser framework” for both training and inference (most notable change is probably now the option to train continuous time models):
  - Discrete times models (denoisers) are simply a special case of continuous time models (denoisers); see sgm/modules/diffusionmodules/denoiser.py.
  - The following features are now independent: weighting of the diffusion loss function (sgm/modules/diffusionmodules/denoiser_weighting.py), preconditioning of the network (sgm/modules/diffusionmodules/denoiser_scaling.py), and sampling of noise levels during training (sgm/modules/diffusionmodules/sigma_sampling.py).
- Autoencoding models have also been cleaned up.
Installation:
1. Clone the repo
git clone git@github.com:Stability-AI/generative-models.git
cd generative-models
2. Setting up the virtualenv
This is assuming you have navigated to the generative-models root after cloning it.
NOTE: This is tested under python3.10. For other python versions, you might encounter version conflicts.
PyTorch 2.0
# install required packages from pypi
python3 -m venv .pt2
source .pt2/bin/activate
pip3 install -r requirements/pt2.txt
3. Install sgm
pip3 install .
4. Install sdata for training
pip3 install -e git+https://github.com/Stability-AI/datapipelines.git@main#egg=sdata
Packaging
This repository uses PEP 517 compliant packaging using Hatch.
To build a distributable wheel, install hatch and run hatch build (specifying -t wheel will skip building a sdist, which is not necessary).
pip install hatch
hatch build -t wheel
You will find the built package in dist/. You can install the wheel with pip install dist/*.whl.
Note that the package does not currently specify dependencies; you will need to install the required packages, depending on your use case and PyTorch version, manually.
Inference
We provide a streamlit demo for text-to-image and image-to-image sampling in scripts/demo/sampling.py. We provide file hashes for the complete file as well as for only the saved tensors in the file (see Model Spec for a script to evaluate that); a short hash-check sketch follows the model list below. The following models are currently supported:
- SDXL-base-1.0
File Hash (sha256): 31e35c80fc4829d14f90153f4c74cd59c90b779f6afe05a74cd6120b893f7e5b
Tensordata Hash (sha256): 0xd7a9105a900fd52748f20725fe52fe52b507fd36bee4fc107b1550a26e6ee1d7
- SDXL-refiner-1.0
File Hash (sha256): 7440042bbdc8a24813002c09b6b69b64dc90fded4472613437b7f55f9b7d9c5f
Tensordata Hash (sha256): 0x1a77d21bebc4b4de78c474a90cb74dc0d2217caf4061971dbfa75ad406b75d81
- SDXL-base-0.9
- SDXL-refiner-0.9
- SD-2.1-512
- SD-2.1-768
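As referenced above, a downloaded checkpoint can be checked against the published file hash with a few lines of Python (the file name under checkpoints/ is an assumed example; use your actual download path):

```python
# A minimal sketch for verifying the published sha256 file hash of a checkpoint.
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large checkpoints need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "31e35c80fc4829d14f90153f4c74cd59c90b779f6afe05a74cd6120b893f7e5b"  # SDXL-base-1.0
print(sha256_file("checkpoints/sd_xl_base_1.0.safetensors") == expected)  # assumed file name
```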
Weights for SDXL:
SDXL-1.0: The weights of SDXL-1.0 are available (subject to a CreativeML Open RAIL++-M license) here:
- base model: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/
- refiner model: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/
SDXL-0.9: The weights of SDXL-0.9 are available and subject to a research license. If you would like to access these models for your research, please apply using one of the following links: SDXL-base-0.9 model, and SDXL-refiner-0.9. This means that you can apply for any of the two links – and if you are granted – you can access both. Please log in to your Hugging Face Account with your organization email to request access.
After obtaining the weights, place them into checkpoints/. Next, start the demo using
streamlit run scripts/demo/sampling.py --server.port <your_port>
Invisible Watermark Detection
Images generated with our code use the invisible-watermark library to embed an invisible watermark into the model output. We also provide a script to easily detect that watermark. Please note that this watermark is not the same as in previous Stable Diffusion 1.x/2.x versions.
To run the script you need to either have a working installation as above or try an experimental import using only a minimal amount of packages:
python -m venv .detect
source .detect/bin/activate
pip install "numpy>=1.17" "PyWavelets>=1.1.1" "opencv-python>=4.1.0.25"
pip install --no-deps invisible-watermark
To run the script you need to have a working installation as above. The script is then useable in the following ways (don't forget to activate your virtual environment beforehand, e.g. source .pt1/bin/activate):
# test a single file
python scripts/demo/detect.py <your filename here>
# test multiple files at once
python scripts/demo/detect.py <filename 1> <filename 2> … <filename n>
# test all files in a specific folder
python scripts/demo/detect.py <your folder name here>/*
Training:
We are providing example training configs in configs/example_training. To launch a training, run
python main.py --base configs/<config1.yaml> configs/<config2.yaml>
where configs are merged from left to right (later configs overwrite the same values). This can be used to combine model, training and data configs. However, all of them can also be defined in a single config. For example, to run a class-conditional pixel-based diffusion model training on MNIST, run
python main.py --base configs/example_training/toy/mnist_cond.yaml
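The left-to-right merge behaviour can be pictured with a short OmegaConf sketch (the repo's yaml configs are OmegaConf-style; the config contents here are hypothetical):

```python
# Hypothetical illustration of merging configs left to right: values defined in
# later configs overwrite the same keys from earlier ones.
from omegaconf import OmegaConf

base = OmegaConf.create({"model": {"lr": 1e-4}, "data": {"batch_size": 8}})
override = OmegaConf.create({"data": {"batch_size": 32}})

merged = OmegaConf.merge(base, override)
print(merged.data.batch_size)  # 32 -- the later config wins
```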
NOTE 1: Using the non-toy-dataset configs configs/example_training/imagenet-f8_cond.yaml, configs/example_training/txt2img-clipl.yaml and configs/example_training/txt2img-clipl-legacy-ucg-training.yaml for training will require edits depending on the used dataset (which is expected to be stored in tar-files in the webdataset format). To find the parts which have to be adapted, search for comments containing USER: in the respective config.
NOTE 2: This repository supports both pytorch1.13 and pytorch2 for training generative models. However, for autoencoder training as e.g. in configs/example_training/autoencoder/kl-f4/imagenet-attnfree-logvar.yaml, only pytorch1.13 is supported.
NOTE 3: Training latent generative models (as e.g. in configs/example_training/imagenet-f8_cond.yaml) requires retrieving the checkpoint from Hugging Face and replacing the CKPT_PATH placeholder in this line. The same is to be done for the provided text-to-image configs.
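For retrieving such a checkpoint programmatically, a hedged sketch with huggingface_hub might look like this (the repo ID and file name are placeholders; use whichever checkpoint the config's comments point to):

```python
# Hypothetical example: download a checkpoint from Hugging Face and use the
# returned local path to replace the CKPT_PATH placeholder in the config.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="stabilityai/your-model-repo",   # placeholder repo ID
    filename="model.safetensors",            # placeholder file name
)
print("Replace CKPT_PATH with:", ckpt_path)
```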
Building New Diffusion Models
Conditioner
The GeneralConditioner is configured through the conditioner_config. Its only attribute is emb_models, a list of different embedders (all inherited from AbstractEmbModel) that are used to condition the generative model. All embedders should define whether or not they are trainable (is_trainable, default False), the classifier-free guidance dropout rate that is used (ucg_rate, default 0), and an input key (input_key), for example, txt for text-conditioning or cls for class-conditioning. When computing conditionings, the embedder will get batch[input_key] as input. We currently support two to four dimensional conditionings, and conditionings of different embedders are concatenated appropriately. Note that the order of the embedders in the conditioner_config is important.
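As a rough illustration of the embedder interface described above (a sketch only; the actual AbstractEmbModel base class in sgm/modules/encoders/modules.py is the authoritative interface, and its attribute handling may differ):

```python
# A minimal, hypothetical class-conditioning embedder in the spirit of the
# interface above.
import torch
import torch.nn as nn

class ToyClassEmbedder(nn.Module):
    def __init__(self, n_classes: int = 10, embed_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(n_classes, embed_dim)
        self.is_trainable = False  # frozen by default, as described above
        self.ucg_rate = 0.1        # classifier-free guidance dropout rate
        self.input_key = "cls"     # the conditioner feeds it batch["cls"]

    def forward(self, cls: torch.Tensor) -> torch.Tensor:
        # Returns a vector conditioning of shape (batch, embed_dim).
        return self.embedding(cls)

emb = ToyClassEmbedder()
print(emb(torch.tensor([3, 7])).shape)  # torch.Size([2, 512])
```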
Network
The neural network is set through the network_config. This used to be called unet_config, which is not general enough as we plan to experiment with transformer-based diffusion backbones.
Loss
The loss is configured through loss_config. For standard diffusion model training, you will have to set sigma_sampler_config.
Sampler config
As discussed above, the sampler is independent of the model. In the sampler_config, we set the type of numerical solver, number of steps, type of discretization, as well as, for example, guidance wrappers for classifier-free guidance.
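For reference, the classifier-free guidance that such a guider wrapper applies combines conditional and unconditional model outputs roughly as follows (a generic sketch, not the repo's exact implementation):

```python
import torch

def cfg_combine(x_uncond: torch.Tensor, x_cond: torch.Tensor, scale: float = 7.5) -> torch.Tensor:
    # Standard classifier-free guidance: move the prediction from the
    # unconditional output towards the conditional one, scaled by `scale`.
    return x_uncond + scale * (x_cond - x_uncond)
```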
Dataset Handling
For large scale training we recommend using the data pipelines from our data pipelines project. The project is contained in the requirements and automatically included when following the steps from the Installation section. Small map-style datasets should be defined here in the repository (e.g., MNIST, CIFAR-10, ...), and return a dict of data keys/values, e.g.,
example = {"jpg": x,  # this is a tensor -1...1 chw
           "txt": "a beautiful image"}
where we expect images in -1...1, channel-first format.
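A minimal sketch of such a map-style dataset (the class name and caption string are illustrative; the [-1, 1], channel-first convention follows the note above):

```python
# Hypothetical map-style MNIST wrapper returning the dict format described above.
import torch
from torch.utils.data import Dataset
from torchvision import datasets, transforms

class MNISTDictDataset(Dataset):
    def __init__(self, root: str = "./data", train: bool = True):
        self.ds = datasets.MNIST(root, train=train, download=True,
                                 transform=transforms.ToTensor())

    def __len__(self) -> int:
        return len(self.ds)

    def __getitem__(self, idx: int) -> dict:
        x, y = self.ds[idx]          # x: (1, 28, 28) tensor in [0, 1]
        x = x * 2.0 - 1.0            # rescale to [-1, 1], channel-first (chw)
        return {"jpg": x, "txt": f"a photo of the digit {y}"}
```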