
Sana
Efficient high-resolution image synthesis framework
- Deep compression autoencoder: Unlike traditional autoencoders, Sana's trained autoencoder compresses images by 32 times, effectively reducing the number of latent tokens.
- Linear DiT: Replaces all conventional attention in the DiT with linear attention, improving efficiency at high resolution without sacrificing quality.
- Decoder-only text encoder: Uses a modern decoder-only small language model as the text encoder, and improves image-text alignment through complex human instructions with in-context learning.
- Efficient training and sampling: Proposes Flow-DPM-Solver to reduce the number of sampling steps, and accelerates convergence through efficient caption labeling and selection.
- Competitive with modern large-scale diffusion models: Sana-0.6B performs comparably to large models such as Flux-12B while being 20 times smaller and more than 100 times faster in throughput.
- Laptop GPU deployment: Sana-0.6B can be deployed on a 16 GB laptop GPU, generating a 1024 × 1024 image in less than 1 second.
- Open source: Sana is committed to providing fast, open-source AI technology that solves practical problems.
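To make the token reduction from the deep compression autoencoder concrete, the sketch below counts how many latent tokens a transformer would process at a given downsampling factor. The 8 times baseline is the factor commonly used by latent diffusion autoencoders and is an assumption here, as are the function name `latent_token_count` and the patch size of 1; the listing itself only states the 32 times figure.

```python
def latent_token_count(height: int, width: int, downsample: int, patch: int = 1) -> int:
    """Return the number of latent tokens for one image.

    `downsample` is the autoencoder's spatial compression factor per side;
    `patch` is the transformer patch size (hypothetical default of 1).
    """
    stride = downsample * patch
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by downsample * patch")
    return (height // stride) * (width // stride)


if __name__ == "__main__":
    for size in (1024, 4096):
        tokens_8x = latent_token_count(size, size, downsample=8)    # assumed baseline
        tokens_32x = latent_token_count(size, size, downsample=32)  # factor stated for Sana
        print(f"{size} x {size}: 8x -> {tokens_8x} tokens, "
              f"32x -> {tokens_32x} tokens ({tokens_8x // tokens_32x}x fewer)")
```

Because the factor applies to both height and width, moving from 8 times to 32 times compression cuts the token count by 16 times under these assumptions, which is why the autoencoder matters so much at 4096 × 4096.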
Product Details
Sana is a text-to-image framework that efficiently generates images at resolutions up to 4096 × 4096. It synthesizes high-resolution, high-quality images very quickly while maintaining strong text-image alignment, and it can be deployed on laptop GPUs. Its core design comprises a deep compression autoencoder, a linear diffusion transformer (DiT), a decoder-only small language model used as the text encoder, and efficient training and sampling strategies. Compared with modern large-scale diffusion models, Sana-0.6B is 20 times smaller and more than 100 times faster in measured throughput. Sana-0.6B can also be deployed on a 16 GB laptop GPU, generating a 1024 × 1024 image in less than 1 second. Sana thereby makes low-cost content creation possible.
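The efficiency gain from linear attention can be illustrated with a minimal pure-Python sketch. It assumes a ReLU feature map, which is one common kernel choice; the listing does not say which kernel Sana uses, and the function names here are hypothetical. The idea is to summarize all keys and values once, so the cost grows linearly with the number of tokens instead of quadratically.

```python
def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def linear_attention(queries, keys, values, eps=1e-6):
    """Linear attention with an assumed ReLU feature map.

    queries, keys: lists of N vectors of length d; values: N vectors of length d_v.
    Keys and values are folded into a d x d_v summary in a single pass.
    """
    d, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d)]  # sum over tokens of relu(k) outer v
    k_sum = [0.0] * d                     # sum over tokens of relu(k)
    for k, v in zip(keys, values):
        phi_k = [relu(x) for x in k]
        for i in range(d):
            k_sum[i] += phi_k[i]
            for j in range(d_v):
                kv[i][j] += phi_k[i] * v[j]
    outputs = []
    for q in queries:
        phi_q = [relu(x) for x in q]
        norm = sum(a * b for a, b in zip(phi_q, k_sum)) + eps
        outputs.append([sum(phi_q[i] * kv[i][j] for i in range(d)) / norm
                        for j in range(d_v)])
    return outputs


def quadratic_reference(queries, keys, values, eps=1e-6):
    """Same weighting computed pairwise (every query against every key)."""
    outputs = []
    for q in queries:
        phi_q = [relu(x) for x in q]
        weights = [sum(a * relu(b) for a, b in zip(phi_q, k)) for k in keys]
        norm = sum(weights) + eps
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) / norm
                        for j in range(len(values[0]))])
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.5], [0.2, -1.0], [0.0, 2.0]]
    k = [[0.3, 1.0], [-0.5, 0.7], [1.2, 0.1]]
    v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(linear_attention(q, k, v))
    print(quadratic_reference(q, k, v))
```

Both functions produce the same result up to floating-point rounding, but the first never builds the N × N score table, which is what makes it attractive when a high-resolution image yields many tokens.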