
Exploring the Latest Advancements in AI Research
Our community of open-source research hubs has over 200,000 members building the future of AI. We work globally with our partners, industry leaders, and experts to develop cutting-edge open AI models for Image, Language, Audio, Video, 3D, Biology, and more.

MARBLE: Material Recomposition and Blending in CLIP-Space
Editing the materials of objects in images based on exemplar images is an active area of research in computer vision and graphics. We propose MARBLE, a method for performing material blending and recomposing fine-grained material properties by finding material embeddings in CLIP-space and using them to control pre-trained text-to-image models.
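As a rough illustration of the embedding-space blending idea (not the authors' code), the sketch below embeds two material exemplars with an off-the-shelf CLIP model and interpolates between them; the file names and the downstream conditioning hook are hypothetical.

```python
# A rough sketch of blending material exemplars in CLIP image-embedding
# space. The checkpoint is a real Hugging Face model; the image files and
# the downstream conditioning hook are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_embed(image: Image.Image) -> torch.Tensor:
    """Embed an exemplar image and L2-normalize the result."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

# Blend two material exemplars, e.g. 70% brushed metal, 30% ceramic.
e_metal = clip_embed(Image.open("metal_exemplar.png"))
e_ceramic = clip_embed(Image.open("ceramic_exemplar.png"))
blend = 0.7 * e_metal + 0.3 * e_ceramic
blend = blend / blend.norm(dim=-1, keepdim=True)
# `blend` would then condition the text-to-image model, e.g. through an
# IP-Adapter-style image-prompt module (an assumption, not MARBLE's exact hook).
```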

Fast Text-to-Audio Generation with Adversarial Post-Training
We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation.
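For intuition, here is a minimal sketch of the relativistic adversarial losses that give ARC part of its name, in the pairwise form introduced by relativistic GANs: the discriminator is trained to score each real sample above its paired fake, and the generator to close that gap from the other side. ARC's full objective also includes a contrastive discriminator term tied to the text prompt, which is omitted here.

```python
import torch.nn.functional as F

def relativistic_d_loss(d_real, d_fake):
    # Train the discriminator so each real sample outscores its paired fake:
    # -log sigmoid(D(real) - D(fake)) == softplus(-(D(real) - D(fake)))
    return F.softplus(-(d_real - d_fake)).mean()

def relativistic_g_loss(d_real, d_fake):
    # Train the generator to close the same score gap in reverse.
    return F.softplus(-(d_fake - d_real)).mean()
```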

FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image
We present a novel framework for generating a high-quality, animatable 4D avatar from a single image. While recent advances have shown promising results in 4D avatar creation, existing methods either require extensive multiview data or struggle with shape accuracy and identity consistency.

SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation
We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency.

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
Diffusion models are the main driver of progress in image and video synthesis, but suffer from slow inference speed. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD), aim to shift the model from many-shot to single-step inference, albeit at the cost of expensive and difficult optimization due to ADD's reliance on a fixed pretrained DINOv2 discriminator.

Stable Virtual Camera: Multi-View Video Generation with 3D Camera Control
We present Stable Virtual Camera, a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras.

SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions.

SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement
We present SF3D, a novel method for rapid and high-quality textured object mesh reconstruction from a single image in just 0.5 seconds.

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency
We present Stable Video 4D (SV4D), a latent video diffusion model for multi-frame and multi-view consistent dynamic 3D content generation.

Stable Audio Open
Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics.

Shaping Realities: Enhancing 3D Generative AI with Fabrication Constraints
This workshop paper highlights the limitations of generative AI tools in translating digital creations into the physical world and proposes new augmentations to generative AI tools for creating physically viable 3D models.

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
We present Stable Video 3D (SV3D) -- a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object.

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales.
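A minimal sketch of the objective in question, under assumptions about the model interface: rectified flow regresses the constant velocity of a straight path between data and noise, and drawing the timestep from a logit-normal distribution is one way of concentrating training on intermediate, perceptually relevant noise levels.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0, cond, loc=0.0, scale=1.0):
    """One training step of a rectified flow objective with logit-normal
    timestep sampling. `model(x_t, t, cond)` is an assumed interface."""
    b = x0.shape[0]
    # Logit-normal sampling: t = sigmoid(u), u ~ N(loc, scale). This puts
    # most of the training mass at intermediate noise levels.
    u = loc + scale * torch.randn(b, device=x0.device)
    t = torch.sigmoid(u).view(b, *([1] * (x0.ndim - 1)))
    noise = torch.randn_like(x0)
    # Straight-line interpolation between data and noise ...
    x_t = (1.0 - t) * x0 + t * noise
    # ... whose constant velocity is the regression target.
    target = noise - x0
    return F.mse_loss(model(x_t, t.flatten(), cond), target)
```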

TripoSR: Fast 3D Object Reconstruction from a Single Image
This technical report introduces TripoSR, a 3D reconstruction model leveraging a transformer architecture for fast feed-forward 3D generation, producing a 3D mesh from a single image in under 0.5 seconds.

Fast Timing-Conditioned Latent Audio Diffusion
Our research focuses on the efficient generation of long-form, variable-length stereo music and sounds at 44.1kHz from text prompts using a generative model.
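Variable-length generation comes from timing conditioning: the start offset and total duration of each training crop are embedded and passed alongside the text features, so a specific length can be requested at inference time. Below is a minimal sketch of such a conditioner; the MLP embedder and the normalization constant are illustrative choices, not the paper's exact module.

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Embed (seconds_start, seconds_total) as two conditioning tokens.
    The MLP embedder and `max_seconds` normalizer are illustrative."""

    def __init__(self, dim: int, max_seconds: float = 95.0):
        super().__init__()
        self.max_seconds = max_seconds
        self.embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, seconds_start: torch.Tensor, seconds_total: torch.Tensor):
        # Normalize the scalars to [0, 1] and embed each one as a token.
        start = self.embed((seconds_start / self.max_seconds).unsqueeze(-1))
        total = self.embed((seconds_total / self.max_seconds).unsqueeze(-1))
        return torch.stack([start, total], dim=1)  # (batch, 2, dim)

# E.g. ask for a 30-second clip starting at offset 0:
# tokens = TimingConditioner(dim=768)(torch.tensor([0.0]), torch.tensor([30.0]))
```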

Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
This paper presents the Hourglass Diffusion Transformer (HDiT), a backbone for high-resolution image synthesis that operates directly in pixel space. Its hierarchical, hourglass-shaped architecture scales efficiently with pixel count, handling large images without the multi-stage or latent-space pipelines that traditional methods rely on.
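To make the hourglass structure concrete, here is a schematic sketch (not the authors' implementation) of the token merge and split operations that shrink and re-expand the token grid, so most attention runs at coarse resolutions; the real model additionally uses neighborhood attention at high-resolution levels and skip connections between matching levels.

```python
import torch
import torch.nn as nn

class TokenMerge(nn.Module):
    """Merge each 2x2 block of tokens into one: half the grid, wider channels."""

    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.proj = nn.Linear(4 * dim_in, dim_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (b, h, w, c)
        b, h, w, c = x.shape
        x = x.view(b, h // 2, 2, w // 2, 2, c).permute(0, 1, 3, 2, 4, 5)
        return self.proj(x.reshape(b, h // 2, w // 2, 4 * c))

class TokenSplit(nn.Module):
    """Inverse of TokenMerge: expand each token back into a 2x2 block."""

    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.proj = nn.Linear(dim_in, 4 * dim_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (b, h, w, c)
        b, h, w, _ = x.shape
        x = self.proj(x).view(b, h, w, 2, 2, -1).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(b, 2 * h, 2 * w, -1)

# Going down the hourglass, a 64x64 token grid becomes a 32x32 grid, so
# global attention at the coarse level is far cheaper than at full resolution:
# x = torch.randn(1, 64, 64, 256)
# x = TokenMerge(256, 512)(x)   # -> (1, 32, 32, 512)
# x = TokenSplit(512, 256)(x)   # -> (1, 64, 64, 256)
```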

Adversarial Diffusion Distillation
We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1–4 steps while maintaining high image quality.
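Schematically, the ADD student is trained with two losses: an adversarial loss on its one-step samples and a distillation loss pulling those samples toward the frozen teacher's denoised estimate of them. The sketch below is a simplification under assumed interfaces; in the paper the discriminator works on pretrained vision-model features, the noising follows the teacher's schedule, and the distillation term has its own weighting.

```python
import torch
import torch.nn.functional as F

def add_student_loss(student, teacher, discriminator, noise, lam=2.5):
    """Simplified ADD student objective under assumed interfaces; the
    noising schedule and the weighting `lam` are illustrative choices."""
    # One-step generation from pure noise.
    x_student = student(noise)
    # Adversarial term: the student tries to make its samples look real.
    adv = F.softplus(-discriminator(x_student)).mean()
    # Distillation term: re-noise the sample, let the frozen teacher
    # denoise it, and pull the student sample toward that estimate.
    t = torch.rand(noise.shape[0], device=noise.device).view(-1, 1, 1, 1)
    x_noised = (1.0 - t) * x_student + t * torch.randn_like(x_student)
    with torch.no_grad():
        x_teacher = teacher(x_noised, t.flatten())
    distill = F.mse_loss(x_student, x_teacher)
    return adv + lam * distill
```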

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
We present Stable Video Diffusion — a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.

Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion
Stable Audio is a latent diffusion model for fast, timing-conditioned audio generation, developed by Harmonai, Stability AI's generative audio research lab.

Humans in 4D: Reconstructing and Tracking Humans with Transformers
Stability AI is proud to support research teams across the globe by providing compute power.