
Exploring the Latest Advancements in AI Research
Our community of open-source research hubs has over 200,000 members building the future of AI. We work globally with our partners, industry leaders, and experts to develop cutting-edge open AI models for Image, Language, Audio, Video, 3D, Biology, and more.

MARBLE: Material Recomposition and Blending in CLIP-Space
Editing the materials of objects in images based on exemplar images is an active area of research in computer vision and graphics. We propose MARBLE, a method for performing material blending and recomposing fine-grained material properties by finding material embeddings in CLIP-space and using them to control pre-trained text-to-image models.
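As a rough illustration of the embedding-space blending idea (not the authors' code), the sketch below embeds two material exemplars with an off-the-shelf CLIP model and interpolates between them; the file names and the downstream conditioning hook are hypothetical.

```python
# A rough sketch of blending material exemplars in CLIP image-embedding
# space. The checkpoint is a real Hugging Face model; the image files and
# the downstream conditioning hook are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_embed(image: Image.Image) -> torch.Tensor:
    """Embed an exemplar image and L2-normalize the result."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

# Blend two material exemplars, e.g. 70% brushed metal, 30% ceramic.
e_metal = clip_embed(Image.open("metal_exemplar.png"))
e_ceramic = clip_embed(Image.open("ceramic_exemplar.png"))
blend = 0.7 * e_metal + 0.3 * e_ceramic
blend = blend / blend.norm(dim=-1, keepdim=True)
# `blend` would then condition the text-to-image model, e.g. through an
# IP-Adapter-style image-prompt module (an assumption, not MARBLE's exact hook).
```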

Fast Text-to-Audio Generation with Adversarial Post-Training
We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation.
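For intuition, here is a minimal sketch of the relativistic adversarial losses that give ARC part of its name, in the pairwise form introduced by relativistic GANs: the discriminator is trained to score each real sample above its paired fake, and the generator to close that gap from the other side. ARC's full objective also includes a contrastive discriminator term tied to the text prompt, which is omitted here.

```python
import torch.nn.functional as F

def relativistic_d_loss(d_real, d_fake):
    # Train the discriminator so each real sample outscores its paired fake:
    # -log sigmoid(D(real) - D(fake)) == softplus(-(D(real) - D(fake)))
    return F.softplus(-(d_real - d_fake)).mean()

def relativistic_g_loss(d_real, d_fake):
    # Train the generator to close the same score gap in reverse.
    return F.softplus(-(d_fake - d_real)).mean()
```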

FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image
We present a novel framework for generating a high-quality, animatable 4D avatar from a single image. While recent advances have shown promising results in 4D avatar creation, existing methods either require extensive multiview data or struggle with shape accuracy and identity consistency.

SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation
We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency.

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
Diffusion models are the main driver of progress in image and video synthesis, but suffer from slow inference speed. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD), aim to shift the model from many-shot to single-step inference, albeit at the cost of expensive and difficult optimization due to ADD's reliance on a fixed pretrained DINOv2 discriminator.

Stable Virtual Camera: Multi-View Video Generation with 3D Camera Control
We present Stable Virtual Camera, a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras.

SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions.

SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement
We present SF3D, a novel method for rapid and high-quality textured object mesh reconstruction from a single image in just 0.5 seconds.

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency
We present Stable Video 4D (SV4D), a latent video diffusion model for multi-frame and multi-view consistent dynamic 3D content generation.

Stable Audio Open
Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics.

Shaping Realities: Enhancing 3D Generative AI with Fabrication Constraints
This workshop paper highlights the limitations of generative AI tools in translating digital creations into the physical world and proposes new augmentations to generative AI tools for creating physically viable 3D models.

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
We present Stable Video 3D (SV3D) -- a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object.

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales.
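A minimal sketch of the objective in question, under assumptions about the model interface: rectified flow regresses the constant velocity of a straight path between data and noise, and drawing the timestep from a logit-normal distribution is one way of concentrating training on intermediate, perceptually relevant noise levels.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0, cond, loc=0.0, scale=1.0):
    """One training step of a rectified flow objective with logit-normal
    timestep sampling. `model(x_t, t, cond)` is an assumed interface."""
    b = x0.shape[0]
    # Logit-normal sampling: t = sigmoid(u), u ~ N(loc, scale). This puts
    # most of the training mass at intermediate noise levels.
    u = loc + scale * torch.randn(b, device=x0.device)
    t = torch.sigmoid(u).view(b, *([1] * (x0.ndim - 1)))
    noise = torch.randn_like(x0)
    # Straight-line interpolation between data and noise ...
    x_t = (1.0 - t) * x0 + t * noise
    # ... whose constant velocity is the regression target.
    target = noise - x0
    return F.mse_loss(model(x_t, t.flatten(), cond), target)
```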

TripoSR: Fast 3D Object Reconstruction from a Single Image
This technical report introduces TripoSR, a 3D reconstruction model leveraging a transformer architecture for fast feed-forward 3D generation, producing a 3D mesh from a single image in under 0.5 seconds.

Fast Timing-Conditioned Latent Audio Diffusion
Our research focuses on the efficient generation of long-form, variable-length stereo music and sounds at 44.1kHz from text prompts using a generative model.
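Variable-length generation comes from timing conditioning: the start offset and total duration of each training crop are embedded and passed alongside the text features, so a specific length can be requested at inference time. Below is a minimal sketch of such a conditioner; the MLP embedder and the normalization constant are illustrative choices, not the paper's exact module.

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Embed (seconds_start, seconds_total) as two conditioning tokens.
    The MLP embedder and `max_seconds` normalizer are illustrative."""

    def __init__(self, dim: int, max_seconds: float = 95.0):
        super().__init__()
        self.max_seconds = max_seconds
        self.embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, seconds_start: torch.Tensor, seconds_total: torch.Tensor):
        # Normalize the scalars to [0, 1] and embed each one as a token.
        start = self.embed((seconds_start / self.max_seconds).unsqueeze(-1))
        total = self.embed((seconds_total / self.max_seconds).unsqueeze(-1))
        return torch.stack([start, total], dim=1)  # (batch, 2, dim)

# E.g. ask for a 30-second clip starting at offset 0:
# tokens = TimingConditioner(dim=768)(torch.tensor([0.0]), torch.tensor([30.0]))
```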

Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
This paper presents the Hourglass Diffusion Transformer (HDiT), a backbone for high-resolution image synthesis that operates directly in pixel space. Its hierarchical, hourglass-shaped architecture scales efficiently with pixel count, handling large images without the multi-stage or latent-space pipelines that traditional methods rely on.
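To make the hourglass structure concrete, here is a schematic sketch (not the authors' implementation) of the token merge and split operations that shrink and re-expand the token grid, so most attention runs at coarse resolutions; the real model additionally uses neighborhood attention at high-resolution levels and skip connections between matching levels.

```python
import torch
import torch.nn as nn

class TokenMerge(nn.Module):
    """Merge each 2x2 block of tokens into one: half the grid, wider channels."""

    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.proj = nn.Linear(4 * dim_in, dim_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (b, h, w, c)
        b, h, w, c = x.shape
        x = x.view(b, h // 2, 2, w // 2, 2, c).permute(0, 1, 3, 2, 4, 5)
        return self.proj(x.reshape(b, h // 2, w // 2, 4 * c))

class TokenSplit(nn.Module):
    """Inverse of TokenMerge: expand each token back into a 2x2 block."""

    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.proj = nn.Linear(dim_in, 4 * dim_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (b, h, w, c)
        b, h, w, _ = x.shape
        x = self.proj(x).view(b, h, w, 2, 2, -1).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(b, 2 * h, 2 * w, -1)

# Going down the hourglass, a 64x64 token grid becomes a 32x32 grid, so
# global attention at the coarse level is far cheaper than at full resolution:
# x = torch.randn(1, 64, 64, 256)
# x = TokenMerge(256, 512)(x)   # -> (1, 32, 32, 512)
# x = TokenSplit(512, 256)(x)   # -> (1, 64, 64, 256)
```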

Adversarial Diffusion Distillation
We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1–4 steps while maintaining high image quality.
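Schematically, the ADD student is trained with two losses: an adversarial loss on its one-step samples and a distillation loss pulling those samples toward the frozen teacher's denoised estimate of them. The sketch below is a simplification under assumed interfaces; in the paper the discriminator works on pretrained vision-model features, the noising follows the teacher's schedule, and the distillation term has its own weighting.

```python
import torch
import torch.nn.functional as F

def add_student_loss(student, teacher, discriminator, noise, lam=2.5):
    """Simplified ADD student objective under assumed interfaces; the
    noising schedule and the weighting `lam` are illustrative choices."""
    # One-step generation from pure noise.
    x_student = student(noise)
    # Adversarial term: the student tries to make its samples look real.
    adv = F.softplus(-discriminator(x_student)).mean()
    # Distillation term: re-noise the sample, let the frozen teacher
    # denoise it, and pull the student sample toward that estimate.
    t = torch.rand(noise.shape[0], device=noise.device).view(-1, 1, 1, 1)
    x_noised = (1.0 - t) * x_student + t * torch.randn_like(x_student)
    with torch.no_grad():
        x_teacher = teacher(x_noised, t.flatten())
    distill = F.mse_loss(x_student, x_teacher)
    return adv + lam * distill
```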

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
We present Stable Video Diffusion — a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.

Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion
Stable Audio is a latent diffusion model for fast, timing-conditioned audio generation, developed by Harmonai, Stability AI's generative audio research lab.

Humans in 4D: Reconstructing and Tracking Humans with Transformers
Stability AI is proud to support research teams across the globe by providing compute power.