publications
* = equal contribution
2025
- arXiv: Selective Underfitting in Diffusion Models. 2025.
Diffusion models have emerged as the principal paradigm for generative modeling across various domains. During training, they learn the score function, which is in turn used to generate samples at inference. This raises a basic yet unresolved question: which score do they actually learn? In principle, a diffusion model that matches the empirical score over the entire data space would simply reproduce the training data, failing to generate novel samples. Recent work addresses this question by arguing that diffusion models underfit the empirical score due to training-time inductive biases. In this work, we refine this perspective by introducing the notion of selective underfitting: rather than underfitting the score everywhere, better diffusion models more accurately approximate the score in certain regions of input space while underfitting it in others. We characterize these regions and design empirical interventions to validate our perspective. Our results establish that selective underfitting is essential for understanding diffusion models, yielding new, testable insights into their generalization and generative performance.
@misc{song2025selective, title = {Selective Underfitting in Diffusion Models}, author = {Song, Kiwhan and Kim, Jaeyeon and Chen, Sitan and Du, Yilun and Kakade, Sham and Sitzmann, Vincent}, year = {2025}, eprint = {2510.01378}, archiveprefix = {arXiv}, primaryclass = {cs.LG}, url = {https://arxiv.org/abs/2510.01378}, }
- ICML: History-Guided Video Diffusion. In the 42nd International Conference on Machine Learning, 2025.
Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency, further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos.
@misc{song2025historyguidedvideodiffusion, title = {History-Guided Video Diffusion}, author = {Song, Kiwhan and Chen, Boyuan and Simchowitz, Max and Du, Yilun and Tedrake, Russ and Sitzmann, Vincent}, year = {2025}, eprint = {2502.06764}, archiveprefix = {arXiv}, primaryclass = {cs.LG}, url = {https://arxiv.org/abs/2502.06764}, }