WoW
Harnessing Intuitive Physics from a Scalable Embodied World Model
Anonymous ICLR Submission
Paper ID: [Under Review]
Abstract.

The WoW world model generates high-quality, physically consistent robot action videos in Out-of-Distribution (OOD) scenarios, enabling closed-loop correction and real-world robotic execution. The illustration shows the model's strong generalization across diverse tasks and environments.

World models have recently emerged as a powerful paradigm for robotics, integrating perception, action, and reasoning. However, most existing world models focus on narrow domains or rely heavily on simulation, limiting their ability to generalize to real-world embodied interaction. We introduce WoW (Harnessing Intuitive Physics from a Scalable Embodied World Model), a large-scale embodied world model designed to capture intuitive physics and enable scalable robot learning. WoW unifies multimodal inputs from diverse real and simulated datasets, and leverages embodied interaction to learn dynamics that transfer robustly across environments and robot platforms. Across extensive experiments, WoW demonstrates strong generalization in manipulation and long-horizon planning, enabling closed-loop control and real-world deployment. Our results highlight that embodied interaction and large-scale training are critical to building world-omniscient models for robotics.

Physics Understanding.
Robot Manipulation.
Real-world Scenarios.
Object Interactions and Dynamics.

Object Dynamics: A diverse collection of cube interactions showcasing different physical behaviors, collision dynamics, and environmental responses.

Artistic Interaction.
Approach.

Model Overview

(a) Inference: a latent diffusion transformer predicts future frames from image observations and text-based action descriptions. (b) Training: DINO features supervise intermediate DiT representations via a token relation distillation loss to improve spatial-temporal modeling.
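The page does not spell out the token relation distillation loss, so the sketch below is an illustrative guess at one common formulation: the "token relations" are taken to be cosine-similarity Gram matrices over the tokens, and the loss is the mean squared error between the student (DiT) and teacher (DINO) relation matrices. Function names and the Gram-matrix choice are our assumptions, not the paper's.

```python
import numpy as np

def token_relation_matrix(feats):
    """Cosine-similarity Gram matrix over tokens.

    feats: (num_tokens, dim) array of token features.
    Returns a (num_tokens, num_tokens) matrix of pairwise cosine similarities.
    """
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return normed @ normed.T

def token_relation_distillation_loss(student_feats, teacher_feats):
    """MSE between student and teacher token-relation matrices.

    Matching pairwise relations (rather than raw features) lets the DiT
    keep its own feature space while inheriting DINO's token geometry.
    """
    s = token_relation_matrix(student_feats)
    t = token_relation_matrix(teacher_feats)
    return float(np.mean((s - t) ** 2))
```

Distilling relations instead of raw features sidesteps the dimensionality mismatch between DINO and intermediate DiT activations, since both relation matrices are num_tokens × num_tokens.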

Experimental Results.

Scaling Law Comparison at Varying Data Scales. Performance comparison across four dataset sizes (30k, 200k, 600k, 2M samples) on multiple benchmarks, showing consistent improvements with increased training data, in line with established neural scaling laws.
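Neural scaling laws of this kind are typically summarized by fitting a power law, error ≈ a · N^b, which is a straight line in log-log space. The four data scales below come from the figure; the error values are synthetic placeholders for illustration only (the page reports no numeric scores), so only the fitting procedure, not the numbers, reflects the experiment.

```python
import numpy as np

# Dataset sizes from the figure; error values are synthetic, for illustration.
sizes = np.array([3e4, 2e5, 6e5, 2e6])
errors = np.array([0.42, 0.31, 0.26, 0.20])

# Fit error = a * N^b by linear regression in log-log space:
# log(error) = b * log(N) + log(a). polyfit returns [slope, intercept].
slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)

def predicted_error(n):
    """Extrapolate the fitted power law to a new dataset size n."""
    return np.exp(intercept + slope * np.log(n))
```

A negative fitted slope indicates that error keeps falling as data scales up, which is the qualitative claim the figure makes.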

Comparative Analysis Across Video Generation Models. Comprehensive evaluation of CogVideoX, Cosmos-Predict, Wan2.1, and our proposed WoW across four metrics: Video Quality (VQ), Instruction Understanding (IU), Physical Law (PL), and Planning. Results include both human and automated evaluation scores, with WoW achieving superior performance across most dimensions.

BibTeX
@article{anonymous2025wow,
  author  = {Anonymous},
  title   = {WoW: Harnessing Intuitive Physics from a Scalable Embodied World Model},
  journal = {ICLR 2026 Submission},
  year    = {2026},
  note    = {Under Review}
}