SIGGRAPH Asia 2026 Submission · Demo Page

PhysWonder: Physically Controllable Video Generation via Internalized Simulation-Coupled Constitutive Representation

Anonymous Author(s)

Abstract

With the rapid progress of video generation models, researchers have begun to explore whether they can serve as simulators of the physical world. Existing methods usually rely on external simulators to generate explicit motion proxies, such as point clouds, particles, or trajectories, which are then projected into the video domain and refined by a video generator. Another line of work translates motion information into simplified or pixel-space control conditions to guide the video generator. Despite different pipelines, both paradigms ultimately rely on projected motion cues, which provide limited support for modeling material responses under changing interactions. Our key insight is that constitutive-state representations can bridge this gap by describing the material response that links external interactions to deformation and motion. We present PhysWonder, a unified framework for physically controllable video generation built around simulation-coupled constitutive-state representations. Our framework consists of PhysWonder-3D, which enables unified constitutive simulation and scalable physically grounded data synthesis through a decomposition-based hybrid Material Point Method, and PhysWonder-Ctrl, which uses a dual-branch video generation architecture to internalize deformation-gradient-based constitutive-state control. By treating deformation gradients as constitutive-aware supervision, joint training strengthens deformation-gradient control and learns constitutive-aware latent control without requiring explicit control videos during inference. Experiments show that PhysWonder achieves state-of-the-art synthesis efficiency and robustness among simulator-coupled baselines, and improves controllability, fidelity, and motion consistency under both explicit deformation-gradient control and control-free inference.

PhysWonder highlights overview
PhysWonder: Scalable Constitutive-State Representations for Video Generation. (a) Middle: PhysWonder addresses a fundamental limitation of existing video generators. They can synthesize plausible-looking motion, but often fail to capture how materials respond to interactions, such as remaining rigid, elastically deforming, or plastically yielding. We propose scalable simulation-coupled constitutive-state representations that link material properties, deformation, and motion, and show that they can be internalized by video generation models. (b) Left: Existing baselines often fail under diverse material responses, causing collapse, hallucination, rigidification, or incoherence. PhysWonder-3D improves robustness and efficiency through unified constitutive simulation and produces aligned videos with deformation-gradient representations. (c) Right: PhysWonder-Ctrl internalizes deformation-gradient controls as generator-adapted latents. Joint training preserves particle motion and material response, while the predictor enables inference without explicit control videos.
Interactive Demo
PhysWonder teaser SOTA figure
PhysWonder-3D: Scalable Pipeline for Unified Constitutive Simulation and Data Synthesis

We introduce a decomposition-based hybrid MPM formulation that forms the core of PhysWonder-3D, a unified constitutive simulation and scalable data synthesis pipeline. It unifies diverse material behaviors within a single particle-grid simulator, achieving the lowest average runtime in every material group and the only 12/12 success rate among the compared methods, while producing high-fidelity videos and unique deformation-gradient representations of particle-level motion and material response.
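To make the central quantity concrete, the sketch below shows the textbook explicit deformation-gradient update used in standard MPM, F_new = (I + dt * grad_v) F. It is only an illustration of what a per-particle deformation gradient is, not the decomposition-based hybrid formulation of PhysWonder-3D, whose details are not given on this page. The function name, array shapes, and sample values are assumptions made for this example.

```python
import numpy as np

def update_deformation_gradient(F, grad_v, dt):
    """Advance per-particle deformation gradients by one explicit step.

    F:      (N, 3, 3) deformation gradients at the current step
    grad_v: (N, 3, 3) velocity gradients interpolated from the grid
    dt:     time step size
    (Hypothetical helper; standard MPM update, not the paper's solver.)
    """
    identity = np.eye(3)
    # Batched matrix product: one 3x3 update per particle.
    return (identity + dt * grad_v) @ F

# Two particles: one at rest, one under uniform stretch along x.
F0 = np.tile(np.eye(3), (2, 1, 1))
grad_v = np.zeros((2, 3, 3))
grad_v[1, 0, 0] = 0.5  # dv_x/dx = 0.5 for the second particle

F1 = update_deformation_gradient(F0, grad_v, dt=0.1)
volume_ratio = np.linalg.det(F1)  # J = det(F), local volume change
```

The resting particle keeps F equal to the identity, while the stretched particle accumulates a stretch along x, and det(F) tracks its local volume change. A constitutive model then maps F to stress, which is how material response enters the simulation.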

PhysWonder-3D framework figure
PhysWonder-3D: Scalable Constitutive-State Representation Synthesis. (a) Simulation-ready world building: Assets or images are converted into executable physical scenes via perception, geometry recovery, and unified constitutive simulation. (b) Scalable data synthesis: VLM-guided annotation and physics-aligned rendering produce aligned videos and constitutive-state representations of particle motion, deformation gradients, and material response.
Video comparison grid. Each cell is a rendered video unless marked "Simulation failed".

| Asset | Material | PhysGen3D (CVPR-25) | WonderPlay (ICCV-25) | PerpetualWonder (CVPR-26) | RealWonder (Current SOTA) | PhysWonder-3D (Ours) |
|---|---|---|---|---|---|---|
| Apple | Elastic (E = 1e5) | video | video | video | video | video |
| Ball | Elastic (E = 6e5) | Simulation failed | video | video | video | video |
| Bear | Elastic (E = 3e4) | video | video | video | video | video |
| Bottle | Rigid | Simulation failed | video | video | video | video |
| Mug | Rigid | Simulation failed | video | Simulation failed | video | video |
| Table | Rigid | video | video | Simulation failed | video | video |
| Bread | Sand (E = 1e4) | Simulation failed | video | video | video | video |
| Can | Elastic (E = 5e6) | video | video | video | video | video |
| Dragon | Sand (E = 5e4) | video | video | video | video | video |
| Rabbit | Snow (E = 1e5) | video | video | video | video | video |
| Shoe | Elastic (E = 5e4) | video | video | video | video | video |
| Toy | Elastic (E = 6.5e4) | video | video | Simulation failed | video | video |
PhysWonder-3D efficiency and robustness. We report average runtime (seconds) per material group and total successful runs over 12 representative scenes.
| Method | Elastic ↓ | Rigid ↓ | Sand ↓ | Snow ↓ | Succ. ↑ |
|---|---|---|---|---|---|
| PhysGen3D | 490.6 | -- | 369.7 | 402.4 | 7/12 |
| WonderPlay | 579.5 | 599.8 | 598.2 | 621.2 | 4/12 |
| PerpetualWonder | 6755.6 | 11972.9 | 6745.6 | 9355.4 | 4/12 |
| RealWonder | 312.5 | 344.8 | 282.8 | 299.9 | 8/12 |
| PhysWonder-3D | 182.8 | 197.6 | 155.7 | 162.9 | 12/12 |
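As a quick arithmetic reading of the runtime table, the sketch below computes how many times slower each baseline is than PhysWonder-3D in each material group. The runtimes are copied from the table; the dictionary layout and the `slowdown_vs` helper are assumptions made for this example, and the missing PhysGen3D rigid entry is simply skipped.

```python
# Average runtime in seconds per material group, copied from the table.
# None marks the entry reported as "--".
runtime_s = {
    "PhysGen3D":       {"Elastic": 490.6,  "Rigid": None,    "Sand": 369.7,  "Snow": 402.4},
    "WonderPlay":      {"Elastic": 579.5,  "Rigid": 599.8,   "Sand": 598.2,  "Snow": 621.2},
    "PerpetualWonder": {"Elastic": 6755.6, "Rigid": 11972.9, "Sand": 6745.6, "Snow": 9355.4},
    "RealWonder":      {"Elastic": 312.5,  "Rigid": 344.8,   "Sand": 282.8,  "Snow": 299.9},
    "PhysWonder-3D":   {"Elastic": 182.8,  "Rigid": 197.6,   "Sand": 155.7,  "Snow": 162.9},
}

def slowdown_vs(reference="PhysWonder-3D"):
    """Return baseline_runtime / reference_runtime per method and material."""
    ref = runtime_s[reference]
    result = {}
    for method, times in runtime_s.items():
        if method == reference:
            continue
        result[method] = {
            material: round(t / ref[material], 2)
            for material, t in times.items()
            if t is not None  # skip entries with no reported runtime
        }
    return result

ratios = slowdown_vs()
for method, per_material in ratios.items():
    print(method, per_material)
```

Against RealWonder, the fastest baseline, the ratios fall between roughly 1.7x and 1.8x across the four material groups.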
PhysWonder-Ctrl: Dual-Branch Modeling with Internalized Constitutive Representation

We propose PhysWonder-Ctrl, a joint modeling framework for video generation with internalized constitutive control. We leverage deformation gradients as a distinctive simulation-coupled constitutive representation, jointly optimizing the control-latent predictor and video generator for material-response-aware generation. Experiments demonstrate strong physically controllable video generation and support deformation gradients as internalizable constitutive representations, while the learned predictor enables inference without explicit control videos.

PhysWonder-Ctrl framework figure
PhysWonder-Ctrl: Internalizing Constitutive Latents. (a) Initialization: PhysWonder-3D produces prompts, aligned videos, and deformation-gradient controls from reference images and physical parameters. (b) Training: Joint modeling turns deformation-gradient representations into generator-adapted control latents. (c) Inference: The predictor provides internalized latents for material-aware generation and control-free inference.

Same image. Same motion. Different material response.

Each prompt below is shown as a six-panel video comparison: PhysWonder-3D GT, PhysWonder-Ctrl, Wan2.1-TI2V-1.3B Fine-Tuning (Prompt), ForcePrompting, Seedance 2.0, and VideoREPA.

"An elastic banana with Young's modulus E=400000 starts on the ground, initially moving to the right, causing deformation."

"A rigid banana starts on the ground, initially moving to the right, without visible deformation."

"A sand chocolate with Young's modulus E=10000 starts on the ground, initially moving to the right, causing deformation."

"A snow chocolate with Young's modulus E=40000 starts on the ground, initially moving to the right, causing deformation."

Quantitative comparison of PhysWonder-Ctrl with representative video generation models on engine-in-the-loop simulated scenes. All methods use the same reference images and prompts; fine-tuned methods use the same training scale. We report visual quality and motion consistency. Evaluation is on 82 test assets across four constitutive simulations, using PhysWonder-3D as the engine-in-the-loop reference.
| Category | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | SimFlow ↑ |
|---|---|---|---|---|---|
| Physics-Guided Models | Force-Prompting-5B + Perfect Prompt Only | 9.99 | 0.0899 | 0.8832 | 0.0189 |
| Physics-Guided Models | VideoREPA-5B + Perfect Prompt Only | 23.38 | 0.6345 | 0.2626 | 0.0675 |
| Foundation Models (w/o Fine-Tuning) | Wan2.1-TI2V-1.3B + Perfect Prompt Only | 20.41 | 0.5345 | 0.5642 | 0.0432 |
| Foundation Models (w/o Fine-Tuning) | Wan2.2-TI2V-14B + Perfect Prompt Only | 20.42 | 0.5351 | 0.5645 | 0.0423 |
| Foundation Models (w/o Fine-Tuning) | Wan2.1-TI2V-1.3B + GT Deformation-Gradient Control | 23.82 | 0.6167 | 0.2659 | 0.3165 |
| Foundation Models (w/o Fine-Tuning) | Wan2.2-TI2V-14B + GT Deformation-Gradient Control | 23.66 | 0.6179 | 0.3124 | 0.2985 |
| Ablation (Control Fine-Tuning) | Wan2.1-TI2V-1.3B + Perfect Prompt Only | 20.35 | 0.5274 | 0.6088 | 0.0182 |
| Ablation (Control Fine-Tuning) | Wan2.1-TI2V-1.3B + GT Trajectory Control | 23.59 | 0.6303 | 0.2484 | 0.2756 |
| Ablation (Control Fine-Tuning) | Wan2.1-TI2V-1.3B + GT Optical Flow Control | 20.56 | 0.4795 | 0.6645 | 0.2307 |
| Ablation (Control Fine-Tuning) | Wan2.1-TI2V-1.3B + GT Deformation-Gradient Control | 24.24 | 0.6463 | 0.2314 | 0.3398 |
| Ours (Joint Modeling) | PhysWonder-Ctrl-1.3B + GT Deformation-Gradient Control | 24.50 | 0.6513 | 0.2301 | 0.3575 |
| Ours (Joint Modeling) | PhysWonder-Ctrl-1.3B + Latent Predictor w/o GT Deformation-Gradient Control | 23.77 | 0.6359 | 0.2485 | 0.3410 |
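For readers unfamiliar with the PSNR column, the sketch below shows the conventional definition, PSNR = 10 log10(MAX^2 / MSE), applied to a generated video and its reference. The page does not state how frames are normalized or aggregated, so the [0, 1] value range, the whole-video MSE, and the `psnr` helper are assumptions made for this example. SimFlow is not defined on this page and is not sketched here.

```python
import numpy as np

def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape.

    Assumes pixel values lie in [0, max_value]; hypothetical helper,
    not the evaluation code used for the table.
    """
    reference = np.asarray(reference, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    mse = np.mean((reference - generated) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy video: 4 frames of 8x8 RGB, generated output offset by 0.1 everywhere.
reference_video = np.zeros((4, 8, 8, 3))
generated_video = reference_video + 0.1
score = psnr(reference_video, generated_video)
```

A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01 and therefore a PSNR of 20 dB, which is a useful anchor when reading the 20 to 25 dB range reported above.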

Deformation gradients provide more effective constitutive control than motion-level signals

All control-representation comparisons use the same training-data scale. Trajectory and deformation-gradient controls are obtained directly from the PhysWonder-3D simulation with exactly the same particle count, while optical-flow controls are extracted from the rendered ground-truth videos. As a result, the trajectory and optical-flow variants serve as upper-bound references for motion-level control methods such as PhysCtrl and VideoJAM.
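One way to see why a deformation gradient carries more than a motion-level signal is the polar decomposition F = R S, which splits the local deformation into a rotation R and a symmetric stretch S. Rigid motion leaves S equal to the identity, while elastic or plastic deformation does not. The sketch below is a generic continuum-mechanics illustration under that standard definition; it is not the decomposition used inside PhysWonder, and the `polar_decompose` helper and the sample values are assumptions made for this example.

```python
import numpy as np

def polar_decompose(F):
    """Split a 3x3 deformation gradient into rotation R and stretch S, F = R @ S."""
    U, sigma, Vt = np.linalg.svd(F)
    R = U @ Vt
    if np.linalg.det(R) < 0:
        # Flip one singular direction so R stays a proper rotation.
        U[:, -1] *= -1
        sigma[-1] *= -1
        R = U @ Vt
    S = Vt.T @ np.diag(sigma) @ Vt
    return R, S

# A 30-degree rotation about the z axis.
theta = np.deg2rad(30.0)
rotation = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])

F_rigid = rotation                                  # rigid motion only
F_stretched = rotation @ np.diag([1.2, 1.0, 1.0])   # same rotation plus 20% stretch

_, S_rigid = polar_decompose(F_rigid)
R_stretched, S_stretched = polar_decompose(F_stretched)

# Distance of the stretch from identity: zero for rigid motion.
rigid_deviation = np.linalg.norm(S_rigid - np.eye(3))
stretch_deviation = np.linalg.norm(S_stretched - np.eye(3))
```

Both cases share the same rotation, but only the stretched one has a stretch tensor that differs from the identity. Recovering the same distinction from trajectories or optical flow requires comparing the motion of neighboring points, whereas the deformation gradient states it per particle.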

Qualitative comparison on four assets: Alcohol, Bagel, Banana, and Chocolate Bar. For each asset, the PhysWonder-3D engine-in-the-loop GT video is shown next to each control signal and its prediction result:
Deformation-gradient control: PhysWonder-Ctrl with GT Deformation-Gradient Control
Trajectory control: Wan2.1-TI2V-1.3B with GT Trajectory Control
Optical flow control: Wan2.1-TI2V-1.3B with GT Optical Flow Control

Can the Control Latent Predictor Learn Material-Response-Aware Controls?

Using the same backbone and training data, we compare PhysWonder-Ctrl with a prompt-only fine-tuned baseline. At inference time, both methods rely only on the reference image and text prompt, without explicit deformation-gradient videos. PhysWonder-Ctrl produces more material-aware dynamics, including shape-preserving rigid motion, visible elastic deformation, and plastic collapse or spreading, suggesting that it learns deformation-informed latent controls beyond simple prompt tuning.

"A Sand water cup with Young's modulus E=4000 starts on the ground, initially moving to the right, causing deformation."

Control Latent Predictor
Prompt Only

"A Snow alcohol with Young's modulus E=1000 starts on the ground, initially moving to the right, causing deformation."

Control Latent Predictor
Prompt Only

"An Elastic banana with Young's modulus E=10000 starts on the ground, initially moving to the top, causing deformation."

Control Latent Predictor
Prompt Only

"A Rigid donut starts on the ground, initially moving to the right, without visible deformation."

Control Latent Predictor
Prompt Only
Reference-free comparison of material-response plausibility. We include the closed-source model under limited access, using reference-free metrics and multi-VLM pairwise wins.
| Metric | Seedance 2.0 | Wan2.1-TI2V-1.3B | PhysWonder-3D | PhysWonder-Ctrl (GT Deformation-Gradient) | PhysWonder-Ctrl (Predicted Control Latent) |
|---|---|---|---|---|---|
| DiffSSIM ↑ | 0.38 | 0.00 | 0.33 | 0.29 | 0.21 |
| DiffCLIP ↑ | 1.02 | 1.00 | 1.11 | 1.06 | 1.04 |
| VLM Wins ↑ | 1.6/4 | 0.1/4 | 4.0/4 | 3.0/4 | 1.3/4 |
Conclusion, Limitations and Future Work

We presented PhysWonder, a framework that connects constitutive simulation with physically controllable video generation. PhysWonder-3D provides efficient engine-in-the-loop synthesis across diverse materials, while PhysWonder-Ctrl internalizes deformation-gradient-based control for material-aware dynamic generation. Within our planned computational budget, experiments validate the proposed constitutive-state representation and joint modeling framework, demonstrating material-response control beyond motion-level guidance. PhysWonder remains partly limited by pretrained upstream modules, whose errors may affect world building and output quality. Future work includes scaling to larger generative models and developing engine-in-the-loop evaluation into a more objective protocol for physical consistency assessment. More broadly, our physically aligned synthesis pipeline may support deformable manipulation, flexible grasping, and material-aware planning.