PhysWonder Demo Page

Abstract

With the rapid progress of video generation models, researchers have begun to explore whether they can serve as simulators of the physical world. Existing methods usually rely on external simulators to generate explicit motion proxies, such as point clouds, particles, or trajectories, which are then projected into the video domain and refined by a video generator. Another line of work translates motion information into simplified or pixel-space control conditions to guide the video generator. Despite different pipelines, both paradigms ultimately rely on projected motion cues, which provide limited support for modeling material responses under changing interactions. Our key insight is that constitutive-state representations can bridge this gap by describing the material response that links external interactions to deformation and motion. We present PhysWonder, a unified framework for physically controllable video generation built around simulation-coupled constitutive-state representations. Our framework consists of PhysWonder-3D, which enables unified constitutive simulation and scalable physically grounded data synthesis through a decomposition-based hybrid Material Point Method, and PhysWonder-Ctrl, which uses a dual-branch video generation architecture to internalize deformation-gradient-based constitutive-state control. By treating deformation gradients as constitutive-aware supervision, joint training strengthens deformation-gradient control and learns constitutive-aware latent control without requiring explicit control videos during inference. Experiments show that PhysWonder achieves state-of-the-art synthesis efficiency and robustness among simulator-coupled baselines, and improves controllability, fidelity, and motion consistency under both explicit deformation-gradient control and control-free inference.

PhysWonder-3D: Scalable Pipeline for Unified Constitutive Simulation and Data Synthesis

We introduce a decomposition-based hybrid MPM formulation, which forms the core of PhysWonder-3D, a unified constitutive simulation and scalable data synthesis pipeline. It unifies diverse material behaviors within a single particle-grid simulator, achieving the best efficiency and success rate while producing high-fidelity videos and unique deformation-gradient representations of particle-level motion and material response.

PhysWonder-3D framework figure — **PhysWonder-3D: Scalable Constitutive-State Representation Synthesis.** (a) Simulation-ready world building: Assets or images are converted into executable physical scenes via perception, geometry recovery, and unified constitutive simulation. (b) Scalable data synthesis: VLM-guided annotation and physics-aligned rendering produce aligned videos and constitutive-state representations of particle motion, deformation gradients, and material response.

Asset

PhysGen3D CVPR-25

WonderPlay ICCV-25

PerpetualWonder CVPR-26

RealWonder Current SOTA

PhysWonder-3D Ours

Apple
Elastic (E = 1e5)

Elastic

Ball
Elastic (E = 6e5)

Simulation failed

Elastic

Bear
Elastic (E = 3e4)

Elastic

Bottle
Rigid

Simulation failed

Rigid

Mug
Rigid

Simulation failed

Simulation failed

Rigid

Table
Rigid

Simulation failed

Rigid

Bread
Sand (E = 1e4)

Simulation failed

Sand

Can
Elastic (E = 5e6)

Elastic

Dragon
Sand (E = 5e4)

Sand

Rabbit
Snow (E = 1e5)

Snow

Shoe
Elastic (E = 5e4)

Elastic

Toy
Elastic (E = 6.5e4)

Simulation failed

Elastic

**PhysWonder-3D efficiency and robustness.** We report average runtime (seconds) per material group and total successful runs over 12 representative scenes.
Method	Elastic ↓	Rigid ↓	Sand ↓	Snow ↓	Succ. ↑
PhysGen3D	490.6	--	369.7	402.4	7/12
WonderPlay	579.5	599.8	598.2	621.2	4/12
PerpetualWonder	6755.6	11972.9	6745.6	9355.4	4/12
RealWonder	312.5	344.8	282.8	299.9	8/12
PhysWonder-3D	182.8	197.6	155.7	162.9	12/12

PhysWonder-Ctrl: Dual-Branch Modeling with Internalized Constitutive Representation

We propose PhysWonder-Ctrl, a joint modeling framework for video generation with internalized constitutive control. We leverage deformation gradients as a distinctive simulation-coupled constitutive representation, jointly optimizing the control-latent predictor and video generator for material-response-aware generation. Experiments demonstrate strong physically controllable video generation and support deformation gradients as internalizable constitutive representations, while the learned predictor enables inference without explicit control videos.

PhysWonder-Ctrl framework figure — **PhysWonder-Ctrl: Internalizing Constitutive Latents.** (a) Initialization: PhysWonder-3D produces prompts, aligned videos, and deformation-gradient controls from reference images and physical parameters. (b) Training: Joint modeling turns deformation-gradient representations into generator-adapted control latents. (c) Inference: The predictor provides internalized latents for material-aware generation and control-free inference. **Click the 🔎 Control Latent Predictor Framework button on the figure to view details.**

Same image. Same motion. Different material response.

"An elastic banana with Young's modulus E=400000 starts on the ground, initially moving to the right, causing deformation."

Elastic banana - PhysWonder-3D — PhysWonder-3D GT

Elastic banana - PhysWonder-Ctrl — PhysWonder-Ctrl

Elastic banana - Wan2.1-TI2V-1.3B — Wan2.1-TI2V-1.3B Fine-Tuning (Prompt)

Elastic banana - ForcePrompting — ForcePrompting

Elastic banana - Seedance 2.0 — Seedance 2.0

"A rigid banana starts on the ground, initially moving to the right, without visible deformation."

Rigid banana - PhysWonder-3D — PhysWonder-3D GT

Rigid banana - PhysWonder-Ctrl — PhysWonder-Ctrl

Rigid banana - Wan2.1-TI2V-1.3B — Wan2.1-TI2V-1.3B Fine-Tuning (Prompt)

Rigid banana - ForcePrompting — ForcePrompting

Rigid banana - Seedance 2.0 — Seedance 2.0

"A sand chocolate with Young's modulus E=10000 starts on the ground, initially moving to the right, causing deformation."

Sand chocolate - PhysWonder-3D — PhysWonder-3D GT

Sand chocolate - PhysWonder-Ctrl — PhysWonder-Ctrl

Sand chocolate - Wan2.1-TI2V-1.3B — Wan2.1-TI2V-1.3B Fine-Tuning (Prompt)

Sand chocolate - ForcePrompting — ForcePrompting

Sand chocolate - Seedance 2.0 — Seedance 2.0

"A snow chocolate with Young's modulus E=40000 starts on the ground, initially moving to the right, causing deformation."

Snow chocolate - PhysWonder-3D — PhysWonder-3D GT

Snow chocolate - PhysWonder-Ctrl — PhysWonder-Ctrl

Snow chocolate - Wan2.1-TI2V-1.3B — Wan2.1-TI2V-1.3B Fine-Tuning (Prompt)

Snow chocolate - ForcePrompting — ForcePrompting

Snow chocolate - Seedance 2.0 — Seedance 2.0

**Quantitative comparison of PhysWonder-Ctrl with representative video generation models on engine-in-the-loop simulated scenes.** All methods use the same reference images and prompts; fine-tuned methods use the same training scale. We report visual quality and motion consistency. Evaluation is on 82 test assets across four constitutive simulations, using PhysWonder-3D as the engine-in-the-loop reference.
Category	Method	PSNR ↑	SSIM ↑	LPIPS ↓	Sim_Flow ↑
Physics-Guided Models	Force-Prompting-5B + Perfect Prompt Only	9.99	0.0899	0.8832	0.0189
Physics-Guided Models	VideoREPA-5B + Perfect Prompt Only	23.38	0.6345	0.2626	0.0675
Foundation Models (w/o Fine-Tuning)	Wan2.1-TI2V-1.3B + Perfect Prompt Only	20.41	0.5345	0.5642	0.0432
	Wan2.2-TI2V-14B + Perfect Prompt Only	20.42	0.5351	0.5645	0.0423
	Wan2.1-TI2V-1.3B + GT Deformation-Gradient Control	23.82	0.6167	0.2659	0.3165
	Wan2.2-TI2V-14B + GT Deformation-Gradient Control	23.66	0.6179	0.3124	0.2985
Ablation (Control Fine-Tuning)	Wan2.1-TI2V-1.3B + Perfect Prompt Only	20.35	0.5274	0.6088	0.0182
	Wan2.1-TI2V-1.3B + GT Trajectory Control	23.59	0.6303	0.2484	0.2756
	Wan2.1-TI2V-1.3B + GT Optical Flow Control	20.56	0.4795	0.6645	0.2307
	Wan2.1-TI2V-1.3B + GT Deformation-Gradient Control	24.24	0.6463	0.2314	0.3398
Ours (Joint Modeling)	PhysWonder-Ctrl-1.3B + GT Deformation-Gradient Control	24.50	0.6513	0.2301	0.3575
Ours (Joint Modeling)	PhysWonder-Ctrl-1.3B + Latent Predictor w/o GT Deformation-Gradient Control	23.77	0.6359	0.2485	0.3410

Deformation-Gradients provide more effective constitutive control than motion-level signals

All control-representation comparisons use the same training-data scale. Trajectory controls and deformation-gradient controls are directly obtained from PhysWonder-3D simulation with exactly the same particle count, while optical-flow controls are extracted from the rendered ground-truth videos. As a result, trajectory and optical flow serves as an upper-bound reference for motion-level control methods such as PhysCtrl and VideoJAM.

GT
PhysWonder-3D Engine-in-the-loop

Alcohol - GT

Control Signal

Prediction Result

Deformation
Gradient

Deformation-Gradient Control

PhysWonder-Ctrl GT Deformation-Gradient Control

Trajectory

Trajectory Control

Wan2.1-TI2V-1.3B GT Trajectory Control

Optical
Flow

Optical Flow Control

Wan2.1-TI2V-1.3B GT Optical Flow Control

GT
PhysWonder-3D Engine-in-the-loop

Bagel - GT

Control Signal

Prediction Result

Deformation
Gradient

Deformation-Gradient Control

PhysWonder-Ctrl GT Deformation-Gradient Control

Trajectory

Trajectory Control

Wan2.1-TI2V-1.3B GT Trajectory Control

Optical
Flow

Optical Flow Control

Wan2.1-TI2V-1.3B GT Optical Flow Control

GT
PhysWonder-3D Engine-in-the-loop

Banana - GT

Control Signal

Prediction Result

Deformation
Gradient

Deformation-Gradient Control

PhysWonder-Ctrl GT Deformation-Gradient Control

Trajectory

Trajectory Control

Wan2.1-TI2V-1.3B GT Trajectory Control

Optical
Flow

Optical Flow Control

Wan2.1-TI2V-1.3B GT Optical Flow Control

GT
PhysWonder-3D Engine-in-the-loop

Chocolate Bar - GT

Control Signal

Prediction Result

Deformation
Gradient

Deformation-Gradient Control

PhysWonder-Ctrl GT Deformation-Gradient Control

Trajectory

Trajectory Control

Wan2.1-TI2V-1.3B GT Trajectory Control

Optical
Flow

Optical Flow Control

Wan2.1-TI2V-1.3B GT Optical Flow Control

Can the Control Latent Predictor Learn Material-Response Aware Controls?

Using the same backbone and training data, we compare PhysWonder-Ctrl with a prompt-only fine-tuned baseline. At inference time, both methods rely only on the reference image and text prompt, without explicit deformation-gradient videos. PhysWonder-Ctrl produces more material-aware dynamics, including shape-preserving rigid motion, visible elastic deformation, and plastic collapse or spreading, suggesting that it learns deformation-informed latent controls beyond simple prompt tuning.

"A Sand water cup with Young's modulus E=4000 starts on the ground, initially moving to the right, causing deformation."

Control Latent Predictor
Prompt Only

Prompt Only

"A Snow alcohol with Young's modulus E=1000 starts on the ground, initially moving to the right, causing deformation."

Control Latent Predictor
Prompt Only

Prompt Only

"An Elastic banana with Young's modulus E=10000 starts on the ground, initially moving to the top, causing deformation."

Control Latent Predictor
Prompt Only

Prompt Only

"A Rigid donut starts on the ground, initially moving to the right, without visible deformation."

Control Latent Predictor
Prompt Only

Prompt Only

**Reference-free comparison of material-response plausibility.** We include the closed-source model under limited access, using reference-free metrics and multi-VLM pairwise wins.
Metric	Seedance 2.0	Wan2.1-TI2V-1.3B	PhysWonder-3D	PhysWonder-Ctrl GT Deformation-Gradient	PhysWonder-Ctrl Predicted Control Latent
Diff_SSIM ↑	0.38	0.00	0.33	0.29	0.21
Diff_CLIP ↑	1.02	1.00	1.11	1.06	1.04
VLM Wins ↑	1.6/4	0.1/4	4.0/4	3.0/4	1.3/4

Conclusion, Limitations and Future Work

We presented PhysWonder, a framework that connects constitutive simulation with physically controllable video generation. PhysWonder-3D provides efficient engine-in-the-loop synthesis across diverse materials, while PhysWonder-Ctrl internalizes deformation-gradient-based control for material-aware dynamic generation. Within our planned computational budget, experiments validate the proposed constitutive-state representation and joint modeling framework, demonstrating material-response control beyond motion-level guidance. PhysWonder remains partly limited by pretrained upstream modules, whose errors may affect world building and output quality. Future work includes scaling to larger generative models and developing engine-in-the-loop evaluation into a more objective protocol for physical consistency assessment. More broadly, our physically aligned synthesis pipeline may support deformable manipulation, flexible grasping, and material-aware planning.