
SANA-WM: NVIDIA’s Open-Source 2.6B Parameter Video World Model

May 16, 2026 · 9 min read · By Rafael

Table of Contents

  • Introduction to SANA-WM
  • Architecture and Design Innovations
  • Training Strategy and Efficiency
  • Applications and Industry Implications
  • Conclusion and Future Directions

Introduction to SANA-WM

NVIDIA’s SANA-WM is a 2.6 billion parameter open-source world model designed to generate high-fidelity, minute-long 720p videos with precise camera control, all on accessible hardware. This model pushes the frontier of video synthesis by natively supporting minute-scale generation conditioned on 6-degree-of-freedom (6-DoF) camera trajectories. A 6-DoF trajectory specifies both the position (x, y, z) and orientation (pitch, yaw, roll) of the camera at each frame, allowing for complex and realistic camera movements. SANA-WM marks a significant achievement in long-horizon video generation, allowing researchers and developers to create extended, coherent video scenes with a high degree of control.
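
To make the trajectory format concrete, the snippet below builds a simple orbiting camera path as a tensor of [x, y, z, pitch, yaw, roll] rows, one per frame. The layout, units, and frame rate here are illustrative assumptions, not the exact input format SANA-WM consumes.

import torch

# Illustrative only: a 6-DoF trajectory laid out as rows of
# [x, y, z, pitch, yaw, roll], one per frame. The exact format and units
# SANA-WM expects may differ; this simply mirrors the description above.
num_frames = 1440                                  # e.g. 60 s at 24 fps
t = torch.linspace(0, 2 * torch.pi, num_frames)

radius = 5.0                                       # metres from the scene centre
trajectory = torch.stack([
    radius * torch.cos(t),                         # x
    torch.full((num_frames,), 1.6),                # y: constant camera height
    radius * torch.sin(t),                         # z
    torch.zeros(num_frames),                       # pitch (degrees)
    torch.rad2deg(t),                              # yaw sweeps a full turn over the orbit
    torch.zeros(num_frames),                       # roll
], dim=1)                                          # shape: [num_frames, 6]
print(trajectory.shape)                            # torch.Size([1440, 6])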

This world model stands out by making minute-long, 720p video generation possible on hardware that is within reach for many research teams. By supporting precise camera trajectories, SANA-WM enables applications that demand more than just visual quality: they require the ability to script and control the camera’s movement throughout a lengthy scene. These capabilities set the stage for new practical uses in simulation, content creation, and embodied AI.

Compared with previous models that required vast computational resources, SANA-WM achieves visual quality competitive with industrial-scale systems on a much leaner setup. The model was trained on approximately 213,000 publicly available video clips annotated with metric-scale pose information, completing the process in just 15 days on 64 NVIDIA H100 GPUs. For deployment, its distilled version can generate a full 60-second 720p video in about 34 seconds on a single RTX 5090 GPU with quantization, making real-world use feasible for smaller teams.

The combination of high resolution, long video duration, precise control, and efficiency unlocks practical applications in embodied AI, simulation, virtual production, and content creation. By releasing SANA-WM as open source, NVIDIA makes advanced video world modeling accessible to a broader audience. For background on why efficient infrastructure is critical to AI model deployment, see Kioxia and Dell Cram 10PB into Slim 2RU Server: The Storage Density Milestone of 2026.

Architecture and Design Innovations

SANA-WM’s architecture addresses three core challenges in long-duration, high-resolution video synthesis: scaling to large token counts, providing accurate, continuous camera control, and maintaining visual fidelity over minute-long sequences. The design includes four main innovations:

  • Hybrid Linear Attention: SANA-WM combines frame-wise Gated DeltaNet (GDN) blocks with periodic softmax attention. GDN is a neural network module that enables efficient recurrent aggregation of context within individual frames, allowing the model to remember relevant details without excessive memory usage. Softmax attention, on the other hand, provides exact long-range recall across frames, ensuring that the model can connect information throughout the entire video. This hybrid approach allows SANA-WM to process minute-scale 720p video sequences efficiently, balancing memory footprint with modeling power. A simplified sketch of this interleaving pattern appears at the end of this section.
  • Dual-Branch Camera Control: Precise trajectory following is essential for realistic video synthesis. The model uses a dual-rate conditioning design. The latent-rate Unified Camera Positional Encoding (UCPE) branch captures the overall structure of the camera path, while the raw-frame Plücker mixing branch restores fine-grained camera motion within each temporal stride of the video’s latent representation. Plücker coordinates are a mathematical way to represent lines in 3D space, which helps the model understand and reconstruct intricate camera paths. This architecture preserves the fidelity of camera control even when videos are compressed aggressively. A sketch of the Plücker ray parameterization follows the table below.
  • Two-Stage Generation Pipeline: SANA-WM produces an initial rough video sequence in the first stage, followed by a dedicated refinement stage. The long-video refiner corrects structural artifacts, sharpens visual details, and improves temporal consistency across the entire one-minute output. This two-step process results in higher-quality and more coherent videos.
  • Robust Data Annotation Pipeline: Precise metric-scale 6-DoF camera pose annotations are critical for training. SANA-WM extracts these annotations from public videos using advanced pose and geometry estimation techniques. This pipeline produces over 213,000 clips with high-quality labels, enabling the model to learn accurate camera-conditioned dynamics.

Component | Design Feature | Purpose
Hybrid Linear Attention | GDN + Softmax Attention | Memory-efficient long-context modeling for minute-long video
Dual-Branch Camera Control | UCPE + Plücker Mixing | Precise 6-DoF trajectory adherence under compression
Two-Stage Pipeline | Initial Generation + Refinement | Improved visual quality and temporal consistency
Data Annotation | Metric-scale pose extraction | High-quality supervision for action-conditioned learning
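
To make the Plücker mixing idea more tangible, the sketch below computes per-pixel Plücker ray coordinates (direction and moment) from camera intrinsics and a camera-to-world pose. The function and conventions are a generic illustration of the parameterization, not SANA-WM’s actual implementation.

import torch

def plucker_ray_embedding(K, cam_to_world, height, width):
    # Per-pixel Plücker coordinates (d, o x d) for a pinhole camera.
    # K:            [3, 3] camera intrinsics
    # cam_to_world: [4, 4] camera-to-world pose
    # Returns:      [height, width, 6] ray embedding
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    # Pixel centres in homogeneous image coordinates
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)   # [H, W, 3]

    # Back-project pixels to ray directions in camera space, then rotate to world space
    dirs_cam = pix @ torch.linalg.inv(K).T
    R = cam_to_world[:3, :3]
    t = cam_to_world[:3, 3]
    dirs_world = torch.nn.functional.normalize(dirs_cam @ R.T, dim=-1)

    # Plücker coordinates: direction d and moment m = o x d, with o the camera centre
    origin = t.expand_as(dirs_world)
    moment = torch.cross(origin, dirs_world, dim=-1)
    return torch.cat([dirs_world, moment], dim=-1)                         # [H, W, 6]

# Example with illustrative intrinsics and an identity pose
K = torch.tensor([[800.0, 0.0, 640.0],
                  [0.0, 800.0, 360.0],
                  [0.0, 0.0, 1.0]])
rays = plucker_ray_embedding(K, torch.eye(4), height=720, width=1280)
print(rays.shape)   # torch.Size([720, 1280, 6])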

For example, if a filmmaker wants to produce a continuous shot that follows a character through a dynamic scene, SANA-WM’s camera control and minute-scale synthesis allow the creator to specify an exact trajectory for the camera to follow, resulting in a smooth, cinematic output that matches the planned choreography.
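
As a rough picture of the hybrid attention layout described earlier, the toy sketch below interleaves a simple gated linear recurrence (a stand-in for Gated DeltaNet, ignoring its frame-wise structure) with periodic full softmax attention. The block designs, depth, and period are illustrative assumptions, not the SANA-WM architecture.

import torch
import torch.nn as nn

class GatedLinearRecurrence(nn.Module):
    # Toy stand-in for a Gated DeltaNet block: a gated recurrent accumulation
    # over tokens that runs in linear time in sequence length.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                          # x: [batch, seq, dim]
        g = torch.sigmoid(self.gate(x))
        v = self.value(x)
        state = torch.zeros_like(x[:, 0])
        states = []
        for t in range(x.size(1)):                 # linear-time recurrent scan
            state = g[:, t] * state + (1 - g[:, t]) * v[:, t]
            states.append(state)
        return x + self.out(torch.stack(states, dim=1))

class SoftmaxAttentionBlock(nn.Module):
    # Standard full self-attention, used only periodically for exact long-range recall.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out

class HybridBackbone(nn.Module):
    # Interleaving pattern: mostly linear-attention-style blocks, with one
    # softmax attention block every `period` layers.
    def __init__(self, dim, depth=12, period=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            SoftmaxAttentionBlock(dim) if (i + 1) % period == 0 else GatedLinearRecurrence(dim)
            for i in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

# Tiny smoke test on random tokens
backbone = HybridBackbone(dim=256, depth=12, period=4)
tokens = torch.randn(1, 1024, 256)                 # [batch, sequence, dim]
print(backbone(tokens).shape)                      # torch.Size([1, 1024, 256])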

Training Strategy and Efficiency

Training SANA-WM on minute-long 720p videos presented several scalability challenges. The team addressed these challenges with a progressive training approach and architectural optimizations:

  • High-Compression Latents: SANA-WM uses LTX2-VAE, a variational autoencoder (VAE) that provides 2× to 8× better spatiotemporal compression compared to previous video VAEs. VAEs are neural networks that compress input data into a lower-dimensional latent space and reconstruct it. Greater compression means fewer tokens for the transformer to process, reducing memory and compute needs. A rough token-count illustration follows this list.
  • Progressive Sequence Scaling: Training begins with short clips and gradually increases sequence length to a full minute. This staged method helps the model stabilize and adapt as it learns to handle longer sequences, preventing the system from being overwhelmed by the complexity of full-length videos early in training.
  • Efficient Backbone: The hybrid GDN-softmax transformer backbone is carefully designed to balance efficiency with the ability to capture long-range dependencies, making it feasible to synthesize minute-scale video on current hardware.
  • Training Resources: The entire process completes in 15 days using 64 NVIDIA H100 GPUs, a relatively modest compute budget for a model of this scale and resolution. By comparison, other state-of-the-art models often require significantly more hardware and time.
  • Inference Modes: SANA-WM supports three inference variants: a bidirectional offline generator for highest quality, a chunk-causal autoregressive generator for sequential rollouts, and a distilled autoregressive generator optimized for rapid single-GPU deployment. For instance, the distilled model can denoise a one-minute, 720p video in under 35 seconds on a single RTX 5090 GPU using NVFP4 quantization, a technique that compresses neural network weights for faster inference.
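
As a back-of-the-envelope illustration of why higher VAE compression matters, the snippet below estimates latent token counts for one minute of 720p video under two different spatiotemporal downsampling settings. The downsampling factors and patch size are assumptions chosen for illustration, not published SANA-WM or LTX2-VAE figures.

# Rough token-count estimate: how spatiotemporal compression shrinks the
# sequence a video transformer must attend over. All factors below are
# illustrative assumptions, not published SANA-WM / LTX2-VAE numbers.
def estimate_tokens(frames, height, width, temporal_ds, spatial_ds, patch=2):
    latent_t = frames // temporal_ds
    latent_h = height // spatial_ds
    latent_w = width // spatial_ds
    return latent_t * (latent_h // patch) * (latent_w // patch)

frames, h, w = 24 * 60, 720, 1280            # one minute at 24 fps

baseline = estimate_tokens(frames, h, w, temporal_ds=4, spatial_ds=8)
aggressive = estimate_tokens(frames, h, w, temporal_ds=8, spatial_ds=16)

print(f"baseline latent tokens:   {baseline:,}")
print(f"aggressive latent tokens: {aggressive:,}")
print(f"reduction factor:         {baseline / aggressive:.1f}x")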

The following practical example shows how to use SANA-WM for camera-conditioned video synthesis in Python. This example generates a one-minute video based on an initial frame and a sequence of camera positions and orientations:

import torch
from sana_wm import SANAWorldModel

# Load pretrained distilled SANA-WM model for fast inference
model = SANAWorldModel.from_pretrained("nvlabs/sana-wm-distilled")

# Inputs: first-frame image tensor and a 6-DoF camera trajectory with one
# [x, y, z, pitch, yaw, roll] row per control step (sampling rate illustrative)
first_frame = torch.load("input_frame.pt")              # shape: [3, 720, 1280]
camera_trajectory = torch.load("camera_trajectory.pt")  # shape: [60, 6]

# Generate a one-minute 720p video conditioned on the first frame and trajectory
output_video = model.generate_video(first_frame, camera_trajectory, resolution=(720, 1280))

# Save or visualize output
torch.save(output_video, "generated_video.pt")

*Note: This code is a minimal example for illustration purposes. Production use should add error handling, batching, and further optimizations for deployment at scale.*

The efficiency of SANA-WM’s workflow means that teams can experiment with different camera paths and scene conditions rapidly, enabling faster iteration in research and creative projects.

Applications and Industry Implications

SANA-WM’s combination of high resolution, long duration, and precise camera control opens up new possibilities across a range of fields. Here are some practical examples:

  • Embodied AI and Robotics Simulation: Developers can use SANA-WM to generate realistic, interactive simulations with metric-scale camera trajectories. For instance, a robotics team could create virtual environments for navigation and control research, reducing the need for expensive physical hardware or proprietary simulators.
  • Virtual Production and Content Creation: Filmmakers and game developers benefit from the ability to generate immersive scenes with controllable camera paths. This accelerates content pipelines and minimizes reliance on manual animation or time-consuming 3D rendering, making creative iteration more accessible and less resource-intensive.
  • Research and Benchmarking: The model and its accompanying one-minute benchmark set a new standard for evaluating long-horizon video synthesis, action-following accuracy, and efficiency. For example, the benchmark includes scenes generated by Nano Banana Pro across a variety of environments, each paired with revisit trajectories that test the model’s ability to maintain consistency and realism over time. Researchers can use these benchmarks to compare new models against SANA-WM’s results.
  • Democratizing Access: By reducing computational barriers and making the code open source, SANA-WM allows smaller teams and academic labs to experiment with world modeling. This supports innovation beyond large industrial labs and encourages broader participation.

These developments reflect a broader pattern in the AI sector, where scalable and efficient models are increasingly valued. For example, advances in storage density, such as the recent Kioxia and Dell 9.8PB 2RU flash server, are addressing data processing bottlenecks that affect the training and deployment of models like SANA-WM. For more on data integrity and model security, see SQL Fraud Detection Patterns in 2026: Why They Still Matter.

Suppose a university lab wants to test navigation algorithms for a drone in a virtual city. By generating a variety of video scenes with different camera paths, the team can simulate complex environments and evaluate their algorithms’ robustness, all without physical flight tests.
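
Continuing with the hypothetical API from the earlier snippet, such a study might sweep several candidate flight paths and render one clip per trajectory for downstream evaluation; file names and function signatures remain illustrative.

import torch
from sana_wm import SANAWorldModel  # hypothetical package name, as above

model = SANAWorldModel.from_pretrained("nvlabs/sana-wm-distilled")

first_frame = torch.load("city_frame.pt")           # [3, 720, 1280] start image
trajectories = {
    "low_pass":  torch.load("traj_low_pass.pt"),    # [N, 6] poses each
    "rooftop":   torch.load("traj_rooftop.pt"),
    "alley_run": torch.load("traj_alley_run.pt"),
}

# Render one clip per candidate flight path for downstream evaluation
for name, traj in trajectories.items():
    video = model.generate_video(first_frame, traj, resolution=(720, 1280))
    torch.save(video, f"sim_{name}.pt")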

Conclusion and Future Directions

NVIDIA’s SANA-WM is a significant advance in open-source video world modeling. It shows that high-resolution, minute-scale video synthesis with precise camera control can be done with accessible compute and data budgets. The hybrid attention architecture, dual-branch conditioning, and two-stage refinement pipeline provide a blueprint for future long-horizon video models.

Looking forward, several promising directions could extend SANA-WM’s capabilities:

  • Multi-View and Multi-Modal Generation: Adding support for multiple input views or integrating audio and text could improve scene understanding and the realism of generated videos.
  • Physics and Dynamics Modeling: Incorporating physical simulation will enable the generation of plausible object interactions and more realistic environmental dynamics within the generated scenes.
  • Real-Time Control: Optimizing inference for interactive, real-time applications, such as virtual reality or robotics control, will make SANA-WM applicable in scenarios that demand immediate feedback and adaptability.
  • Broader Action Spaces: Expanding control beyond camera trajectories to include object manipulation or agent behavior will open up new research and application domains.

NVIDIA’s open-source release and benchmark invite the research community to build on this foundation and continue advancing interactive, embodied AI and video synthesis. For more technical details or to access the code, visit the official project page and GitHub repository: https://nvlabs.github.io/Sana/WM/.

Key Takeaways:

  • SANA-WM is a highly efficient 2.6 billion parameter open-source video world model capable of generating minute-long, 720p camera-controlled videos.
  • Its hybrid linear attention and dual-branch camera control architecture allow for long-context modeling and precise 6-DoF trajectory adherence.
  • Progressive training on 213,000 public video clips and a two-stage refinement pipeline deliver competitive quality with accessible compute requirements.
  • Inference can be performed on a single GPU, making this technology practical for broader research and production scenarios.
  • Applications span embodied AI, virtual production, and content creation, with the open-source release supporting further innovation across the field.

