SANTA CLARA — NVIDIA released its open omni-model NVIDIA Cosmos 3 for physical AI reasoning and action on June 1, 2026. The model is available on Hugging Face and integrates with the Hugging Face Diffusers library through the Cosmos3OmniPipeline.
NVIDIA Cosmos 3 combines world generation, physical reasoning, and action generation in a single model. It processes five modalities (text, image, video, audio, and action) using a Mixture-of-Transformers (MoT) architecture. The model can generate realistic, physically plausible video worlds from text, image, video, or action inputs, and it supports tasks such as text-to-video, text-to-action, image-to-video-and-action, and action-to-video.
The system uses dedicated encoders for each input type: a Vision Transformer (ViT) for visual understanding, a Variational Autoencoder (VAE) for visual and audio generation, and domain-aware vectors for actions. Within the model, input sequences are split into an autoregressive (AR) subsequence for reasoning and a diffusion (DM) subsequence for generation. AR and DM tokens use separate parameter sets in each transformer layer but interact through joint attention mechanisms.
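The split between separate per-stream parameters and joint attention can be illustrated with a toy layer. This is a minimal NumPy sketch of the general Mixture-of-Transformers idea, not NVIDIA's implementation: the class name `MoTLayer`, the single attention head, the dimensions, and the omission of attention masks, normalization, and feed-forward blocks are all simplifying assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MoTLayer:
    """Toy single-head layer (hypothetical): per-stream weights, one joint attention."""

    def __init__(self, dim, rng):
        self.dim = dim
        # One independent parameter set per stream:
        # "ar" for reasoning tokens, "dm" for generation (diffusion) tokens.
        self.params = {
            stream: {
                name: rng.standard_normal((dim, dim)) / np.sqrt(dim)
                for name in ("q", "k", "v", "o")
            }
            for stream in ("ar", "dm")
        }

    def __call__(self, x, is_dm):
        # x: (seq_len, dim) token features; is_dm: boolean (seq_len,),
        # True where a token belongs to the diffusion subsequence.
        q, k, v = np.empty_like(x), np.empty_like(x), np.empty_like(x)
        for stream, mask in (("ar", ~is_dm), ("dm", is_dm)):
            p = self.params[stream]
            # Each token is projected with its own stream's weights.
            q[mask] = x[mask] @ p["q"]
            k[mask] = x[mask] @ p["k"]
            v[mask] = x[mask] @ p["v"]
        # Joint attention: every token attends over the full mixed sequence,
        # so AR and DM tokens exchange information here.
        scores = q @ k.T / np.sqrt(self.dim)
        attended = softmax(scores) @ v
        out = np.empty_like(x)
        for stream, mask in (("ar", ~is_dm), ("dm", is_dm)):
            # Output projection is again stream-specific.
            out[mask] = attended[mask] @ self.params[stream]["o"]
        return x + out  # residual connection


rng = np.random.default_rng(0)
layer = MoTLayer(dim=8, rng=rng)
x = rng.standard_normal((6, 8))
is_dm = np.array([False, False, False, True, True, True])
y = layer(x, is_dm)
```

In this sketch, changing a DM token's input alters the outputs of AR tokens as well, because the attention step is shared even though the projection weights are not.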
NVIDIA Cosmos 3 can reason about physical properties including motion, causality, and spatial relationships, and it predicts future video and action sequences based on the current state. For optimal results, NVIDIA recommends using detailed narrative paragraphs for video generation prompts and concise, spatially referenced prompts for action generation.
Two model sizes are available. NVIDIA Cosmos 3 Nano is an 8-billion-parameter model with an 8B reasoner and an 8B generator, optimized for efficient inference on workstation-grade hardware such as the RTX PRO 6000 GPU. NVIDIA Cosmos 3 Super is a 32-billion-parameter model with a 32B reasoner and a 32B generator; it is designed for large-scale synthetic data generation and research and runs on NVIDIA Hopper and Blackwell GPUs. The models are available on Hugging Face as nvidia/Cosmos3-Nano and nvidia/Cosmos3-Super, respectively.
Alongside the model release, NVIDIA published open synthetic data generation (SDG) datasets for physical AI on Hugging Face. The datasets were created by several NVIDIA teams. Post-training scripts for training on custom data are available on GitHub.
No independent assessment was available for this report.