NVIDIA Nemotron 3 Ultra: Inside the Fastest Open-Weight AI Model Built in the US

NVIDIA's 550B-parameter Nemotron 3 Ultra is now the top US open-weight model. Here's what its hybrid Mamba-Transformer design means for builders.

By Hadidiz Flow Team • June 19, 2026 • AI

A New Open-Weight Leader Arrives From NVIDIA

On June 4, 2026, at Computex, NVIDIA released Nemotron 3 Ultra — a 550-billion-parameter open-weight language model the company is calling the most intelligent open-weight model built in the US. It's a meaningful marker for the open-source AI race: a US-trained, fully open model that can credibly compete with the best Chinese open-weight systems on raw capability, while running considerably faster.

For builders and agencies evaluating which models to put into production, Nemotron 3 Ultra is worth a close look — not just for its benchmark scores, but for the architecture decisions behind it.

Inside the Architecture: Mamba Meets Mixture-of-Experts

Nemotron 3 Ultra uses a hybrid design that blends Mamba-2 layers with traditional Transformer attention layers, wrapped in a Mixture-of-Experts (MoE) structure. Of its 550 billion total parameters, only about 55 billion are active per token — which is what keeps inference fast despite the model's size.

The Mamba-2 layers do most of the heavy lifting on long sequences. Unlike standard attention, which gets quadratically more expensive as context grows, Mamba layers scale far more efficiently. That's the key technical reason Nemotron 3 Ultra can support a 1-million-token context window in practice rather than just on a spec sheet. NVIDIA also pre-trained the model in NVFP4, a lower-precision format that further improves throughput.

The result, according to early benchmarking on hosted endpoints, is over 300 output tokens per second — three to six times faster than comparable Chinese open-weight models like DeepSeek V4 Pro and Kimi K2.6, which run at roughly 50-100 tokens per second.

Benchmarks: Fast, Capable — But Not the World Leader

On the Artificial Analysis Intelligence Index, Nemotron 3 Ultra scores 47.7, putting it comfortably ahead of the next-best US open-weight models: Gemma 4 31B (39.2) and NVIDIA's own smaller Nemotron 3 Super (36.0). That makes it the strongest open-weight model trained in the US by a clear margin.

It's worth being honest about where it stands globally, though. China's Kimi K2.6 still leads the open-weight field overall with a score of 53.9. Nemotron 3 Ultra's real edge isn't raw intelligence-index supremacy — it's the combination of strong reasoning, a genuinely usable long context window, and inference speed that few models in its capability class can match.

A Genuinely Open License

NVIDIA didn't just release the model weights. The company published base weights, post-trained checkpoints, reward models, NVFP4-quantized variants, training recipes, and datasets — all under OpenMDW-1.1, a permissive open AI model license maintained by the Linux Foundation.

OpenMDW is designed to cover the entire bundle of model materials (architecture, parameters, documentation, and related software) under one consistent set of terms, rather than the patchwork of custom licenses many model releases ship with. For companies that need legal clarity before deploying an open model commercially, that consistency reduces a real source of friction.

Nemotron 3 Ultra is already available on Hugging Face, OpenRouter, and NVIDIA NIM, so there's no shortage of ways to start testing it.

What It Means for Builders and Agencies

A few practical takeaways if you're choosing models for agents, RAG pipelines, or other production AI workloads:

Long-context agents get cheaper to run. The Mamba-2/Transformer hybrid makes million-token context genuinely usable rather than prohibitively slow, which matters for agents that need to reason over long documents, codebases, or conversation histories.
Speed is now a differentiator, not just intelligence. At 300+ tokens per second, Nemotron 3 Ultra is fast enough for latency-sensitive, long-running agent workflows where slower frontier models struggle.
Licensing clarity is underrated. OpenMDW-1.1's single, comprehensive license is a real advantage over open releases that bundle separate terms for weights, code, and datasets.
It's a genuine alternative to proprietary APIs for self-hosted or NVIDIA NIM deployments — particularly for teams that want an open, inspectable model with US provenance.

Key Takeaways

NVIDIA released Nemotron 3 Ultra on June 4, 2026: 550B total parameters, 55B active, hybrid Mamba-2/Transformer/MoE architecture, 1M-token context.
It's the strongest US-trained open-weight model on the Artificial Analysis Intelligence Index (47.7), though still behind China's Kimi K2.6 (53.9).
It runs at 300+ tokens per second — 3-6x faster than comparable Chinese open-weight models.
Released under OpenMDW-1.1, a permissive, Linux Foundation-backed license covering weights, code, and datasets.
Available now on Hugging Face, OpenRouter, and NVIDIA NIM — worth evaluating for long-context, latency-sensitive agent workloads.

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.