
Inception Labs, a startup founded by Stanford, UCLA, and Cornell professors Stefano Ermon, Aditya Grover, and Volodymyr Kuleshov, has launched Mercury, a groundbreaking series of diffusion-based large language models (dLLMs) that marks a significant shift away from today's autoregressive large language models. Mercury is up to 10 times faster than existing speed-optimized LLMs at the technology frontier, generating over 1,000 tokens per second on NVIDIA H100 GPUs, a speed previously achievable only with specialized AI chips. The first commercial product in the Mercury series is Mercury Coder, which delivers exceptional performance in code generation while maintaining quality across benchmarks when compared with leading speed-optimized autoregressive models such as GPT-4o Mini and Claude 3.5 Haiku.
Limitations of Traditional Autoregressive Models:
Current large language models are autoregressive: they generate text sequentially from left to right, one token at a time. Producing each token requires a neural network to evaluate billions of parameters, which drives up latency and computational cost, especially when the model performs complex reasoning tasks. This inherently sequential process limits both the speed and the efficiency of traditional LLMs.
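To make the sequential bottleneck concrete, here is a minimal sketch of greedy autoregressive decoding, using a small Hugging Face model purely for illustration. Every generated token costs one full forward pass through the network:

```python
# Minimal greedy autoregressive decoding loop; gpt2 stands in for any causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("def fibonacci(n):", return_tensors="pt").input_ids
for _ in range(50):                       # 50 tokens -> 50 sequential forward passes
    logits = model(ids).logits
    next_id = logits[0, -1].argmax()      # greedily pick the single next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tokenizer.decode(ids[0]))
```

Because each iteration depends on the token produced by the previous one, the loop cannot be parallelized across positions; latency grows linearly with output length.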
Mercury’s Diffusion-Based Approach: Enhancing Speed and Efficiency:
Mercury is built on diffusion large language models (dLLMs), which use a "coarse-to-fine" generation process: the output begins as pure noise and is iteratively refined over several denoising steps. Because diffusion models generate and modify large blocks of text in parallel rather than sequentially, the model can reason about and structure its responses more effectively, leading to fewer mistakes and hallucinations.
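Inception Labs has not published Mercury's exact algorithm, but the coarse-to-fine idea can be sketched with a toy masked-diffusion decoder: start from a fully masked ("pure noise") sequence, predict every position in parallel at each step, and commit only the most confident tokens until none remain. The denoiser below is a random stub standing in for the trained network:

```python
# Toy coarse-to-fine masked-diffusion decoding; a stub replaces the trained denoiser.
import torch

VOCAB, LENGTH, STEPS = 1000, 16, 4
MASK = VOCAB                                       # mask id outside the vocabulary

def denoiser(tokens):
    # Stand-in for the trained network: logits for every position at once.
    return torch.randn(tokens.shape[0], VOCAB)

tokens = torch.full((LENGTH,), MASK)               # "pure noise": all positions masked
for step in range(STEPS):
    logits = denoiser(tokens)                      # predict all positions in parallel
    conf, preds = logits.softmax(-1).max(-1)
    masked = tokens == MASK
    k = int(masked.sum()) // (STEPS - step)        # unmask a fraction each step
    scores = torch.where(masked, conf, torch.full_like(conf, -1.0))
    keep = scores.topk(k).indices                  # commit the most confident positions
    tokens[keep] = preds[keep]
print(tokens)                                      # fully denoised after STEPS passes
```

Here four parallel passes replace sixteen sequential ones; in a real dLLM the savings grow with sequence length, which is where the throughput advantage comes from.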
So far, diffusion models have been used mainly in image, video, and audio generation; Sora, Midjourney, and Riffusion are prominent examples. The introduction of Mercury Coder therefore represents a significant breakthrough: a successful application of diffusion models to text and code generation. Mercury Coder, purpose-built for code generation, is up to 10x faster than existing speed-optimized LLM applications while maintaining output quality.
Performance Benchmarks:
Speed Advantage:
Mercury Coder generates over 1,000 tokens per second on standard NVIDIA H100 GPUs, roughly a 5x speedup over current autoregressive models, which even when optimized for speed typically process up to 200 tokens per second. Compared with some leading models that run at fewer than 50 tokens per second, Mercury Coder offers more than a 20x improvement.
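As a quick back-of-the-envelope check of what those throughputs mean for wall-clock latency (the 500-token completion length is an illustrative assumption):

```python
# Rough wall-clock latency for a 500-token completion at the quoted throughputs.
completion_tokens = 500
for name, tokens_per_second in [("Mercury Coder", 1000),
                                ("speed-optimized AR model", 200),
                                ("frontier AR model", 50)]:
    print(f"{name}: {completion_tokens / tokens_per_second:.1f} s")
# Mercury Coder: 0.5 s; speed-optimized AR: 2.5 s; frontier AR: 10.0 s
```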
Hardware Independence:
The throughput dLLMs achieve was previously attainable only with specialized hardware from providers such as Groq, Cerebras, and SambaNova. Mercury Coder's gains, however, come from algorithmic improvements rather than custom silicon, which means they are not tied to any particular accelerator and will compound further as chips get faster.
Developer Preference:
Developers prefer Mercury's code completions over those of existing code models. In the Copilot Arena benchmark, Mercury Coder Mini tied for second place, outperforming speed-optimized models such as GPT-4o Mini and Gemini-1.5-Flash as well as larger models like GPT-4o, while running roughly four times faster than GPT-4o Mini.
Enterprise Applications:
Mercury is now available for testing in a playground hosted in partnership with Lambda Labs, allowing developers to experience its speed and accuracy firsthand. Mercury Coder Mini and Mercury Coder Small are available via API or on-premise deployment. Inception Labs states that both models are fully compatible with existing hardware, datasets, and supervised fine-tuning (SFT) and alignment (RLHF) pipelines. For enterprise customers, the company offers fine-tuning support for both deployment options, enabling Mercury Coder to be adapted to a range of use cases. A Mercury model designed for conversational applications is currently in closed beta testing. The company is actively testing its technology, and some customers have already begun replacing autoregressive models with Mercury.
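For illustration, here is a sketch of what calling the API might look like, assuming an OpenAI-compatible endpoint; the base URL and model identifier below are assumptions for illustration, not confirmed details from Inception Labs' documentation:

```python
# Hypothetical client call; endpoint URL and model name are assumed, not documented facts.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",   # assumed endpoint
    api_key="YOUR_API_KEY",
)
response = client.chat.completions.create(
    model="mercury-coder-small",                  # assumed model identifier
    messages=[{"role": "user",
               "content": "Write a Python function that reverses a string."}],
)
print(response.choices[0].message.content)
```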
Future Directions:
While Inception Labs has not publicly disclosed its roadmap, it emphasizes its commitment to advancing AI technologies and delivering best-in-class models and solutions. The company offers API access and supports on-premise deployments for enterprise customers, signaling a focus on expanding Mercury's applications and accessibility.
We are excited to introduce Mercury, the first commercial-grade diffusion large language model (dLLM)! dLLMs push the frontier of intelligence and speed with parallel, coarse-to-fine text generation.
— Inception Labs (@InceptionAILabs) February 26, 2025
Inception Labs' Mercury marks a paradigm shift in large language model technology, challenging autoregressive models with a diffusion-based approach that offers greater speed and efficiency. By achieving over 1,000 tokens per second on standard hardware while maintaining response quality, Mercury Coder demonstrates that diffusion models can revolutionize text generation. As the technology matures, it could transform how language models are developed and deployed, making high-quality AI more accessible and efficient across industries. Mercury's success signals a new era of faster, more cost-effective AI applications.