
Inception Labs, a startup founded by Stanford, UCLA, and Cornell professors Stefano Ermon, Aditya Grover, and Volodymyr Kuleshov, has launched Mercury, a groundbreaking series of diffusion-based large language models (dLLMs) that marks a significant shift away from today's autoregressive large language models. Mercury is up to 10 times faster than existing speed-optimized LLMs at the technology frontier, generating over 1,000 tokens per second on NVIDIA H100 GPUs, a speed previously achievable only with specialized AI chips. The first commercial product in the Mercury series is Mercury Coder, which delivers exceptional performance in code generation while maintaining quality across benchmarks when compared with leading speed-optimized autoregressive models such as GPT-4o Mini and Claude 3.5 Haiku.
Limitations of Traditional Autoregressive Models:
Current large language models are autoregressive: they generate text sequentially from left to right, one token at a time. Producing each token requires a neural network to evaluate billions of parameters, which drives up latency and computational cost, especially when the model performs complex reasoning tasks. This inherently sequential process limits both the speed and the efficiency of traditional LLMs.
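To make the sequential bottleneck concrete, here is a minimal sketch of greedy autoregressive decoding, using a small Hugging Face model purely for illustration. Every generated token costs one full forward pass through the network:

```python
# Minimal greedy autoregressive decoding loop; gpt2 stands in for any causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("def fibonacci(n):", return_tensors="pt").input_ids
for _ in range(50):                       # 50 tokens -> 50 sequential forward passes
    logits = model(ids).logits
    next_id = logits[0, -1].argmax()      # greedily pick the single next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tokenizer.decode(ids[0]))
```

Because each iteration depends on the token produced by the previous one, the loop cannot be parallelized across positions; latency grows linearly with output length.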
Mercury’s Diffusion-Based Approach: Enhancing Speed and Efficiency:
Mercury is built on diffusion large language models (dLLMs), which use a "coarse-to-fine" generation process: the output begins as pure noise and is iteratively refined over several denoising steps. Because diffusion models generate and modify large blocks of text in parallel rather than sequentially, the model can reason about and structure its responses more effectively, leading to fewer mistakes and hallucinations.
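Inception Labs has not published Mercury's exact algorithm, but the coarse-to-fine idea can be sketched with a toy masked-diffusion decoder: start from a fully masked ("pure noise") sequence, predict every position in parallel at each step, and commit only the most confident tokens until none remain. The denoiser below is a random stub standing in for the trained network:

```python
# Toy coarse-to-fine masked-diffusion decoding; a stub replaces the trained denoiser.
import torch

VOCAB, LENGTH, STEPS = 1000, 16, 4
MASK = VOCAB                                       # mask id outside the vocabulary

def denoiser(tokens):
    # Stand-in for the trained network: logits for every position at once.
    return torch.randn(tokens.shape[0], VOCAB)

tokens = torch.full((LENGTH,), MASK)               # "pure noise": all positions masked
for step in range(STEPS):
    logits = denoiser(tokens)                      # predict all positions in parallel
    conf, preds = logits.softmax(-1).max(-1)
    masked = tokens == MASK
    k = int(masked.sum()) // (STEPS - step)        # unmask a fraction each step
    scores = torch.where(masked, conf, torch.full_like(conf, -1.0))
    keep = scores.topk(k).indices                  # commit the most confident positions
    tokens[keep] = preds[keep]
print(tokens)                                      # fully denoised after STEPS passes
```

Here four parallel passes replace sixteen sequential ones; in a real dLLM the savings grow with sequence length, which is where the throughput advantage comes from.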
So far, diffusion models have been used mainly in image, video, and audio generation; Sora, Midjourney, and Riffusion are prominent examples. The introduction of Mercury Coder therefore represents a significant breakthrough: a successful application of diffusion models to text and code generation. Mercury Coder, purpose-built for code generation, is up to 10x faster than existing speed-optimized LLM applications while maintaining output quality.
Performance Benchmarks:
Speed Advantage:
Mercury Coder generates over 1,000 tokens per second on standard NVIDIA H100 GPUs, roughly a 5x speedup over current autoregressive models, which even when optimized for speed typically process up to 200 tokens per second. Compared with some leading models that run at fewer than 50 tokens per second, Mercury Coder offers more than a 20x improvement.
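As a quick back-of-the-envelope check of what those throughputs mean for wall-clock latency (the 500-token completion length is an illustrative assumption):

```python
# Rough wall-clock latency for a 500-token completion at the quoted throughputs.
completion_tokens = 500
for name, tokens_per_second in [("Mercury Coder", 1000),
                                ("speed-optimized AR model", 200),
                                ("frontier AR model", 50)]:
    print(f"{name}: {completion_tokens / tokens_per_second:.1f} s")
# Mercury Coder: 0.5 s; speed-optimized AR: 2.5 s; frontier AR: 10.0 s
```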
Hardware Independence:
The throughput dLLMs achieve was previously attainable only with specialized hardware from providers such as Groq, Cerebras, and SambaNova. Mercury Coder's gains, however, come from algorithmic improvements rather than custom silicon, which means they are not tied to any particular accelerator and will compound further as chips get faster.
Developer Preference:
Developers prefer Mercury's code completions over those of existing code models. In the Copilot Arena benchmark, Mercury Coder Mini tied for second place, outperforming speed-optimized models such as GPT-4o Mini and Gemini-1.5-Flash as well as larger models like GPT-4o, while running roughly four times faster than GPT-4o Mini.
Enterprise Applications:
Mercury is now available for testing in a playground hosted in partnership with Lambda Labs, allowing developers to experience its speed and accuracy firsthand. Mercury Coder Mini and Mercury Coder Small are available via API or on-premise deployment. Inception Labs states that both models are fully compatible with existing hardware, datasets, and supervised fine-tuning (SFT) and alignment (RLHF) pipelines. For enterprise customers, the company offers fine-tuning support for both deployment options, enabling Mercury Coder to be adapted to a range of use cases. A Mercury model designed for conversational applications is currently in closed beta testing. The company is actively testing its technology, and some customers have already begun replacing autoregressive models with Mercury.
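For illustration, here is a sketch of what calling the API might look like, assuming an OpenAI-compatible endpoint; the base URL and model identifier below are assumptions for illustration, not confirmed details from Inception Labs' documentation:

```python
# Hypothetical client call; endpoint URL and model name are assumed, not documented facts.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",   # assumed endpoint
    api_key="YOUR_API_KEY",
)
response = client.chat.completions.create(
    model="mercury-coder-small",                  # assumed model identifier
    messages=[{"role": "user",
               "content": "Write a Python function that reverses a string."}],
)
print(response.choices[0].message.content)
```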
Future Directions:
While Inception Labs has not publicly disclosed its roadmap, it emphasizes its commitment to advancing AI technologies and delivering best-in-class models and solutions. The company offers API access and supports on-premise deployments for enterprise customers, signaling a focus on expanding Mercury's applications and accessibility.
We are excited to introduce Mercury, the first commercial-grade diffusion large language model (dLLM)! dLLMs push the frontier of intelligence and speed with parallel, coarse-to-fine text generation.
— Inception Labs (@InceptionAILabs) February 26, 2025
Inception Labs' Mercury marks a paradigm shift in large language model technology, challenging autoregressive models with a diffusion-based approach that offers greater speed and efficiency. By achieving over 1,000 tokens per second on standard hardware while maintaining response quality, Mercury Coder demonstrates that diffusion models can revolutionize text generation. As the technology matures, it could transform how language models are developed and deployed, making high-quality AI more accessible and efficient across industries. Mercury's success signals a new era of faster, more cost-effective AI applications.