
Diffusion LLM: Closer to Human Thought

By Juan Manuel Ortiz de Zarate

In recent weeks, the AI community has been abuzz with the launch of Mercury by Inception Labs [1], a groundbreaking diffusion-based large language model (dLLM) designed to drastically accelerate text generation. Mercury showcases the power of diffusion models for discrete data, achieving speeds up to 10 times faster than traditional autoregressive models while maintaining high-quality output. This major advancement reinforces the growing relevance of diffusion techniques in natural language processing and aligns closely with the principles behind Score Entropy Discrete Diffusion (SEDD)[2], the focus of this article.


dLLM performance vs. state-of-the-art LLMs

What makes these new diffusion-based models particularly exciting is how closely they resemble human reasoning. Instead of making decisions step by step, like traditional autoregressive systems, models like SEDD refine ideas over multiple passes, much like how humans iteratively improve thoughts or sentences. This iterative, holistic approach offers a more flexible and natural pathway to generating language and code.


Diffusion models[4] have revolutionized generative modeling, achieving state-of-the-art performance in domains like image and audio synthesis. However, their application to discrete data, such as text, has been challenging. Traditional diffusion models rely on score matching, which is well suited for continuous spaces but struggles with discrete structures. In response to this limitation, a new approach called Score Entropy Discrete Diffusion (SEDD) has been developed. This article explores the core principles behind SEDD, its advantages over previous discrete diffusion models, and its implications for generative modeling, particularly in natural language processing.


Challenges in Generative Modeling of Discrete Data

Autoregressive models, which generate data token by token [3] using the probabilistic chain rule, have long dominated discrete generative modeling. While powerful, they suffer from drawbacks such as slow sequential sampling, difficulty in incorporating global structure, and reliance on heuristic sampling techniques like nucleus sampling to maintain coherence.
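Concretely, an autoregressive model factorizes the joint probability of a sequence with the chain rule, which is why generation must proceed one token at a time:

\[
p_\theta(x_1, \dots, x_L) \;=\; \prod_{t=1}^{L} p_\theta(x_t \mid x_1, \dots, x_{t-1}).
\]

Sampling token \(x_t\) requires all previously generated tokens \(x_1, \dots, x_{t-1}\), so the cost of generation grows with sequence length and cannot be parallelized across positions.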

Diffusion models offer an alternative paradigm. They work by progressively corrupting data through a noise process and then learning to reverse this process to generate new samples. In continuous domains, this is effectively done using score matching, where the model learns the gradient of the data distribution. However, in discrete spaces, traditional score matching does not translate well, leading to less effective discrete diffusion models.
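In the continuous setting, the quantity learned by score matching is the gradient of the log-density of the noised data:

\[
s_\theta(x, t) \;\approx\; \nabla_x \log p_t(x).
\]

A discrete token can only be swapped for another symbol, not nudged infinitesimally, so this gradient has no direct analogue over a vocabulary; this is precisely the gap that SEDD addresses.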

The Score Entropy Discrete Diffusion Framework

SEDD introduces a novel loss function, score entropy, which extends score-matching principles to discrete spaces. Instead of directly estimating score functions (which are gradients in continuous spaces), SEDD parameterizes the reverse diffusion process using ratios of the data distribution. This is done through a set of probability mass ratios that approximate how one discrete state transitions to another.
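Schematically, and in the spirit of the notation used in [2], the network outputs, for a noisy sequence x and every sequence y that differs from x at a single position, an estimate of a probability ratio rather than a gradient:

\[
s_\theta(x)_y \;\approx\; \frac{p_t(y)}{p_t(x)},
\]

where \(p_t\) is the data distribution at noise level t. These ratios are the "concrete scores" discussed in step 2 below, and they remain well defined even though \(p_t\) lives on a finite vocabulary.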

How It Works:

  1. Discrete Diffusion Process: SEDD begins with a clean data sample and gradually applies random noise to corrupt it over a series of time steps. This process is modeled as a continuous-time Markov chain over a discrete state space. Each token in the sequence is independently perturbed, and the cumulative effect across all tokens drives the data distribution toward a simple, known prior, such as a uniform distribution or an absorbing state with a designated MASK token. This controlled corruption allows the model to handle discrete structures by ensuring that the noise introduced remains manageable and reversible.

  2. Reverse Process with Ratio Estimation: After the data is fully noised, SEDD learns to reverse this diffusion process. Unlike traditional methods that attempt to predict the denoised sample directly, SEDD models the ratios between the probabilities of neighboring states. These ratios, known as concrete scores, provide the relative likelihoods of state transitions, enabling the model to reconstruct the original data by progressively denoising each step. This approach is particularly advantageous in discrete spaces, where gradients are not well-defined, as it relies on relative comparisons between discrete outcomes rather than continuous gradients.

  3. Score Entropy Loss: Training is driven by the score entropy loss, which generalizes traditional score matching to discrete domains. The loss penalizes deviations between the model's estimated ratios and the true ratios of the data distribution while keeping the estimates positive and stable, correcting for earlier discrete diffusion losses that allowed invalid negative or zero values. In effect it acts as a regularizer that guides the model toward accurate reverse transitions. To keep training scalable, SEDD uses a denoising variant of this loss that recovers clean data from noisy intermediate states with minimal computational overhead, making the method both theoretically sound and practically efficient for high-dimensional tasks such as language modeling; a schematic form of the objective is sketched just after this list.
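As a rough sketch of the objective behind point 3, the score entropy loss from [2] can be written (suppressing the time integral and per-position weighting) as

\[
\mathcal{L}_{\mathrm{SE}} \;=\; \mathbb{E}_{x \sim p}\!\left[\, \sum_{y \neq x} w_{xy} \left( s_\theta(x)_y \;-\; \frac{p(y)}{p(x)} \log s_\theta(x)_y \;+\; K\!\left(\frac{p(y)}{p(x)}\right) \right) \right], \qquad K(a) = a(\log a - 1),
\]

where the sum runs over single-token perturbations y of x and \(w_{xy}\) are weights tied to the noise process. The \(-\tfrac{p(y)}{p(x)} \log s_\theta(x)_y\) term diverges as an estimated ratio approaches zero, which is what keeps the learned ratios strictly positive, and the constant \(K(\cdot)\) makes the loss vanish exactly when \(s_\theta(x)_y = p(y)/p(x)\). The denoising variant used in practice replaces the intractable ratios \(p(y)/p(x)\) with ratios of the known forward transition kernel conditioned on the clean sample, which is what makes training scalable.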


Advantages of SEDD Over Previous Approaches

SEDD provides several groundbreaking advantages over prior discrete diffusion models and autoregressive models, positioning it as a next-generation approach for discrete generative tasks:

  • Higher Modeling Accuracy: Through its innovative score entropy loss and ratio-based parameterization, SEDD delivers superior modeling accuracy across various benchmarks [5]. By directly learning the probability ratios between discrete states, it minimizes common errors and instability found in previous discrete diffusion methods. This allows SEDD to outperform other language diffusion models and even challenge well-established autoregressive models like GPT-2 in perplexity benchmarks, reflecting more coherent and contextually relevant text generation.

  • Efficient Sampling: Traditional autoregressive models are inherently sequential, generating tokens one after another, which becomes computationally expensive for long sequences. SEDD, however, leverages parallel sampling by updating multiple tokens simultaneously, dramatically reducing generation time. This parallelism makes SEDD particularly suitable for large-scale applications, allowing it to achieve high-quality outputs with fewer network evaluations, making it both cost-effective and scalable.

  • Flexibility in Conditioning: SEDD's unique approach to modeling probability ratios allows for dynamic and versatile text generation. Unlike autoregressive models that rely on left-to-right token generation, SEDD can handle arbitrary prompt positions and seamlessly fill in missing text (infilling) without additional retraining. This flexibility is particularly valuable for creative writing, editing, and interactive AI applications where users may wish to modify or complete text in non-linear ways.

  • Robustness Without Heuristic Tricks: Autoregressive models often depend on heuristics like nucleus sampling or temperature scaling to produce coherent text and avoid degeneration. SEDD naturally maintains high-quality generations without these techniques, offering a more principled and stable framework for text generation.

  • Trade-Off Between Compute and Quality: SEDD introduces a controllable balance between computational cost and generation quality. By adjusting the number of diffusion steps, users can tailor the generation process to prioritize either speed or fidelity. This trade-off enables practical deployment in diverse environments, from lightweight applications to high-fidelity creative tasks.

Collectively, these advantages establish SEDD as a versatile, high-performance alternative to existing generative models, opening new opportunities for innovation in discrete data generation.


Empirical Results and Performance


SEDD's effectiveness has been validated through comprehensive evaluations on standard language modeling benchmarks, solidifying its status as a leading alternative to traditional autoregressive models. The model was tested across diverse datasets, including WikiText-2, WikiText-103, One Billion Words (1BW), LAMBADA, and PTB, spanning small curated corpora as well as large-scale, real-world text. On these benchmarks, SEDD consistently achieves superior performance compared to prior discrete diffusion models.
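Perplexity, the metric behind these comparisons, is the exponentiated average negative log-likelihood a model assigns to held-out text (lower is better); for an autoregressive model it is

\[
\mathrm{PPL} \;=\; \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i}) \right).
\]

Diffusion models such as SEDD do not expose exact token likelihoods, so the reported figures are computed from a variational upper bound on the negative log-likelihood, which makes them conservative estimates rather than exact values.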


Zero-shot unconditional perplexity (↓) on a variety of datasets

SEDD consistently showcases perplexity reductions of 25-75% compared to other discrete diffusion models, a substantial improvement in how well the model captures the structure and semantics of natural language. Particularly impressive is SEDD's ability to rival, and in some cases surpass, well-established autoregressive models like GPT-2: in zero-shot perplexity evaluations SEDD matches GPT-2 and outperforms it on several benchmarks, and in generation-quality comparisons it does so without relying on sampling tricks like temperature scaling or nucleus sampling. This highlights SEDD's capacity to generate high-quality text inherently through its diffusion-based architecture.


Beyond perplexity, SEDD excels in generative quality. The model produces fluent, contextually appropriate text while requiring significantly fewer diffusion steps than comparable models, which translates to lower computational cost and makes deployment at scale practical. SEDD also establishes a predictable trade-off curve, allowing users to balance generation speed against text quality without compromising coherence.


Furthermore, SEDD's performance in conditional generation tasks such as infilling, arbitrary prompting, and completing masked sequences demonstrates a flexibility that is hard to match. Unlike autoregressive counterparts bound to left-to-right token generation, SEDD dynamically adapts to varied contextual constraints. This versatility makes it especially appealing for interactive tools, content creation workflows, and applications requiring high-quality completions in real time.


These empirical strengths position SEDD as a transformative, scalable, and efficient model for real-world natural language generation, setting a new benchmark for discrete diffusion techniques and driving forward the future of text-based AI.


Real-World Implementation: Mercury Coder by Inception Labs


A notable real-world application of score entropy-based diffusion modeling has emerged with Mercury Coder, developed by Silicon Valley startup Inception Labs. Mercury Coder leverages diffusion principles to generate high-quality code, offering small and mini versions of the model. Although many technical details such as parameter count, input size, and training data remain undisclosed, Mercury Coder applies score entropy techniques by estimating the transition ratio between tokens, similar to the methodology proposed in SEDD.


Mercury performance and state-of-the-art LLMs across several benchmarks

In practice, Mercury Coder operates by progressively masking tokens during training and learning to unmask them over several steps during inference. This process refines its outputs iteratively, similar to image diffusion models but applied to code generation. Early benchmarks show impressive speed and performance gains. Running on an Nvidia H100 GPU, Mercury Coder Small achieves 737 tokens per second, and Mercury Coder Mini reaches 1,109 tokens per second—both significantly faster than comparable coding models like Qwen 2.5 Coder 7B and GPT-4o Mini. Mercury Coder also demonstrates competitive accuracy, outperforming popular models like Gemini 2.0 Flash-Lite and Claude 3.5 Haiku on several coding benchmarks.
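Inception Labs has not published Mercury's architecture or training recipe, so the following is only a toy sketch of the generic mask-and-unmask sampling loop described above; `model`, `unmasking_sampler`, and `MASK_ID` are hypothetical placeholders rather than anything from Mercury's actual codebase.

```python
import torch

MASK_ID = 0  # hypothetical vocabulary id reserved for the special MASK token


@torch.no_grad()
def unmasking_sampler(model, prompt_ids, seq_len, num_steps):
    """Toy absorbing-state diffusion sampler: start from an all-MASK sequence
    and reveal a growing fraction of positions in parallel at every step.
    `model(x)` is a stand-in for any trained denoiser that returns per-position
    logits of shape (batch, seq_len, vocab_size); `prompt_ids` is a 1-D LongTensor.
    """
    x = torch.full((1, seq_len), MASK_ID, dtype=torch.long)
    x[0, : prompt_ids.numel()] = prompt_ids.view(-1)  # known tokens stay fixed; infilling works the same way

    for step in range(num_steps):
        still_masked = x == MASK_ID
        remaining = int(still_masked.sum().item())
        if remaining == 0:
            break                                   # everything already revealed

        logits = model(x)                           # one forward pass updates *all* positions
        logits[..., MASK_ID] = float("-inf")        # never predict the MASK token itself
        probs = torch.softmax(logits, dim=-1)
        confidence, guess = probs.max(dim=-1)       # best token and its probability at each position

        # Reveal the most confident masked positions; by the final step the
        # fraction reaches 1.0, so every remaining MASK gets resolved.
        frac = (step + 1) / num_steps
        k = max(1, int(frac * remaining))
        confidence = confidence.masked_fill(~still_masked, -1.0)
        reveal = confidence.topk(k, dim=-1).indices
        x[0, reveal[0]] = guess[0, reveal[0]]

    return x
```

Because each `model(x)` call refines every position at once, a handful of steps can decode many tokens per forward pass, which is the intuition behind the throughput figures quoted above; a production system would also use a learned noise schedule and smarter reveal heuristics than this greedy confidence rule.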


This successful deployment of a diffusion-based code generator highlights the growing maturity of discrete diffusion techniques for real-world applications and reinforces the promise of SEDD and similar frameworks for scalable, high-speed generative tasks.


Implications and Future Directions


The success of SEDD opens new possibilities for discrete generative modeling, signaling a shift in how we approach complex, structured data generation. Beyond natural language processing, SEDD's architecture holds promise for expanding into domains like protein sequence design, symbolic mathematics, genomic data synthesis, and even music composition—any field where discrete data forms the foundation. The ability to model transitions between discrete states with precision unlocks opportunities for creating more controllable, robust, and efficient generative systems.


Future research could focus on several key areas. One direction is refining the score entropy loss to further enhance stability and scalability, particularly in extremely high-dimensional settings. Another promising avenue is developing hybrid models that blend the strengths of SEDD with autoregressive frameworks, combining the parallelism and flexibility of diffusion with the strong local coherence of sequential models. Additionally, applying SEDD to larger-scale language models and exploring domain-specific adaptations, from biomedicine to the creative arts, could drive meaningful improvements in generative performance across diverse industries.


Conclusion


Score Entropy Discrete Diffusion represents a transformative leap in generative modeling for discrete data. By introducing a novel score entropy loss and parameterizing probability ratios, SEDD overcomes the inherent limitations of previous discrete diffusion models and bridges the gap between diffusion approaches and the longstanding dominance of autoregressive methods. Its ability to produce high-quality, coherent, and controllable outputs with remarkable efficiency marks a new chapter in the evolution of text and categorical data generation.


As evidenced by real-world implementations like Mercury Coder, the impact of SEDD's underlying principles is already extending beyond research into practical, high-performance applications. With continued innovation and broader adoption, SEDD is poised to play a central role in the next generation of generative AI, reshaping natural language processing, code generation, and other structured data domains for years to come.


References



[2] Lou, A., Meng, C., & Ermon, S. (2023). Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834.


[3] The Mathematics of Language, Transcendent AI


[4] Diffusion model, Wikipedia

