
AI Generates Text by Inserting Words, Not Masking Them

A new method called DID replaces inefficient masking in language models with deletion and insertion, speeding up training and inference while handling variable-length text naturally.

AI Research
March 26, 2026
3 min read

A new approach to building language models could make AI text generation faster and more flexible by eliminating a common computational bottleneck. Researchers have developed Deletion-Insertion Diffusion language models (DID), which replace the masking and unmasking processes used in current diffusion models with token deletion and insertion. This change removes the need for non-informative mask and padding tokens, saving significant computational resources during both training and inference. DID also natively supports variable-length sequences, allowing models to generate text of different lengths without padding, and includes a self-correction mechanism that dynamically adjusts token positions during generation.
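To make the contrast concrete, here is a minimal toy sketch of the deletion-based forward process described above: tokens are progressively removed until the sequence is empty, rather than being replaced by mask tokens. The function name and the random-deletion schedule are illustrative assumptions, not the paper's implementation.

```python
import random

def forward_delete(tokens, seed=0):
    """Toy sketch of a deletion forward process: remove one token at a
    time (instead of masking it) until the sequence is empty.
    Returns the list of intermediate states."""
    rng = random.Random(seed)
    seq = list(tokens)
    states = [list(seq)]
    while seq:
        seq.pop(rng.randrange(len(seq)))  # delete a random token
        states.append(list(seq))
    return states

trajectory = forward_delete(["the", "cat", "sat"])
```

The generative (backward) process would run this in reverse: starting from the empty sequence, it inserts a token at a chosen position at each step, so intermediate states never carry non-informative placeholder tokens.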

The key finding is that DID improves efficiency and flexibility over existing Masked Diffusion Language Models (MDLMs). In experiments, DID achieved training speedups of up to 1.99× for fixed-length data and 3.42× for variable-length data compared to MDLM baselines like RADD. Inference was accelerated by up to 1.58× and 3.79× in these settings, respectively. When aligned by computational budget, DID outperformed RADD on zero-shot language modeling perplexity across seven datasets, including WikiText and Lambada, as shown in Table 1. For variable-length generation, DID produced text with lower generative perplexity and better alignment to data length distributions, as demonstrated in Table 4 and Figure 2.

The methodology involves formulating deletion and insertion as discrete diffusion processes within a continuous-time Markov chain framework. Instead of masking tokens, DID progressively deletes them in a forward process until the sequence is empty, then reconstructs text by inserting tokens in a backward process. The researchers designed a score-based training objective called Denoising Insertion Score Entropy (DISE), which targets insertion scores for each possible token and position. To compute this efficiently, they developed a parallelized dynamic programming algorithm that solves subsequence counting problems, reducing time complexity from O(mn^2V) to O(mn) for sequences of lengths m and n with vocabulary size V.
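The subsequence counting subproblem at the heart of that dynamic program can be illustrated with the classic O(mn) recurrence for counting how many ways one sequence embeds in another. This is a standard algorithm shown here for intuition; the paper's parallelized formulation and how it plugs into DISE are not reproduced.

```python
def count_subsequences(sub, seq):
    """Count the ways `sub` occurs as a (not necessarily contiguous)
    subsequence of `seq`, using the classic O(mn) dynamic program.
    dp[i] = number of embeddings of sub[:i] into the scanned prefix."""
    m = len(sub)
    dp = [0] * (m + 1)
    dp[0] = 1  # the empty subsequence embeds exactly once
    for tok in seq:
        # iterate i downward so each token of seq is used at most once
        for i in range(m, 0, -1):
            if sub[i - 1] == tok:
                dp[i] += dp[i - 1]
    return dp[m]

count_subsequences("ab", "aabb")  # each 'a' pairs with each 'b': 4 ways
```

A naive enumeration over all insertion positions and candidate tokens would cost O(mn^2V); collapsing it to a shared dynamic program like this one is what brings the objective down to O(mn) per sequence pair.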

Analysis from the paper shows DID's advantages in both fixed and variable-length settings. In fixed-length benchmarks, DID-F (FLOPs-aligned) models consistently outperformed RADD on zero-shot perplexity, with improvements such as 36.91 vs. 38.27 on WikiText for small models. Generative perplexity evaluated by GPT2 Large was lower for DID with fewer denoising steps, as seen in Table 2. For variable-length data, DID maintained stable generative perplexity across different step counts, unlike baselines, and its length distribution closely matched the training data, as shown in Figure 2. The self-correction mechanism allowed dynamic adjustments during generation, reducing error accumulation.

The implications of this research are significant for practical AI applications, as it addresses inefficiencies in current diffusion models. By eliminating mask and padding tokens, DID reduces computational overhead, making training and inference faster and more cost-effective. The native support for variable-length sequences means models can handle real-world text data without artificial padding, improving generation quality and consistency. This could benefit areas like content creation, chatbots, and data analysis where text length varies naturally. The self-correction feature also enhances robustness, potentially leading to more coherent and accurate AI-generated text.

Limitations of the study include that DID has not yet integrated advanced optimizations like hybrid models or sophisticated inference algorithms, which are common in more established MDLMs. The paper notes that models were trained at a relatively small scale due to resource constraints, leaving performance on larger tasks unexplored. Additionally, the dynamic programming implementation, while efficient, adds constant overhead and may require further system-level support for variable-length data to maximize speedups. Future work could focus on scaling up DID and adapting existing optimizations from other diffusion models to further enhance its capabilities.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn