AI Creates Truly New Text, Not Just Regurgitation

TL;DR

A new study measures how often AI output can't be traced to training data, showing real capacity for original generation.

Whether artificial intelligence can produce genuinely new content matters more and more as AI systems become integral to writing, coding, and creative tasks. A new study introduces a method to quantify this novelty and reports that AI models often generate text that cannot be attributed to their training data.

The key finding is that AI models can generate outputs with no close matches in their training corpora. The researchers define un-attributability as the absence of any training example that overlaps or shares semantic context with the AI's output. They measure it with a two-stage pipeline: first, candidate text chunks are retrieved from the training data using GIST embeddings and FAISS indexing; then the candidates are reranked with ColBERTv2 for fine-grained similarity analysis. If no candidate closely matches the AI generation, the generation is deemed novel. Scores are normalized against human-written references, so a result is read relative to a human baseline rather than as a raw similarity value.
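The retrieve-then-rerank idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's implementation: plain cosine similarity over toy vectors stands in for GIST embeddings with a FAISS index, and a simple late-interaction "MaxSim" score stands in for ColBERTv2. The function names, the toy vectors, and the choice to average rather than sum over query tokens are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, chunk_vecs, k):
    """Stage 1 (stand-in for embedding search): ids of the k nearest chunks."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

def maxsim(query_tokens, doc_tokens):
    """Stage 2 (late-interaction stand-in): for each query token vector,
    take its best match among the chunk's token vectors, then average."""
    best = [max(cosine(q, d) for d in doc_tokens) for q in query_tokens]
    return sum(best) / len(best)

def best_match_score(query_vec, query_tokens, chunk_vecs, chunk_tokens, k=2):
    """Highest reranked similarity among the retrieved candidates.
    A low value means no training chunk closely matches the generation."""
    candidates = retrieve(query_vec, chunk_vecs, k)
    return max(maxsim(query_tokens, chunk_tokens[i]) for i in candidates)

# Toy "training corpus": one vector per chunk, plus per-token vectors.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
chunk_tokens = [
    [[1.0, 0.0], [1.0, 0.1]],
    [[0.0, 1.0]],
    [[1.0, 1.0], [0.5, 0.5]],
]

# A generation that copies chunk 0, and one pointing away from every chunk.
copied = best_match_score([1.0, 0.0], [[1.0, 0.0]], chunk_vecs, chunk_tokens)
novel = best_match_score([-1.0, 0.0], [[-1.0, 0.0]], chunk_vecs, chunk_tokens)
print(copied, novel)
```

The cheap first stage narrows the corpus to a handful of candidates so that the more expensive token-level comparison only runs on those.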

The researchers applied this pipeline to outputs from models such as SmolLM and SmolLM2 across several settings. In open-domain scenarios, for instance, the team analyzed generations from documents in the Dolma dataset, which was not part of the models' training data. They compared prompted generations, where the AI completes text from a given context, with unprompted ones, and assessed how novelty changes with chunk sizes from 50 to 500 tokens. The approach is model-agnostic and scalable, allowing large-scale analysis without relying on computationally intensive causal attribution methods.
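To make the chunk-size comparison concrete, here is a small sketch of splitting one generation into fixed-size token chunks. It rests on assumptions the article does not settle: whitespace splitting stands in for the models' tokenizer, chunks are taken as consecutive and non-overlapping, and the intermediate sizes 100 and 250 are picked only for illustration (the article gives just the 50 to 500 range).

```python
def chunk(tokens, size):
    """Split a token list into consecutive, non-overlapping chunks of
    `size` tokens. The last chunk may be shorter."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

# Hypothetical 1200-token generation, tokenized by whitespace.
text = " ".join(f"w{i}" for i in range(1200))
tokens = text.split()

# Chunk the same text at several sizes in the 50 to 500 token range.
chunks_by_size = {size: chunk(tokens, size) for size in (50, 100, 250, 500)}

for size, chunks in chunks_by_size.items():
    print(size, len(chunks), len(chunks[-1]))
```

Each chunk would then be scored separately against the training data, which is what lets novelty be compared across chunk sizes.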

The results, shown in Figure 3, indicate that unprompted AI outputs often have higher novelty scores than prompted ones, meaning they are less attributable to training data. For example, SmolLM2 models showed median ColBERTv2 similarity scores below the human baseline in many cases, which the study interprets as novel content. In domain-specific tasks such as mathematical reasoning (GSM8K) and text rewriting (OpenRewriteEval), instruction-tuned models produced outputs with even lower similarity scores, suggesting greater novelty. Figure 4 illustrates that for GSM8K, correct answers from SmolLM2 had normalized scores close to zero or negative, meaning they were not simply reproductions of training examples.
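One way to read "below the human baseline" is as a difference of medians. The sketch below assumes that reading; the article does not give the study's exact normalization formula, and every number here is made up for illustration rather than taken from the paper.

```python
from statistics import median

def normalized_median(model_scores, human_scores):
    """Median model similarity minus median human similarity.
    Zero or negative means the model's text matches the training data
    no more closely than human-written reference text does."""
    return median(model_scores) - median(human_scores)

# Illustrative similarity scores only (not results from the study).
human = [0.55, 0.60, 0.65]
prompted = [0.58, 0.62, 0.70]
unprompted = [0.40, 0.50, 0.60]

prompted_gap = normalized_median(prompted, human)
unprompted_gap = normalized_median(unprompted, human)
print(prompted_gap, unprompted_gap)
```

Under this convention a negative gap corresponds to the "close to zero or negative" scores described above: lower similarity than the human reference, hence higher novelty.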

This matters because it challenges the notion that AI merely memorizes its training data. Higher novelty suggests better generalization, which is vital for applications such as content creation and problem-solving, where originality is valued. It also informs debates on intellectual property, since un-attributable outputs may reduce concerns about copyright infringement. For everyday users, this means AI tools could help generate unique ideas or solutions rather than reproducing pre-existing text.

The paper notes several limitations. Results depend on the choice of embeddings, which might introduce biases, and storing and indexing large datasets is computationally expensive (around 20 TB in this study). The method also does not capture causal relationships and is not a replacement for traditional attribution techniques. Future work could explore how novelty varies across models and training strategies to better understand AI creativity.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
