
AI Models Reveal Hidden Text Patterns in Their Own Brains

Researchers discover that large language models encode text genres like narratives and instructions in their activations, enabling accurate classification with simple probes—a step toward better AI transparency and monitoring.

AI Research
March 26, 2026
4 min read

A new study shows that large language models (LLMs) like Mistral-7B internally organize text into recognizable categories such as narratives, instructions, and explanations, even when not explicitly trained to do so. This finding, published in a preprint paper, suggests that AI systems may have built-in structures for understanding different types of writing, which could help researchers monitor and interpret their outputs more effectively. For non-technical readers, this means that the "brains" of AI models might be more interpretable than previously thought, offering a window into how these systems process and generate human-like text.

The key finding from the research is that text genres—like instructional, explanatory, speech, narrative, and code—can be predicted from the activations inside an LLM with high accuracy. Using Mistral-7B, a model with 7 billion parameters, the researchers achieved F1-scores of up to 98% on a synthetic dataset and 71% on a real-world dataset called CORE. These scores, which measure classification performance, consistently outperformed control tasks where the model's parameters were randomized, indicating that the genre information is genuinely encoded in the activations. This demonstrates that LLMs naturally develop representations for different text types, which can be extracted using simple machine learning tools.
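For readers unfamiliar with the metric, an F1-score combines precision (how many predicted labels were right) and recall (how many true labels were found). A tiny worked example with made-up genre labels, not data from the paper:

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels and probe predictions for six text chunks.
y_true = ["narrative", "code", "instruction", "narrative", "code", "code"]
y_pred = ["narrative", "code", "narrative", "narrative", "code", "instruction"]

# Macro-F1 averages the per-genre F1 (the harmonic mean of precision
# and recall computed for each genre separately).
print(round(f1_score(y_true, y_pred, average="macro"), 2))  # prints 0.53
```

A score of 0.98, as on the synthetic dataset, means the probe's predictions almost always match the gold labels across every genre.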

To conduct the study, the researchers created two datasets: a synthetic one with 3,914 text chunks generated by Mistral-7B and labeled by GPT-4 Turbo, and the CORE dataset, a pre-existing corpus of online English texts. They prompted Mistral-7B with text chunks and extracted activations (the internal signals produced as the model processes text) from each layer of the transformer architecture. These activations were then used as input for shallow classifiers from the scikit-learn library, such as linear probes, to predict the genre labels. The methodology included a control task with randomized model parameters to ensure that the results were not due to spurious correlations, following established practices in probing research.
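The probe-versus-control comparison can be sketched in a few lines of scikit-learn. This is a minimal, self-contained illustration, not the paper's code: the "activations" below are synthetic stand-ins (the real ones would be Mistral-7B hidden states), and all sizes and variable names are assumptions. The control here shuffles labels, a common stand-in for the paper's randomized-parameter control.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_chunks, hidden_dim, n_genres = 600, 64, 5  # hypothetical sizes

# Fake "activations": each genre gets its own mean direction, plus noise.
labels = rng.integers(0, n_genres, size=n_chunks)
genre_means = rng.normal(size=(n_genres, hidden_dim))
acts = genre_means[labels] + rng.normal(scale=0.5, size=(n_chunks, hidden_dim))

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)

# Linear probe trained on the real genre labels.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
real_f1 = f1_score(y_te, probe.predict(X_te), average="macro")

# Control: shuffled labels should destroy any genuine signal.
control = LogisticRegression(max_iter=1000).fit(X_tr, rng.permutation(y_tr))
control_f1 = f1_score(y_te, control.predict(X_te), average="macro")

print(f"probe F1={real_f1:.2f}  control F1={control_f1:.2f}")
```

If the probe beats the control by a wide margin, as it does here and in the paper, the genre information is genuinely encoded in the activations rather than being an artifact of the probing setup.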

The results, detailed in Figure 5 of the paper, show that classification accuracy improves as probes move deeper into the model's layers, with the best performance occurring in later layers. For the synthetic dataset, F1-scores reached 0.98, while for the CORE dataset they peaked at 0.71, likely due to the greater similarity between categories in that dataset. Dimensionality-reduction visualizations in Figures 3 and 4 revealed that text genres form distinct clusters in the embedding space for the synthetic data but overlap more in the CORE data, supporting the classification results. This analysis confirms that LLMs encode high-level text structures in a way that is accessible to external probes, providing a proof of concept for interpreting model internals beyond single tokens.
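The clustering check behind those visualizations can be sketched as follows. This is an illustrative stand-in, not the paper's analysis: it projects synthetic "activations" to two dimensions with PCA and summarizes cluster separation with a silhouette score (the paper presents visual plots instead; three genres and all sizes here are assumptions chosen so the example stays simple).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
n_genres, hidden_dim, n_chunks = 3, 64, 500  # hypothetical sizes

# Synthetic activations: three well-separated genre clusters.
labels = rng.integers(0, n_genres, size=n_chunks)
means = 3.0 * rng.normal(size=(n_genres, hidden_dim))
acts = means[labels] + rng.normal(size=(n_chunks, hidden_dim))

# Project to 2D, as done for the paper's cluster plots.
proj = PCA(n_components=2).fit_transform(acts)

# Silhouette score near 1 means tight, well-separated clusters;
# near 0 means heavy overlap (as the paper reports for CORE).
score = silhouette_score(proj, labels)
print(f"silhouette in 2D: {score:.2f}")
```

A high score corresponds to the distinct clusters seen for the synthetic data, while overlapping CORE genres would push the score toward zero.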

The implications of this research are significant for improving AI transparency and safety. By showing that text genres can be inferred from activations, it opens the door to better monitoring of LLM outputs, potentially helping to detect when models generate inappropriate or biased content. For everyday users, this could lead to more trustworthy AI systems that are easier to audit and control. The study also lays groundwork for future work that might predict longer sequences of text categories or incorporate additional properties like topic or emotion, enhancing our ability to understand and guide AI behavior in real-world applications.

However, the study has limitations, as noted in the paper. It focuses solely on Mistral-7B, leaving open the question of whether similar results would hold for other models. Additionally, only two datasets were used, which may not capture the full diversity of text genres. Future research could explore more models and datasets to generalize these findings and investigate what drives performance variations. Despite these constraints, the research provides a promising step toward a predictive framework for AI interpretability, moving beyond token-level analysis to understand how models represent larger chunks of text.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn