
AI Learns to Count Its Own Words

A new training method helps large language models precisely control output length without sacrificing quality, addressing a critical flaw in creative and professional writing applications.

AI Research
March 26, 2026
4 min read

Large language models like ChatGPT have become adept at following complex instructions, but they struggle with a surprisingly basic task: controlling how much they write. When asked to produce exactly 100 words or a 500-word report, these models often generate responses that are too short or excessively long, failing to meet precise length constraints. This limitation becomes increasingly problematic as AI is deployed for real-world applications such as creative writing, report generation, and academic assistance, where specific length requirements are common. A new study introduces a training framework called LARFT that directly addresses this issue by teaching models to internalize the concept of length, effectively closing the gap between understanding an instruction and executing it accurately.

The researchers discovered that the core problem lies in what they term the "cognition-action gap." While models can process length instructions, they lack an internal representation or awareness of length during the generation process. This deficit means that even when a model is trained with external signals or rewards to control output size, it fails to decouple length from semantic content, leading to imprecise control. The paper shows that existing approaches, which rely on adding length-specific tokens or using length as a reward signal in reinforcement learning, are insufficient because they treat length as an external constraint rather than an internal feature the model needs to understand. Consequently, models struggle significantly, with outputs frequently deviating from target lengths, especially as context windows expand and demands for long-form generation grow.

To bridge this gap, the team developed LARFT (Length-Aware Reinforcement Fine-Tuning), a unified training framework that combines two key components. First, it uses length-oriented reinforcement learning, where the model receives a verifiable reward based on how closely its output matches the target length, guiding its generation actions. Second, and more innovatively, it incorporates hindsight length awareness: the model is trained to count the words in its own generated outputs. By relabeling on-policy data—samples produced during training—with prompts like "Count the words in the text above," the model learns to internalize length concepts from its own experiences. This dual approach allows the model to develop an internal representation of length while refining its policy to satisfy constraints, creating a positive feedback loop where enhanced awareness improves generation control.
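The two training signals described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the authors' implementation: the function names, the linear reward shaping, and whitespace-based word counting are all assumptions made here for clarity.

```python
# Illustrative sketch of LARFT's two training signals.
# NOTE: the reward shaping and helper names are assumptions,
# not the paper's actual implementation.

def length_reward(output_text: str, target_len: int) -> float:
    """Verifiable reward: 1.0 for an exact word-count match,
    decaying linearly toward 0.0 as the output drifts from target."""
    actual = len(output_text.split())
    deviation = abs(actual - target_len) / max(target_len, 1)
    return max(0.0, 1.0 - deviation)

def hindsight_counting_example(output_text: str) -> dict:
    """Relabel an on-policy sample as a word-counting task, so the
    model learns length from its own generations (hindsight awareness)."""
    return {
        "prompt": f"{output_text}\n\nCount the words in the text above.",
        "answer": str(len(output_text.split())),
    }
```

In a full training loop, each sampled generation would contribute both a policy-gradient update weighted by `length_reward` and a relabeled counting example, which is what creates the feedback loop between length awareness and length control.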

Results, detailed across four base models including Qwen2.5 and Llama variants, demonstrate LARFT's effectiveness. On three length instruction following benchmarks—LIFEBench, LongBench, and Lenctrl-Bench—LARFT achieved an average improvement of +20.92 points in length-following capability compared to untrained models, and outperformed the strongest reinforcement learning baseline by 4.59 points. For instance, on LIFEBench, which evaluates a wide range of length constraints, LARFT obtained the highest Length Score and lowest Length Deviation across all models, with improvements of up to 12.30 points over the second-best method. Crucially, this specialized enhancement came with minimal cost to general capabilities: performance on benchmarks like MMLU, GSM8K, and GPQA showed only a marginal decline of -1.45 points on average, and in some cases, generation quality even improved slightly. Ablation studies confirmed that both components of LARFT are essential, with the hindsight awareness mechanism providing the cognitive foundation that makes precise control possible.
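A metric like the Length Deviation reported above can be illustrated as the mean relative gap between generated and target word counts. The exact scoring formula used by LIFEBench and the paper is not given here, so this sketch is an assumption about what such a metric typically measures:

```python
# Illustrative length-deviation metric (percentage); the precise
# formula used by the benchmarks in the paper may differ.

def length_deviation(outputs: list[str], targets: list[int]) -> float:
    """Mean relative deviation (%) between generated word counts
    and their target lengths. Lower is better."""
    devs = [
        abs(len(out.split()) - target) / target * 100
        for out, target in zip(outputs, targets)
    ]
    return sum(devs) / len(devs)
```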

The implications of this research extend to numerous practical applications where precise length control is vital. In creative writing, AI could generate stories or articles that adhere to specific word counts without becoming verbose or truncated. For report generation in business or academia, models could produce summaries or analyses that meet strict formatting requirements, improving efficiency and reliability. Precise length control also addresses resource concerns, as uncontrolled long-form generation can lead to excessive token consumption and increased inference latency. By enabling models to balance length constraints with content quality, LARFT could make AI tools more practical and trustworthy for everyday use, from drafting emails to composing technical documents.

Despite its successes, the study acknowledges limitations. The evaluation was restricted to targets under 4,000 words due to the maximum output length limitations of the base models, leaving open questions about performance on extremely long generations. Additionally, while LARFT maintains general capabilities, there is a slight trade-off, with minor dips observed on some benchmarks, indicating that further optimization may be needed to eliminate any residual impact. The researchers also note that their framework focuses on length control in text generation, and future work should explore its generalization to other generation scenarios, such as logical reasoning or stylistic control, where similar cognition-action gaps might exist.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn