
AI Question Generators Struggle with Real-World Quality

New research reveals that optimizing AI to ask better questions doesn't always improve human judgment of quality, challenging how we measure machine intelligence.

AI Research
November 14, 2025
3 min read

When machines learn to ask questions, we expect them to become better conversational partners and educational tools. But new research shows that improving AI question generators through standard optimization methods often fails to translate to what humans consider good questions, revealing a fundamental gap in how we train and evaluate artificial intelligence.

The researchers discovered that using reinforcement learning to optimize specific question qualities—fluency, relevance, and answerability—produces inconsistent results when judged by human evaluators. While automated metrics like BLEU scores showed improvement, human raters often found the optimized questions no better, and sometimes worse, than those from basic models. This disconnect between machine metrics and human judgment highlights a critical challenge in developing AI that genuinely understands and communicates effectively.
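To see how an n-gram metric can drift away from human judgment, consider clipped unigram precision, the simplest building block of BLEU. The example questions below are invented for illustration:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the simplest component of BLEU."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Count each candidate word only up to how often it appears in the reference.
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return overlap / sum(cand_counts.values())

reference  = "what year did the treaty take effect"
parrot     = "what year did the treaty take effect"  # copies the reference
paraphrase = "when did the treaty become binding"    # equally clear to a human

print(unigram_precision(parrot, reference))      # 1.0
print(unigram_precision(paraphrase, reference))  # 0.5
```

A question that merely parrots the reference scores perfectly, while an equally good paraphrase is penalized for using different words, mirroring the disconnect the study observed between BLEU and human ratings.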

The team employed a sequence-to-sequence model as their base question generator, then fine-tuned it using three specialized reward systems. For fluency, they used language model perplexity to measure how natural the questions sounded. For relevance, they trained a discriminator to judge whether questions matched their source documents. For answerability, they used a question-answering system to determine if questions could be answered from the given text. Each reward was designed to push the AI toward generating higher-quality questions through reinforcement learning.
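A rough sketch of how such reward signals might be combined into a single scalar for reinforcement-learning fine-tuning follows. The mapping from perplexity to a bounded score and the equal weighting are illustrative assumptions, not the paper's exact formulation:

```python
import math

def fluency_reward(perplexity: float) -> float:
    # Lower perplexity under a language model means a more natural question.
    # This mapping to (0, 1] is an illustrative choice, not the paper's.
    return 1.0 / (1.0 + math.log(perplexity))

def combined_reward(perplexity: float,
                    relevance_prob: float,
                    answerability_f1: float,
                    weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted average of the three reward signals described above.

    relevance_prob: discriminator's probability that the question
                    matches its source document.
    answerability_f1: overlap score of a QA system's answer against
                      the target span.
    """
    rewards = (fluency_reward(perplexity), relevance_prob, answerability_f1)
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)

# A fluent, relevant, answerable question gets a reward near 1.0;
# weak signals on any axis pull the scalar down.
print(combined_reward(perplexity=10.0, relevance_prob=0.8, answerability_f1=0.7))
```

In a policy-gradient setup, this scalar would scale the log-probability of each sampled question, nudging the generator toward outputs that score well on all three axes at once, which is exactly where, per the study, the automated signals and human judgment can part ways.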

The results showed clear patterns. Optimizing for a single reward reliably improved the corresponding automated metric: fluency optimization raised the fluency score by 1.48 points over the baseline, relevance optimization raised relevance by 1.06, and answerability optimization raised answerability by 1.30. Jointly optimizing all three rewards produced the best overall automated performance. Human evaluation, however, told a different story. Questions optimized for relevance showed the most consistent improvement in human ratings, while those optimized for answerability actually scored lower than the baseline. The researchers also found that 45% of questions from the answerability-optimized model asked about years or dates, compared with only 11.2% from the baseline, suggesting the model had learned to exploit superficial patterns rather than acquire genuine understanding.
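The kind of surface-pattern audit behind that 45% figure can be approximated with a simple keyword check over generated questions. The regex and sample questions here are an illustrative proxy, not the authors' exact procedure:

```python
import re

# Rough pattern for questions asking about years or dates, the category
# the answerability-optimized model overproduced.
DATE_PATTERN = re.compile(
    r"\b(what|which)\s+year\b|\bwhen\b|\bwhat\s+date\b",
    re.IGNORECASE,
)

def date_question_rate(questions: list[str]) -> float:
    """Fraction of questions that match the date/year pattern."""
    hits = sum(1 for q in questions if DATE_PATTERN.search(q))
    return hits / len(questions)

sample = [
    "In what year was the company founded?",
    "When did the war end?",
    "Who wrote the novel?",
    "What is the capital of the region?",
]
print(date_question_rate(sample))  # 0.5
```

A sharp jump in this rate after optimization is a cheap red flag that the model is gaming its reward rather than asking more varied, genuinely answerable questions.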

This research matters because question generation sits at the heart of many practical AI applications. Educational systems use it to create assessments, conversational AI employs it to maintain engaging dialogue, and research tools leverage it to build training datasets. If our optimization methods don't align with human judgment, we risk creating AI that performs well on paper but fails in real-world use. The findings suggest that current evaluation metrics like BLEU may not adequately capture question quality, potentially misleading AI development efforts.

The study acknowledges several limitations. The answerability reward performed poorly because current question-answering systems struggle with complex reasoning, introducing incorrect biases into the training process. The language model used for fluency evaluation sometimes penalized valid questions containing rare entities or required commonsense knowledge. Most importantly, the research shows that we still lack reliable automated methods for assessing whether AI-generated questions meet human standards of quality, leaving a crucial gap between technical optimization and practical utility.

Original Source

Read the complete research paper on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn