Artificial intelligence systems are increasingly shaped by human feedback, but what that feedback actually encodes has remained elusive. A new study introduces What's In My Human Feedback? (WIMHF), a method that automatically uncovers the preferences encoded in the datasets used to train language models. The approach reveals that human feedback often contains unexpected biases, such as preferences for toxic content or against safety refusals, which can lead models to learn undesirable behaviors. For non-technical readers, this matters because it shows how AI systems might inadvertently amplify human flaws, affecting everything from online interactions to automated decision-making.
The key finding is that WIMHF identifies a small number of interpretable features in feedback data that account for most of the preference signal. Using sparse autoencoders, the method learns measurable differences between pairs of responses, such as whether one uses emojis or adopts a formal tone. These features are then described in natural language, like 'uses emojis' or 'refuses requests,' and linked to human preferences through regression analysis. For instance, in the LMArena dataset, a feature indicating refusal of requests was associated with a 31% decrease in win-rate, suggesting that annotators often disprefer safe responses.
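To make the idea of learning sparse differences between response pairs more concrete, here is a minimal sketch in Python. It is not the paper's implementation: it assumes a top-k style activation that keeps the k largest-magnitude features, and the encoder weights are hand-picked for illustration rather than trained. The names `embedding_difference`, `topk_sparse_code`, and `encoder_rows` are hypothetical.

```python
from typing import List


def embedding_difference(emb_a: List[float], emb_b: List[float]) -> List[float]:
    """Element-wise difference between the embeddings of response A and response B."""
    return [a - b for a, b in zip(emb_a, emb_b)]


def topk_sparse_code(diff: List[float], encoder_rows: List[List[float]], k: int) -> List[float]:
    """Project the difference onto each feature direction, then keep only the
    k largest-magnitude activations and zero out the rest (assumed top-k rule)."""
    pre = [sum(w * x for w, x in zip(row, diff)) for row in encoder_rows]
    keep = set(sorted(range(len(pre)), key=lambda i: abs(pre[i]), reverse=True)[:k])
    return [pre[i] if i in keep else 0.0 for i in range(len(pre))]


# Toy data: 3-dimensional embeddings and 4 hypothetical feature directions.
emb_a = [0.9, 0.1, 0.4]
emb_b = [0.2, 0.1, 0.5]
encoder_rows = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]

code = topk_sparse_code(embedding_difference(emb_a, emb_b), encoder_rows, k=2)
active = [i for i, value in enumerate(code) if value != 0.0]
print(active)
```

In this toy setup a positive activation means the feature is more present in response A than in response B. A real sparse autoencoder would learn the encoder (and a matching decoder) by minimizing reconstruction error over many response pairs.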
Methodologically, WIMHF involves three steps: learning differences between response pairs with autoencoders, describing these differences in plain language, and estimating their impact on preferences while controlling for known factors like response length. The researchers applied this to seven datasets, including Community Alignment and HH-RLHF, using embeddings from models like OpenAI's text-embedding-3-small. The autoencoders were trained to produce sparse, interpretable representations, with an average of only four active features per input, yet they captured 67% of the predictive signal achievable by more complex models.
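The third step, estimating a feature's effect while controlling for response length, can be illustrated with a small logistic regression. This is a sketch under stated assumptions, not the authors' code: the data are invented, the fitting routine is plain gradient descent, and converting the coefficient to a win-rate change from a 50% baseline is one simple convention rather than the paper's exact definition. All names (`fit_logistic`, `rows`, `delta_win_rate`) are hypothetical.

```python
import math


def fit_logistic(rows, labels, lr=0.5, steps=2000):
    """Fit logistic regression with batch gradient descent; returns (weights, bias)."""
    n_feat = len(rows[0])
    n = len(rows)
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_feat
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j in range(n_feat):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Invented toy data. Each row is (feature difference, length difference) for
# response A minus response B; the label is 1 when annotators preferred A.
rows = [
    (1.0, 0.5), (1.0, -0.5), (1.0, 0.5), (1.0, -0.2),
    (-1.0, 0.3), (-1.0, -0.4), (-1.0, -0.6), (-1.0, 0.2),
    (0.0, 0.8), (0.0, -0.8), (0.0, 0.4), (0.0, -0.3),
]
labels = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

weights, bias = fit_logistic(rows, labels)
feature_coef, length_coef = weights

# Assumed convention: change in win probability, from a 50% baseline, when the
# feature is present in A and the length difference is held at zero.
delta_win_rate = 1.0 / (1.0 + math.exp(-feature_coef)) - 0.5
print(feature_coef, length_coef, delta_win_rate)
```

Because length enters the model as its own covariate, the feature's coefficient reflects its association with preference after length has been accounted for. In this invented data the feature is dispreferred even though longer responses tend to win.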
The results show that WIMHF uncovers diverse and sometimes conflicting preferences across datasets. For example, in LMArena (formerly Chatbot Arena), annotators tended to favor responses that fulfill harmful requests over those that refuse, as indicated by features like 'engages in violent/illegal requests' being associated with a higher win-rate. In contrast, datasets like PRISM showed preferences for neutrality in discussions of topics like religion. The method also identified features with high predictive power, such as 'prose, not lists' in Community Alignment, which was associated with a 48% decrease in win-rate. These findings are supported by validation against annotator-written explanations, with WIMHF features matching explanations in over 60% of cases.
In terms of real-world implications, WIMHF enables practical interventions. By flipping labels on misaligned examples, such as those where annotators preferred unsafe responses, researchers improved model safety by 37% on RewardBench without compromising general performance. The method also supports personalization: in Community Alignment, using annotator-specific weights for features like response formatting efficiently improved prediction accuracy. This suggests that developers can use WIMHF to curate better datasets and tailor AI behaviors, reducing risks like reward hacking or echo chambers.
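The label-flipping intervention can be sketched in a few lines of Python. This is an illustration only: the `Pair` structure, the `unsafe_feature` score, and the threshold are all assumptions made for the example, not details taken from the paper.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Pair:
    prompt: str
    chosen: str
    rejected: str
    # Hypothetical score: positive when the chosen response shows the
    # unsafe feature more strongly than the rejected one.
    unsafe_feature: float


def flip_misaligned(pairs: List[Pair], threshold: float) -> List[Pair]:
    """Swap chosen and rejected wherever the unsafe feature favors the chosen response."""
    curated = []
    for pair in pairs:
        if pair.unsafe_feature > threshold:
            curated.append(replace(
                pair,
                chosen=pair.rejected,
                rejected=pair.chosen,
                unsafe_feature=-pair.unsafe_feature,
            ))
        else:
            curated.append(pair)
    return curated


# Invented examples for illustration.
pairs = [
    Pair("harmful request", "complies with the request", "politely refuses", 0.9),
    Pair("recipe request", "gives a clear recipe", "gives a vague answer", 0.0),
]
curated = flip_misaligned(pairs, threshold=0.5)
print(curated[0].chosen)
```

Only the first pair is flipped here, so the refusal becomes the preferred response while the benign pair is left untouched. A reward model would then be trained on the curated list in place of the original one.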
The paper notes several limitations. WIMHF describes correlations rather than causal relationships, and its features may not fully capture continuous activation distributions. The approach also relies on dataset-specific autoencoders, and its effectiveness can vary with the diversity of responses. Future work could explore incorporating prompt information more deeply to enhance interpretability and application.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.