Text-to-image AI models can create stunning visuals from simple prompts, but they often produce generic results that don't match individual preferences. Current evaluation methods rely on one-size-fits-all metrics or fixed criteria, failing to capture the diverse ways people judge images. Researchers from Yonsei University have developed PIGReward, a personalized reward model that adapts its evaluation to each user's unique visual tastes, offering a scalable solution to this problem.
PIGReward works by dynamically generating user-specific evaluation dimensions based on limited reference data, such as a few images a person has liked or disliked in the past. Instead of using predefined criteria like "alignment" or "detail," it infers personalized aspects—such as "atmosphere" or "symbolism"—that matter most to the individual. The system then assesses image pairs through chain-of-thought reasoning, scoring each image along these inferred dimensions and providing explicit rationales for its judgments. This approach allows PIGReward to model complex human preferences without requiring extensive user-specific training data.
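To make that flow concrete, here is a minimal Python sketch of the two-step evaluation described above. The function names and prompts are illustrative assumptions, not PIGReward's published interface; `call_vlm` is a stand-in for the underlying vision-language model.

```python
def call_vlm(prompt: str) -> str:
    # Placeholder: in a real setup, route this to a VLM such as Qwen2-VL.
    return "atmosphere\nsymbolism\ncomposition"

def infer_dimensions(liked_refs: list[str], disliked_refs: list[str]) -> list[str]:
    """Step 1: derive user-specific evaluation dimensions from a few references."""
    prompt = (
        "These images were liked: " + ", ".join(liked_refs)
        + ". These were disliked: " + ", ".join(disliked_refs)
        + ". List the visual qualities that best explain this preference."
    )
    return [d.strip() for d in call_vlm(prompt).splitlines() if d.strip()]

def judge_pair(image_a: str, image_b: str, dims: list[str]) -> dict:
    """Step 2: chain-of-thought scoring of an image pair along those dimensions."""
    prompt = (
        f"Compare {image_a} and {image_b} along these dimensions: {dims}. "
        "Score each image per dimension and explain your reasoning."
    )
    return {"dimensions": dims, "rationale": call_vlm(prompt)}
```

The key design point is that the dimensions are produced per user at evaluation time rather than fixed in advance, so the same pair of images can be judged differently for different people.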
The methodology involves two key components: a preference reasoner and a reward model. The preference reasoner uses a self-bootstrapping strategy to analyze limited reference data and construct rich user contexts, converting implicit image preferences into explicit language-based explanations. It is trained with direct preference optimization (DPO) to enhance the diversity and accuracy of its generated rationales. The reward model is then fine-tuned on a dataset of 5,000 chain-of-thought-formatted samples, enabling it to perform structured, multi-dimensional evaluations conditioned on user contexts. Both components are built on the Qwen2-VL-7B vision-language model, leveraging its multimodal reasoning capabilities.
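For readers unfamiliar with DPO, the standard objective is compact enough to show in full. The PyTorch sketch below is the generic DPO loss, not code from the paper; the β value and how rationale pairs are constructed here are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    logp_chosen: torch.Tensor,      # policy log-prob of the preferred rationale
    logp_rejected: torch.Tensor,    # policy log-prob of the dispreferred one
    ref_logp_chosen: torch.Tensor,  # same quantities under a frozen reference model
    ref_logp_rejected: torch.Tensor,
    beta: float = 0.1,              # assumed value; controls deviation from the reference
) -> torch.Tensor:
    """Standard DPO objective: increase the margin by which the policy prefers
    the chosen output over the rejected one, relative to the reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Applied to the preference reasoner, the "chosen" and "rejected" sequences would be higher- and lower-quality rationales for the same references, which is how DPO can sharpen rationale quality without an explicit reward signal.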
Experiments across multiple datasets demonstrate PIGReward's effectiveness. On PIGBench, a new benchmark with 75 user records capturing diverse preferences for abstract prompts, PIGReward achieved an accuracy of 84.91% without ties, significantly outperforming baselines. For instance, UnifiedReward-Think scored 65.71%, while similarity-based metrics like CLIPScore and SSIM showed weak correlation with user preferences. Ablation studies revealed that chain-of-thought reasoning and the fine-tuning of both components were crucial, with performance dropping to 39.62% on PIGBench without these elements. The system also improved with more reference data, as shown in Figure 8, where accuracy increased as context size grew from 1 to 8 examples.
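As context for figures like 84.91%, here is how pairwise preference accuracy with ties excluded is typically computed. How PIGBench actually encodes ties is an assumption in this sketch.

```python
def pairwise_accuracy(predictions: list[int], labels: list[int]) -> float:
    """Fraction of image pairs where the model's preferred image matches the
    user's choice. We assume labels of 1 (first image), -1 (second image),
    and 0 (tie); tied pairs are dropped before scoring."""
    kept = [(p, l) for p, l in zip(predictions, labels) if l != 0]
    if not kept:
        raise ValueError("no non-tied pairs to score")
    return sum(p == l for p, l in kept) / len(kept)
```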
Beyond evaluation, PIGReward enables personalized prompt optimization, guiding text-to-image generation toward individual preferences. In tests, prompts optimized with PIGReward achieved an 85.85% win rate over base prompts and outperformed general-purpose prompt optimizers like Promptist, which scored 66.04%. This application allows users to refine image generation without sharing sensitive data, as PIGReward uses reasoning over references rather than direct access to personal images. The system's generated evaluation dimensions, visualized in Figure 9, show a diverse set of terms like "composition" and "lighting," reflecting its ability to adapt to varied user priorities.
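One plausible way to drive prompt optimization with a reward model is a best-of-N search, sketched below. The paper's exact procedure is not detailed in this summary, so every name and the loop structure here are hypothetical.

```python
from typing import Callable

def optimize_prompt(
    base_prompt: str,
    user_context: str,
    generate_image: Callable[[str], str],          # prompt -> image
    propose_rewrites: Callable[[str, int], list],  # prompt, n -> candidate prompts
    reward: Callable[[str, str], float],           # image, user context -> score
    n_candidates: int = 4,
) -> str:
    """Best-of-N prompt search: rewrite the prompt several ways, render each
    candidate, and keep whichever image the personalized reward model prefers."""
    best_prompt = base_prompt
    best_score = reward(generate_image(base_prompt), user_context)
    for candidate in propose_rewrites(base_prompt, n_candidates):
        score = reward(generate_image(candidate), user_context)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt
```

Because the reward model conditions on the user context rather than on the user's raw images, this loop can personalize generation while keeping the reference photos private.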
Despite its advances, PIGReward has limitations. Its performance depends on the quality and quantity of reference data, with accuracy plateauing as context size increases, as indicated in the ablation studies. The model also relies on large vision-language models, which may introduce biases or errors in reasoning. Future work could explore using PIGReward's reasoning traces as natural language feedback to further refine image generation models, making AI systems more interactive and aligned with individual users.