
AI Learns to See Images Like Humans Do

A new method fine-tunes image similarity metrics with human feedback, making AI tools more reliable for creative and practical tasks where personal judgment matters.

AI Research
April 03, 2026
3 min read

When using AI to generate images from text prompts, a common frustration arises: the computer's idea of what looks 'similar' often doesn't match our own. This misalignment can derail creative projects, educational tools, and even efforts to restore lost artwork. Researchers from the University of Oklahoma have developed a solution called CLPIPS, a customized image similarity metric that learns directly from human judgments, bridging the gap between algorithmic scores and subjective perception.

In a study involving 20 participants, the team found that their new metric, CLPIPS, aligns significantly better with human rankings of image similarity. Participants were tasked with iteratively refining text prompts to regenerate target images over 10 attempts, then ranking the resulting images from most to least visually similar. Using this human-generated data, the researchers fine-tuned an existing metric called LPIPS, which is widely used for perceptual comparisons. CLPIPS achieved a Spearman's rank correlation of 0.524 with human rankings, compared to 0.432 for the baseline LPIPS, indicating a stronger monotonic relationship where the metric's scores better reflect the order humans assign.
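To see what that correlation measures, here is a minimal sketch of Spearman's rank correlation computed on hypothetical data (the ranks and scores below are invented for illustration, not taken from the study):

```python
def rank(values):
    """Assign 1-based ranks to values (no ties in this toy example)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho via the classic 1 - 6*sum(d^2)/(n(n^2-1)) formula."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank(xs), rank(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical data: human ranks (1 = most similar) for 10 regenerated
# images, and a metric's distance scores for the same 10 images.
human_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
metric_scores = [0.12, 0.18, 0.15, 0.22, 0.30, 0.28, 0.35, 0.40, 0.38, 0.45]

print(round(spearman_rho(human_ranks, metric_scores), 3))  # prints 0.964
```

A rho of 1.0 would mean the metric orders images exactly as the humans did; the study's values of 0.432 versus 0.524 quantify how much closer CLPIPS gets to that ideal.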

The methodology centered on a lightweight fine-tuning approach that adjusts only the combination weights of the LPIPS framework, keeping its core visual features frozen to prevent overfitting. The team used a margin ranking loss function, which trains the model to correctly order image pairs based on human preferences—ensuring that images judged more similar by people receive lower distance scores from the metric. This process was applied to a dataset derived from the participants' rankings, with a 70/30 split for training and validation to monitor performance. The approach emphasizes rank-level alignment over raw score prediction, focusing on whether the metric can reproduce the sequence humans create when comparing images.
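The general shape of that training setup can be sketched in PyTorch. This is an illustrative reconstruction, not the authors' code: the per-layer LPIPS feature distances are simulated with random tensors, and only the linear combination weights over layers are trained, mirroring the frozen-backbone idea:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# LPIPS combines per-layer feature distances with learned linear weights;
# here only those combination weights are trainable (backbone "frozen").
n_layers = 5
weights = nn.Parameter(torch.ones(n_layers) / n_layers)
opt = torch.optim.Adam([weights], lr=1e-2)
loss_fn = nn.MarginRankingLoss(margin=0.05)

def combined_distance(layer_dists):
    # Softplus keeps the effective weights non-negative, so the combined
    # score remains a valid distance.
    return (F.softplus(weights) * layer_dists).sum(dim=-1)

# Toy batch of human-ordered pairs: image A was judged more similar to
# the target than image B, so A's combined distance should come out lower.
dists_a = torch.rand(32, n_layers) * 0.5        # "more similar" images
dists_b = torch.rand(32, n_layers) * 0.5 + 0.2  # "less similar" images
target = -torch.ones(32)  # -1 tells the loss to push d_a below d_b

initial_loss = loss_fn(combined_distance(dists_a),
                       combined_distance(dists_b), target).item()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(combined_distance(dists_a),
                   combined_distance(dists_b), target)
    loss.backward()
    opt.step()
final_loss = loss.item()
```

The margin ranking loss only penalizes pairs the metric orders incorrectly (or by less than the margin), which is why the paper can frame training as rank alignment rather than regression on raw scores.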

Results from the evaluation show consistent improvements across key metrics. CLPIPS achieved an Intraclass Correlation Coefficient (ICC) of 0.68, up from 0.60 for baseline LPIPS, indicating better agreement in absolute rankings between the metric and human raters. According to established guidelines, this shift moves the alignment from 'fair' to 'good' in some interpretations. Statistical tests confirmed the significance of these gains, with p-values far below 0.001 for both correlation measures. A paired bootstrap analysis further demonstrated that the improvement is robust across different target image sets, with a confidence interval for the ICC increase ranging from 0.071 to 0.1, suggesting the gain generalizes well beyond the training data.
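The paired bootstrap idea is straightforward to illustrate: resample the target image sets with replacement and recompute the improvement each time, then read a confidence interval off the resulting distribution. The per-set improvement values below are invented placeholders, not the study's data:

```python
import random

random.seed(0)

# Hypothetical per-target-set improvements (CLPIPS agreement minus
# baseline LPIPS agreement); in the paper this would be the ICC gain.
improvements = [0.05, 0.09, 0.12, 0.07, 0.10, 0.08, 0.11, 0.06, 0.09, 0.08]

def bootstrap_ci(data, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the mean: resample with replacement."""
    means = []
    for _ in range(n_boot):
        sample = [random.choice(data) for _ in data]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot)]
    return lo, hi

lo, hi = bootstrap_ci(improvements)
print(f"95% CI for mean improvement: [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero, as the paper's 0.071 to 0.1 range does, is what supports the claim that the improvement is robust rather than an artifact of a few favorable target sets.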

The implications of this work extend to various real-world applications where human-AI collaboration is essential. For instance, in creative restoration tasks, users could rely on CLPIPS to guide prompt adjustments that more accurately approximate damaged visual artifacts. In educational settings, it could help novice users learn prompt crafting by providing feedback that mirrors human intuition. The study also highlights potential for on-the-fly personalization in interactive systems, allowing metrics to adapt to individual user preferences during live workflows, though this was not explored in the current research.

Despite its advances, CLPIPS has limitations. The ICC value of 0.68, while improved, still indicates room for better alignment, possibly due to inherent noise in human judgment or the constraints of a single model averaging diverse user preferences. The study relied on a controlled dataset of 2000 image sets from a specific prompt refinement task, which may not fully capture the variability of broader visual domains. Future work could scale the approach with larger datasets, test generalization to unseen images, and explore dynamic adaptation for individual users in real-time applications.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn