We present Human Preference Dataset v3 (HPDv3), a wide-spectrum dataset of 1.17 million binary preference annotations over 1.08 million images, organized as prompt-matched image pairs.
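Conceptually, each annotation pairs two images produced for the same prompt with a binary human choice. The sketch below illustrates one such record; the field names and layout are our assumption for illustration, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class HPDv3Record:
    """One pairwise preference annotation (assumed schema, for illustration only)."""
    prompt: str   # shared text prompt for both images
    image_a: str  # path or URL of the first image
    image_b: str  # path or URL of the second image
    choice: int   # 0 if annotators preferred image_a, 1 if image_b

# Example record (hypothetical values):
record = HPDv3Record(
    prompt="a red fox standing in fresh snow",
    image_a="images/fox_model1.png",
    image_b="images/fox_model2.png",
    choice=0,
)
```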
Evaluating text-to-image generation models requires alignment with human perception, yet existing human-centric metrics are constrained by limited data coverage, suboptimal feature extraction, and inefficient loss functions. To address these challenges, we introduce Human Preference Score v3 (HPSv3). (1) We release HPDv3, the first wide-spectrum human preference dataset, integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons drawn from state-of-the-art generative models and from real-world images spanning low to high quality. (2) We introduce a VLM-based preference model trained with an uncertainty-aware ranking loss for fine-grained preference ranking. (3) We propose Chain-of-Human-Preference (CoHP), an iterative image refinement method that improves quality without extra data by using HPSv3 to select the best image at each step. Extensive experiments demonstrate that HPSv3 serves as a robust metric for wide-spectrum image evaluation, and that CoHP offers an efficient, human-aligned approach to improving image generation quality.
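To make the training objective concrete, below is a minimal PyTorch sketch of an uncertainty-aware ranking loss. It assumes the model predicts a Gaussian (mean and log-variance) over each image's score and that preferred/rejected scores are compared with a logistic ranking loss on reparameterized samples; this is our reading of the idea, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def uncertainty_aware_ranking_loss(mu_w, logvar_w, mu_l, logvar_l):
    """Logistic ranking loss on scores sampled from predicted Gaussians.

    mu_w / logvar_w: predicted mean and log-variance for the preferred image.
    mu_l / logvar_l: same for the rejected image.
    The reparameterized-sampling formulation here is an illustrative
    assumption, not necessarily the paper's exact objective.
    """
    std_w = torch.exp(0.5 * logvar_w)
    std_l = torch.exp(0.5 * logvar_l)
    # Reparameterization trick: sample a score from each predicted Gaussian.
    s_w = mu_w + std_w * torch.randn_like(std_w)
    s_l = mu_l + std_l * torch.randn_like(std_l)
    # Bradley-Terry style loss: the preferred sample should outscore the rejected one.
    return F.softplus(-(s_w - s_l)).mean()

# Toy usage with a batch of 4 pairs:
mu_w, lv_w = torch.randn(4), torch.zeros(4)
mu_l, lv_l = torch.randn(4), torch.zeros(4)
loss = uncertainty_aware_ranking_loss(mu_w, lv_w, mu_l, lv_l)
```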
HPSv3 scores of text-to-image models across prompt categories (higher is better):

Models | All | Characters | Arts | Design | Architecture | Animals | Natural Scenery | Transportation | Products | Plants | Food | Science | Others |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Kolors | 10.55 | 11.79 | 10.47 | 9.87 | 10.82 | 10.60 | 9.89 | 10.68 | 10.93 | 10.50 | 10.63 | 11.06 | 9.51 |
Flux-dev | 10.43 | 11.70 | 10.32 | 9.39 | 10.93 | 10.38 | 10.01 | 10.84 | 11.24 | 10.21 | 10.38 | 11.24 | 9.16 |
Playground-v2.5 | 10.27 | 11.07 | 9.84 | 9.64 | 10.45 | 10.38 | 9.94 | 10.51 | 10.62 | 10.15 | 10.62 | 10.84 | 9.39 |
Infinity | 10.26 | 11.17 | 9.95 | 9.43 | 10.36 | 9.27 | 10.11 | 10.36 | 10.59 | 10.08 | 10.30 | 10.59 | 9.62 |
CogView4 | 9.61 | 10.72 | 9.86 | 9.33 | 9.88 | 9.16 | 9.45 | 9.69 | 9.86 | 9.45 | 9.49 | 10.16 | 8.97 |
PixArt-Σ | 9.37 | 10.08 | 9.07 | 8.41 | 9.83 | 8.86 | 8.87 | 9.44 | 9.47 | 9.52 | 9.73 | 10.35 | 8.58 |
Gemini 2.0 Flash | 9.21 | 9.98 | 8.44 | 7.64 | 10.11 | 9.42 | 9.01 | 9.74 | 9.64 | 9.55 | 10.16 | 7.61 | 9.23 |
SDXL | 8.20 | 8.67 | 7.63 | 7.53 | 8.57 | 8.18 | 7.76 | 8.65 | 8.85 | 8.32 | 8.43 | 8.78 | 7.29 |
HunyuanDiT | 8.19 | 7.96 | 8.11 | 8.28 | 8.71 | 7.24 | 7.86 | 8.33 | 8.55 | 8.28 | 8.31 | 8.48 | 8.20 |
SD3-Medium | 5.31 | 6.70 | 5.98 | 5.15 | 5.25 | 4.09 | 5.24 | 4.25 | 5.71 | 5.84 | 6.01 | 5.71 | 4.58 |
SD2 | -0.24 | -0.34 | -0.56 | -1.35 | -0.24 | -0.54 | -0.32 | 1.00 | 1.11 | -0.01 | -0.38 | -0.38 | -0.84 |
Chain-of-Human-Preference (CoHP) is an iterative image refinement method that improves quality without extra data: at each step, HPSv3 scores the candidate images and the highest-scoring one is carried into the next round.
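A minimal sketch of that selection loop follows, assuming hypothetical `generate_candidates` (any image generator) and `hpsv3_score` helpers; the actual CoHP procedure may differ in how candidates are produced and conditioned.

```python
def cohp_refine(prompt, generate_candidates, hpsv3_score,
                num_steps=3, num_candidates=4):
    """Iteratively refine an image by keeping the HPSv3-preferred candidate.

    generate_candidates(prompt, base_image, n) -> list of images (hypothetical)
    hpsv3_score(prompt, image) -> float, higher = more preferred (hypothetical)
    """
    best = None
    for _ in range(num_steps):
        # Generate candidates, optionally conditioned on the current best image.
        candidates = generate_candidates(prompt, best, num_candidates)
        # Keep whichever image HPSv3 prefers, including the current best.
        pool = candidates + ([best] if best is not None else [])
        best = max(pool, key=lambda img: hpsv3_score(prompt, img))
    return best
```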
We employ DanceGRPO as the reinforcement learning algorithm for image generation and use HPSv3 as the reward model to refine SD1.4.
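As an illustration, a reward hook for an RL trainer might look like the following; `hpsv3_score` is a stand-in for whichever scoring interface the released model exposes, and the per-sample batching is our assumption rather than DanceGRPO's actual API.

```python
import torch

def hpsv3_reward_fn(prompts, images, hpsv3_score):
    """Batch reward hook for an RL trainer (interface assumed, not DanceGRPO's API).

    prompts: list[str]; images: decoded images sampled from the policy model.
    hpsv3_score(prompt, image) -> float (hypothetical scoring helper).
    Returns a tensor of per-sample rewards for the policy-gradient update.
    """
    rewards = [hpsv3_score(p, img) for p, img in zip(prompts, images)]
    return torch.tensor(rewards, dtype=torch.float32)
```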
@misc{ma2025hpsv3widespectrumhumanpreference,
  title={HPSv3: Towards Wide-Spectrum Human Preference Score},
  author={Yuhang Ma and Xiaoshi Wu and Keqiang Sun and Hongsheng Li},
  year={2025},
  eprint={2508.03789},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.03789},
}