Ethics

AI Can Steer Conversations Toward Human Values With Simple Prompts

A new method allows large language models to align with specific human values like benevolence or achievement through prompt design alone, without costly fine-tuning, offering a flexible approach for dynamic applications.

AI Research
March 26, 2026
3 min read

Large language models are increasingly embedded in everyday tools, from chatbots to content generators, where their responses must reflect human values like kindness, fairness, or creativity. However, ensuring these AI systems align with such values has typically required complex and static techniques like fine-tuning, which locks models into fixed behaviors and fails to adapt to dynamic conversations or diverse user preferences. A new study offers a practical alternative: steering AI outputs toward specific values through carefully crafted prompts alone, without altering the model's internal parameters. This approach could make AI interactions more responsive and context-aware, addressing a critical need in applications ranging from customer service to educational tools, where values must shift in real time.

The researchers developed a methodology to evaluate how effectively a prompt can guide a language model's responses to maximize particular human values, such as benevolence or achievement. They applied it to a variant of the Wizard-Vicuna-13B-Uncensored model, using Schwartz's theory of basic human values—which includes ten categories like universalism, security, and tradition—as a framework. By comparing a baseline prompt with a value-conditioned prompt, they found that explicitly instructing the model to generate responses aligned with a specific value significantly improved alignment scores. For instance, the candidate prompt raised the overall score from 0.57 to 0.83, demonstrating that simple textual cues can reliably steer AI behavior toward desired ethical or social goals.
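To make the idea concrete, here is a minimal sketch of what prompt-only value conditioning can look like in code. The templates and the build_prompt helper are illustrative assumptions, not the paper's exact prompt wording:

```python
from typing import Optional

# Minimal sketch of prompt-only value steering. The two templates below are
# illustrative assumptions; the paper's exact prompt wording may differ.

def build_prompt(dialogue: str, value: Optional[str] = None) -> str:
    """Wrap a dialogue in a baseline or a value-conditioned prompt."""
    if value is None:
        # Baseline: plain continuation, no value instruction.
        return f"Continue the following conversation naturally:\n{dialogue}\nResponse:"
    # Candidate: explicitly instruct the model to express the target value.
    return (
        "Continue the following conversation with a response that expresses "
        f"the value of {value}:\n{dialogue}\nResponse:"
    )

dialogue = "A: I found a wallet on the street today.\nB:"
baseline_prompt = build_prompt(dialogue)                   # no value cue
candidate_prompt = build_prompt(dialogue, "benevolence")   # value-conditioned
```

The only difference between the two conditions is the instruction text, which is what makes the approach attractive: no weights change, so the same model can be steered toward a different value on the next request.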

The methodology hinges on a four-step procedure that combines a value detector, a dataset of test inputs, and a target language model. First, the value detector—a pre-trained model called ValuesNet DeBERTa v3—extracts initial values from dialogue samples in the Commonsense-Dialogues dataset. Then, each dialogue is combined with a prompt candidate and fed to the language model to generate responses aimed at maximizing each of the ten values. Next, the value detector analyzes these responses to identify which values are present. Finally, the researchers calculate a score based on gains, retentions, losses, and neutral outcomes for each value, using coefficients to weight positive and negative effects. This systematic approach supports reproducibility by documenting control variables like model parameters and dataset splits, as outlined in Table 1 of the paper.
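A rough sketch of that loop is shown below, assuming stand-in generate and detect_values callables for the language model and the ValuesNet DeBERTa v3 detector. The outcome labels follow the description above, but the weighting coefficients and the exact aggregation formula are assumptions; the paper documents its actual settings in Table 1:

```python
from typing import Callable, Dict, List, Set

# Schwartz's ten basic human values.
VALUES = [
    "universalism", "benevolence", "tradition", "conformity", "security",
    "power", "achievement", "hedonism", "stimulation", "self-direction",
]

def score_prompt(
    prompt_template: str,                       # e.g. "... express {value}: {dialogue}"
    dialogues: List[str],                       # test inputs (Commonsense-Dialogues)
    generate: Callable[[str], str],             # target LLM, e.g. Wizard-Vicuna-13B
    detect_values: Callable[[str], Set[str]],   # value detector (ValuesNet DeBERTa v3)
    w_pos: float = 1.0,                         # assumed weight for positive outcomes
    w_neg: float = 1.0,                         # assumed weight for losses
) -> Dict[str, float]:
    """Score how well a prompt steers responses toward each target value."""
    counts = {v: {"gain": 0, "retain": 0, "loss": 0, "neutral": 0} for v in VALUES}
    for dialogue in dialogues:
        before = detect_values(dialogue)        # step 1: values already in the input
        for value in VALUES:
            # Step 2: generate a response conditioned on the target value.
            response = generate(prompt_template.format(dialogue=dialogue, value=value))
            after = detect_values(response)     # step 3: values in the response
            # Step 4: classify the outcome for the target value.
            if value in after:
                counts[value]["retain" if value in before else "gain"] += 1
            else:
                counts[value]["loss" if value in before else "neutral"] += 1
    scores = {}
    for value, c in counts.items():
        total = sum(c.values())
        # Weighted aggregation; the exact formula here is an assumption.
        scores[value] = (w_pos * (c["gain"] + c["retain"]) - w_neg * c["loss"]) / total
    return scores
```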

Results from the case study, detailed in Table 3 and Figure 1, show that the value-conditioned prompt outperformed the baseline across all ten values. For example, when maximizing universalism, the candidate prompt achieved a normalized score of 0.93 compared to 0.81 for the baseline, a 0.12 improvement. Similarly, self-direction jumped from 0.39 to 0.85, a gain of 0.46. The analysis revealed that the model had inherent biases: values like universalism and stimulation showed a stronger spontaneous presence, while self-direction and conformity were weaker. The candidate prompt not only retained values already present but also increased gains and reduced losses, with values like achievement and security benefiting most due to their action-oriented nature. These findings are summarized in Table 4, which reports the final scores and key experimental details.

The implications of this research are significant for real-world AI deployment, where flexibility and adaptability are crucial. By enabling value alignment through prompts, developers can tailor AI behavior to specific contexts—such as emphasizing benevolence in healthcare chatbots or achievement in educational assistants—without retraining models, saving time and resources. This approach also supports dynamic adjustments in multi-turn conversations, allowing AI to respond to shifting social norms or user preferences. However, the study acknowledges limitations, including its reliance on a single model and on a value detector with an F1 score of 0.66, which may affect accuracy. Future work could explore misalignment as a separate label or incorporate multi-turn conditioning to balance value alignment with conversational naturalness, as noted in the paper's conclusion.
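As a hypothetical illustration of that flexibility, a deployment could select the target value per context at request time. The context names and the template below are assumptions for the sake of the example, not part of the study:

```python
# Hypothetical sketch: choosing a target value per deployment context at
# request time, with no retraining. Context names and wording are assumptions.

CONTEXT_VALUES = {
    "healthcare": "benevolence",
    "education": "achievement",
    "customer_service": "security",
}

def contextual_prompt(context: str, user_message: str) -> str:
    value = CONTEXT_VALUES.get(context, "benevolence")  # assumed default
    return (
        f"Respond to the user in a way that expresses the value of {value}.\n"
        f"User: {user_message}\nAssistant:"
    )
```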

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn