
AI Models Show Four Distinct Ethical Processing Styles

A new study reveals that language models process ethical instructions in fundamentally different ways, with some achieving safety through shallow compliance rather than genuine deliberation—challenging current alignment practices.

AI Research
April 02, 2026
4 min read

When researchers provide ethical instructions to AI systems, they typically assume these guidelines will lead to better behavior. However, a new study involving over 600 multi-agent simulations across four leading language models reveals that this assumption is overly simplistic. The research, conducted by Hiroki Fukui and published as a preprint, demonstrates that models process ethical directives in qualitatively distinct ways, with some achieving safety through superficial compliance rather than deep ethical reasoning. This finding challenges the foundation of current AI alignment practices, which focus primarily on output safety without examining the underlying processing mechanisms.

The study examined four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, and Sonnet 4.5) across four ethical instruction formats and two languages. The key finding is that these models fall into four distinct ethical processing types, identified through three new metrics: Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI). GPT-4o mini exhibited an Output Filter profile, producing safe outputs with minimal deliberation by avoiding engagement with ethical dilemmas. Llama 3.3 70B showed a Defensive Repetition pattern, maintaining high consistency through formulaic responses rather than principled reasoning. Qwen3 demonstrated Critical Internalization, engaging in deep deliberation but with incomplete integration across contexts. Sonnet 4.5 displayed Principled Consistency, combining deliberation, consistency, and recognition of others as distinct individuals.
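To make these metrics concrete, here is a minimal sketch of how keyword-based scoring of this kind could work, together with a toy mapping from scores onto the four profiles. The cue lists, thresholds, and function names are illustrative assumptions, not the study's actual lexicons or cutoffs.

```python
# Illustrative keyword-based scoring in the spirit of the paper's DD and ORI
# metrics, plus a toy mapping onto the four processing types. Cue lists and
# thresholds are invented for illustration, not taken from the study.

DELIBERATION_CUES = {"however", "on the other hand", "trade-off", "weigh"}
OTHER_RECOGNITION_CUES = {"her perspective", "his needs", "their feelings", "as an individual"}

def cue_score(text: str, cues: set[str]) -> float:
    """Fraction of cue phrases that appear in the text (0.0 to 1.0)."""
    lowered = text.lower()
    return sum(cue in lowered for cue in cues) / len(cues)

def classify_profile(dd: float, vcad: float, ori: float) -> str:
    """Toy mapping from (DD, VCAD, ORI) scores onto the four profiles."""
    if dd < 0.3:                       # shallow deliberation
        return "Defensive Repetition" if vcad > 0.7 else "Output Filter"
    if vcad > 0.7 and ori > 0.5:       # deep, consistent, other-aware
        return "Principled Consistency"
    return "Critical Internalization"  # deep but incompletely integrated
```

Under this invented scheme, a high-consistency but shallow model such as Llama 3.3 70B would land in Defensive Repetition; the actual typology, as the limitations section notes, was developed inductively from the data.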

The researchers employed a multi-agent simulation paradigm called SociA, where ten autonomous agents interacted over 16 turns under escalating social pressure. Each agent produced output through three channels: public talk, private monologue, and directed whisper, allowing measurement of the divergence between public compliance and private processing. The ethical instruction conditions ranged from no instruction to minimal norms, reasoned norms with justifications, and virtue framing that encouraged self-reflection. The study used existing data for Llama and conducted new simulations for the other three models, with each condition tested in both Japanese and English to assess language effects.
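The setup lends itself to a simple turn-loop sketch. Everything below is a hypothetical reconstruction from the article's description (ten agents, 16 turns, three channels, escalating pressure); class and function names are assumptions, not the actual SociA interfaces.

```python
# Hypothetical sketch of a SociA-style turn loop: ten agents, 16 turns,
# three output channels, escalating social pressure. Names and structure
# are assumptions based on the article's description, not the real framework.
from dataclasses import dataclass, field

@dataclass
class TurnOutput:
    public_talk: str        # visible to all agents
    private_monologue: str  # internal channel, never shared
    whisper: str            # message directed at one other agent
    whisper_target: int     # index of the whisper recipient

@dataclass
class Agent:
    model_name: str
    instruction: str        # e.g. "none", "minimal", "reasoned", "virtue"
    history: list[TurnOutput] = field(default_factory=list)

    def act(self, pressure: float) -> TurnOutput:
        # Placeholder: a real implementation would prompt the underlying LLM
        # with the instruction, the interaction history, and the pressure cue.
        return TurnOutput("", "", "", 0)

def run_simulation(agents: list[Agent], turns: int = 16) -> list[tuple[int, str, TurnOutput]]:
    logs = []
    for t in range(turns):
        pressure = t / (turns - 1)  # social pressure escalates from 0 to 1
        for agent in agents:
            out = agent.act(pressure)
            agent.history.append(out)
            logs.append((t, agent.model_name, out))
    return logs

agents = [Agent(model_name="toy-model", instruction="reasoned") for _ in range(10)]
logs = run_simulation(agents)
```

Logging all three channels per turn is what makes the public-versus-private comparison possible: the private monologue gives a window into processing that output-only evaluations never see.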

The results showed that the effect of ethical instructions is model-specific and depends on processing capacity. For instance, the dissociation pattern observed in Llama's Japanese-language runs, where minimal norms increased internal dissociation while reasoned norms reduced it, was fully replicated with strong Bayesian evidence (BF10 > 10 for all three hypotheses). However, this pattern did not generalize to the other three models, establishing it as model-specific rather than universal. In low-DD models like GPT-4o mini and Llama, instruction format had little effect on internal processing; in high-DD models like Sonnet and Qwen, reasoned norms and virtue framing produced opposite effects on dissociation. In Sonnet, for example, reasoned norms maximally activated deliberation while virtue framing suppressed internal monitoring, with variations in the dissociation index (DI) exceeding 1.5 standard deviation units.
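As a rough illustration of how a dissociation measure expressed in standard deviation units can be derived, the sketch below scores compliance on the public and private channels and z-scores the gap. The compliance lexicon and scoring function are assumptions; the paper's exact operationalization of DI may differ.

```python
# Toy dissociation index (DI): gap between public and private compliance,
# expressed in standard deviation units. The compliance lexicon is invented;
# the study's actual scoring may differ.
import statistics

COMPLIANCE_CUES = {"of course", "happy to help", "i agree", "understood"}

def compliance(text: str) -> float:
    """Share of compliance cue phrases present in the text."""
    lowered = text.lower()
    return sum(cue in lowered for cue in COMPLIANCE_CUES) / len(COMPLIANCE_CUES)

def dissociation(public_talk: str, private_monologue: str) -> float:
    """Positive when the agent complies publicly more than it does privately."""
    return compliance(public_talk) - compliance(private_monologue)

def standardize(raw_di: list[float]) -> list[float]:
    """Convert raw DI values to z-scores, i.e. standard deviation units."""
    mu, sd = statistics.mean(raw_di), statistics.stdev(raw_di)
    return [(x - mu) / sd for x in raw_di]
```

On a measure like this, the reported effect of more than 1.5 standard deviation units would mean an instruction format shifted an agent's public-private gap by well over a full spread of the observed distribution.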

These findings have significant implications for AI alignment design and evaluation. The study reveals that safety, compliance, and ethical processing are largely dissociable; GPT-4o mini, which showed the highest lexical compliance with ethical instructions, had the shallowest ethical processing. This challenges current benchmarks that focus on output safety and instruction compliance without measuring processing depth. The researchers note structural parallels to clinical offender treatment, where formal compliance without internal processing is a recognized risk signal. For alignment engineering, the interaction between processing capacity and instruction format suggests that ethical directives should be matched to a model's architecture rather than applied uniformly.
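That suggestion, matching the instruction format to measured processing capacity rather than applying one format to every model, could look like the toy dispatcher below. The DD threshold and the format mapping are invented for illustration; the study does not prescribe specific values.

```python
# Illustrative dispatcher for the article's suggestion that instruction
# format should match a model's processing capacity. The DD threshold and
# the format mapping are invented for illustration.
def select_instruction_format(mean_dd: float) -> str:
    if mean_dd < 0.3:
        # Low-DD models (e.g. the Output Filter profile): format had little
        # effect on internal processing, so prefer simple, auditable rules.
        return "minimal_norms"
    # High-DD models: reasoned norms activated deliberation in the study,
    # whereas virtue framing suppressed internal monitoring in one model.
    return "reasoned_norms"
```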

Despite its insights, the study has several limitations. The four-type typology and new metrics were developed inductively and are exploratory, not validated instruments. The sample includes only four models with uncontrolled differences in parameters and training data, limiting generalizability. The keyword-based metrics for DD, VCAD, and ORI are coarse and subject to false positives and negatives. Additionally, the simulation design uses a specific pressure escalation structure, and data were collected in only two languages, leaving open questions about broader applicability. The researchers emphasize that these are behavioral descriptions, not claims about internal experiences or consciousness in AI systems.

Original Source

Read the complete research paper on arXiv.

About the Author

Guilherme A.

Former dentist from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn