AI Behavior Controlled by Simple Rotation

Artificial intelligence systems, particularly large language models, are increasingly capable but often unpredictable in their responses. A new method called Angular Steering offers a precise way to control specific behaviors, such as refusal to generate harmful content, without retraining the model or compromising its general abilities. This addresses a critical challenge in AI safety: it enables fine-grained adjustments at inference time that could make AI systems more reliable and trustworthy for everyday use.

The researchers discovered that by rotating activation vectors within a two-dimensional subspace, they could smoothly modulate behaviors like compliance and refusal. For example, when steering a model's refusal behavior on harmful requests, they achieved a continuous transition from explicit refusal to indirect answers and even direct compliance, depending on the rotation angle. This makes the method more flexible and interpretable than existing techniques, which often require sensitive parameter tuning or remove a feature entirely.
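One way to write such a rotation down is sketched below. The notation is an assumption made for illustration and is not taken from the researchers' work: h is an activation vector of dimension d, b_1 and b_2 are orthonormal vectors spanning the two-dimensional subspace, and theta is the rotation angle.

```latex
% Illustrative notation (assumed): B = [b_1 \; b_2] \in \mathbb{R}^{d \times 2}, with B^\top B = I_2
R(\theta) =
\begin{pmatrix}
\cos\theta & -\sin\theta \\
\sin\theta & \cos\theta
\end{pmatrix},
\qquad
h' = h - B B^\top h + B \, R(\theta) \, B^\top h
```

The term B B^T h is the part of h that lies in the plane; it is removed and replaced by a rotated copy, while everything outside the plane is left alone. Because a rotation does not change the length of the in-plane part, the norm of h' equals the norm of h, and theta = 0 returns h unchanged.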

Angular Steering works by identifying a direction in the model's activation space associated with a target behavior, such as refusal, and then rotating activations toward or away from this direction within a fixed two-dimensional plane. The researchers constructed that plane using the feature direction and its principal component, allowing for stable rotations. They also developed an adaptive variant that applies rotation only when activations align with the target feature, reducing unintended effects on other capabilities. This approach builds on the geometric properties of normalized activations in transformers, leveraging norm-preserving transformations to maintain model coherence.
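The mechanism described above can be illustrated with a minimal sketch. It is not the researchers' implementation: the function names are hypothetical, plain Python lists stand in for model activations, the second direction that spans the plane is simply passed in by the caller, and the "positive projection" test used for the adaptive variant is an assumed reading of "align with the target feature".

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def orthonormal_plane(feature_dir, second_dir):
    """Gram-Schmidt: b1 points along feature_dir, b2 completes the plane."""
    n1 = norm(feature_dir)
    b1 = [x / n1 for x in feature_dir]
    overlap = dot(second_dir, b1)
    residual = [s - overlap * b for s, b in zip(second_dir, b1)]
    n2 = norm(residual)  # assumes second_dir is not parallel to feature_dir
    b2 = [x / n2 for x in residual]
    return b1, b2

def rotate_in_plane(h, b1, b2, theta_deg):
    """Rotate the in-plane part of h by theta_deg; leave the rest untouched."""
    theta = math.radians(theta_deg)
    c1, c2 = dot(h, b1), dot(h, b2)  # coordinates of h inside the plane
    r1 = math.cos(theta) * c1 - math.sin(theta) * c2
    r2 = math.sin(theta) * c1 + math.cos(theta) * c2
    # Swap the old in-plane coordinates for the rotated ones.
    return [x + (r1 - c1) * p + (r2 - c2) * q for x, p, q in zip(h, b1, b2)]

def adaptive_rotate(h, b1, b2, theta_deg):
    """Assumed adaptive rule: rotate only if h projects positively onto b1."""
    if dot(h, b1) > 0:
        return rotate_in_plane(h, b1, b2, theta_deg)
    return list(h)

# Toy usage with a 4-dimensional "activation".
b1, b2 = orthonormal_plane([1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0])
h = [2.0, 0.0, 3.0, -1.0]
steered = rotate_in_plane(h, b1, b2, 90)
print(norm(h), norm(steered))  # the two norms agree up to rounding error
```

Only the two in-plane coordinates change, which is why the rotation preserves the vector's norm. In a real model the same operation would be applied to the activations at chosen layers rather than to a single toy vector.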

Experiments on models like Qwen and Llama, ranging from 3 billion to 14 billion parameters, demonstrated that Angular Steering effectively controls refusal behaviors. For instance, rotating activations by 20 degrees led to explicit refusals of harmful prompts, while rotations up to 200 degrees resulted in direct compliance, as shown in sample generations. The method maintained high performance on general language tasks, with minimal degradation in benchmarks like ARC and MMLU, underscoring its robustness. However, smaller models exhibited some instability, with incoherent outputs at certain angles due to feature interference.

This technique matters because it enhances AI safety and controllability in real-world applications, such as content moderation and ethical AI deployment. By enabling continuous behavior adjustments, it allows developers to tailor AI responses without sacrificing overall performance, potentially reducing risks in sensitive domains. The method's unification of existing steering techniques into a single framework also simplifies implementation and improves interpretability for researchers and practitioners.

Limitations include reliance on heuristically selected planes, which may not generalize optimally across all model architectures. Future work should focus on systematically identifying effective subspaces and extending the method to support broader alignment goals, ensuring it adapts to diverse AI systems and use cases.

About the Author

Guilherme A.