A curated collection of papers on activation steering and representation engineering in large language models.
Since January 2026, I've been diving into this field as a beginner, and keeping up with the papers has been no small feat. There may be errors along the way, but if you're new here, too, I hope it saves you some of the confusion I went through.
- A technique for guiding model behavior by directly modifying internal activation values at inference time—typically by intervening in the residual stream.
- The term "Activation Engineering" was coined in the paper Steering Language Models With Activation Engineering (arXiv, August 2023).
- The central challenge is finding a meaningful steering direction within the LLM's high-dimensional latent space that reliably produces the desired behavioral change.
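To make the bullets above concrete, here is a minimal sketch of one common way to get a steering direction: take the difference of mean activations between examples that show a behavior and examples that don't, then add a scaled copy of that direction to a hidden state. The toy vectors, the function names, and the `alpha` scale are all my own illustrative assumptions. A real setup would read and write activations at a chosen layer of an actual model rather than use hand-written lists.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Difference of means: mean(positive) - mean(negative)."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add the scaled steering direction to a single hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy stand-ins for residual-stream activations (hypothetical values).
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]  # prompts showing the behavior
negative = [[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]]  # prompts without it

direction = steering_direction(positive, negative)
steered = steer([0.5, 0.5, 0.5], direction, alpha=2.0)
print(direction)  # [2.0, 0.0, -1.0]
print(steered)    # [4.5, 0.5, -1.5]
```

The scale `alpha` is the knob that usually needs tuning: too small and nothing changes, too large and the output tends to lose coherence.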
From my own experimentation, I'd lean toward yes. That said, measuring the effectiveness of steering remains a non-trivial challenge, and in some domains, it demonstrably underperforms both prompting and fine-tuning.
The industry is moving fast — Anthropic is actively publishing updates in this space. The following resources are highly recommended:
- Transformer Circuits — Anthropic's research blog on mechanistic interpretability. Essential reading for understanding how language models work at a circuit level.
- Activation Engineering posts on LessWrong — A curated collection of posts exploring how to steer model behavior by intervening on internal activations.
- Benchmarks
- Activation Steering
- Representation Engineering
- Persona & Role-Play in LLMs
- Linear Representations in LLMs
- Personality Modelling in LLMs
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders (ICML 2025 Spotlight)
- Steer-Bench: A Benchmark for Evaluating the Steerability of LLMs (2025-01, arXiv)
- AISteer360 (2025-01)
- MIB: A Mechanistic Interpretability Benchmark (ICML 2025)
- ActAdd: Steering language models with activation engineering (2023-08, arXiv)
- ITI: Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (NeurIPS 2023 Spotlight)
- CAA: Steering Llama 2 via Contrastive Activation Addition (ACL 2024 🏆 Outstanding Paper)
- Analyzing the Generalization and Reliability of Steering Vectors (2024-07, arXiv)
- Reliability Challenges in Steering Language Models (2025-04, arXiv)
- SAEs Are Good for Steering — If You Select the Right Features (EMNLP 2025)
- Improved Representation Steering for Language Models (NeurIPS 2025)
- HyperSteer: Activation Steering at Scale with Hypernetworks (2025-06, arXiv)
- Steerable Chatbots: Personalizing LLMs with Preference-Based Activation Steering (2025-05, arXiv)
- Mitigating Overthinking in Large Reasoning Models via Manifold Steering (2025-05, arXiv)
- In-Distribution Steering: Balancing Control and Coherence in Language Model Generation (2025-10, arXiv)
- BILLY: Steering LLMs via Merging Persona Vectors for Creative Generation (2025-10, arXiv)
- CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering LLMs (ACL 2025 Findings)
- Toward universal steering and monitoring of AI models (2026, Science)
- Weight Updates as Activation Shifts: A Principled Framework for Steering (2026-02, arXiv)
- Attention Residuals (2026-03, arXiv)
- Fine-Grained Activation Steering: Steering Less, Achieving More (ICLR 2026)
- Steer Like the LLM: Activation Steering that Mimics Prompting (ICML 2026)
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control (ICML 2026)
- Refusal in language models is mediated by a single direction (2024-06, arXiv)
- Programming Refusal with Conditional Activation Steering (ICLR 2025 Spotlight)
- Identifying and Manipulating Personality Traits in LLMs Through Activation Engineering (2024-12, arXiv)
- Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs (2025-10, arXiv)
- PERSONA: Dynamic and Compositional Inference-Time Personality Control via Activation Vector Algebra (ICLR 2026)
- Programming Refusal with Conditional Activation Steering (ICLR 2025 Spotlight)
- ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment (ICLR 2026)
- Improving Reasoning in LLMs via Representation Engineering (2025-04, arXiv)
- Steering LLMs' Reasoning with activation state machines (2025-09, OpenReview)
- RepE: Representation Engineering: A Top-Down Approach to AI Transparency (2023-10, arXiv)
- Looking inward: Language models can learn about themselves by introspection (2024-10, arXiv)
- LLM evaluators recognize and favor their own generations (2024-04, arXiv)
- Inspection and control of self-generated-text recognition ability in Llama3-8B-Instruct (ICLR 2025)
- Tell me about yourself: LLMs are aware of their learned behaviors (ICLR 2025)
- Simple mechanistic explanations for out-of-context reasoning (ICML 2025 Workshop)
- Emergent introspective awareness in large language models (2026-01, arXiv)
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models (2025-07, arXiv)
- The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models (2026-01, arXiv)
- Role Play with large language models (Nature, 2023-11)
- RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities (2023-10, arXiv)
- Character-LLM: A Trainable Agent for Role-Playing (EMNLP 2023)
- Measuring and Controlling Instruction (In)Stability in Language Model Dialogs (COLM 2024)
- From Persona to Personalization: A Survey on Role-Playing Language Agents (TMLR 2024)
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (2023-10)
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (2024-05)
- Circuit Tracing: Revealing Computational Graphs in Language Models (2025-03)
- On the Biology of a LLM (2025-03)
- Emotion Concepts and their Function in a Large Language Model (2026-04)
- Sparse Autoencoders Find Highly Interpretable Features in LMs (2023-09, arXiv)
- Transcoders Find Interpretable LLM Feature Circuits (NeurIPS 2024)
- Auditing Language Models for Hidden Objectives (2025-03, arXiv)
- PersonaLLM: Investigating the ability of LLMs to express personality traits (NAACL 2024 Findings)
- Personality Traits in LLMs (2025-07, arXiv)
- Big5-Chat: Shaping LLM personalities through training on human-grounded data (2024-10, arXiv)
- Personality Vector: Modulating Personality of LLMs by Model Merging (EMNLP 2025)