Awesome Activation Steering

A curated collection of papers on activation steering and representation engineering in large language models.

Since January 2026, I've been diving into this field as a beginner, and keeping up with the papers has been no small feat. There may be errors along the way, but if you're new here too, I hope this list saves you some of the confusion I went through.

What is Activation Steering?

A technique for guiding model behavior by directly modifying internal activation values at inference time—typically by intervening in the residual stream.
The term "Activation Engineering" was coined in the paper Steering Language Models With Activation Engineering (arXiv, August 2023).
The central challenge is finding a meaningful steering direction within the LLM's high-dimensional latent space that reliably produces the desired behavioral change.
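
To make the idea concrete, here is a minimal sketch of one common recipe: take the difference of mean activations between two contrasting prompt sets as the steering direction, then add a scaled copy of it to a hidden state. This is a toy illustration under stated assumptions, not the method of any particular paper in this list. The function names (`mean_vector`, `steering_vector`, `steer`), the scaling factor `alpha`, and the 3-dimensional vectors are all hypothetical; in practice the activations would be read from a chosen layer of a real model.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of equally sized activation vectors."""
    return [fmean(column) for column in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Difference of means between activations from contrasting prompts."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden_state, direction, alpha=1.0):
    """Add the scaled steering direction to one residual-stream vector."""
    return [h + alpha * d for h, d in zip(hidden_state, direction)]


# Toy 3-dimensional "activations" (assumed values, not taken from a model).
positive = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]  # prompts showing the behavior
negative = [[0.0, 0.0, 2.0], [0.0, 2.0, 2.0]]  # prompts lacking the behavior

direction = steering_vector(positive, negative)
steered = steer([0.5, 0.5, 0.5], direction, alpha=2.0)
print(direction, steered)
```

The choice of layer and of `alpha` matters a great deal in practice: too small a scale has no visible effect, while too large a scale tends to degrade coherence.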

🤔 Does it really work?

From my own experimentation, I'd lean toward yes. That said, measuring the effectiveness of steering remains a non-trivial challenge, and in some domains, it demonstrably underperforms both prompting and fine-tuning.

Before You Start

The industry is moving fast — Anthropic is actively publishing updates in this space. The following resources are highly recommended:

Transformer Circuits — Anthropic's research blog on mechanistic interpretability. Essential reading for understanding how language models work at a circuit level.
Activation Engineering posts on LessWrong — A curated collection of posts exploring how to steer model behavior by intervening on internal activations.

🏆 Benchmarks

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders ICML 2025 Spotlight
Steer-Bench: A Benchmark for Evaluating the Steerability of LLMs 2025-01 arXiv
AISteer360 2025-01
MIB: A Mechanistic Interpretability Benchmark ICML 2025

🎯 Activation Steering

ActAdd: Steering language models with activation engineering 2023-08 arXiv
ITI: Inference-Time Intervention: Eliciting Truthful Answers from a Language Model NeurIPS 2023 Spotlight
CAA: Steering Llama 2 via Contrastive Activation Addition ACL 2024 🏆 Outstanding Paper
Analyzing the Generalization and Reliability of Steering Vectors 2024-07 arXiv
Reliability Challenges in Steering Language Models 2025-04 arXiv
SAEs Are Good for Steering — If You Select the Right Features EMNLP 2025
Improved Representation Steering for Language Models NeurIPS 2025
HyperSteer: Activation Steering at Scale with Hypernetworks 2025-06 arXiv
Steerable Chatbots: Personalizing LLMs with Preference-Based Activation Steering 2025-05 arXiv
Mitigating Overthinking in Large Reasoning Models via Manifold Steering 2025-05 arXiv
In-Distribution Steering: Balancing Control and Coherence in Language Model Generation 2025-10 arXiv
BILLY: Steering LLMs via Merging Persona Vectors for Creative Generation 2025-10 arXiv
CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering LLMs ACL 2025 Findings
Toward universal steering and monitoring of AI models 2026 Science
Weight Updates as Activation Shifts: A Principled Framework for Steering 2026-02 arXiv
Attention Residuals 2026-03 arXiv
Fine-Grained Activation Steering: Steering Less, Achieving More ICLR 2026
Steer Like the LLM: Activation Steering that Mimics Prompting ICML 2026
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control ICML 2026

🔬 Representation Engineering

RepE: Representation Engineering: A Top-Down Approach to AI Transparency 2023-10 arXiv
Looking inward: Language models can learn about themselves by introspection 2024-10 arXiv
LLM evaluators recognize and factor their own generations 2024-04 arXiv
Inspection and control of self-generated-text recognition ability in Llama3-8B-Instruct ICLR 2025
Tell me about yourself: LLMs are aware of their learned behaviors ICLR 2025
Simple mechanistic explanations for out-of-context reasoning ICML 2025 Workshop
Emergent introspective awareness in large language models 2026-01 arXiv

🎭 Persona & Role-Play in Language Models

Anthropic

Persona Vectors: Monitoring and Controlling Character Traits in Language Models 2025-07 arXiv
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models 2026-01 arXiv

Role-Play

Role Play with large language models Nature 2023-11
RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities 2023-10 arXiv
Character-LLM: A Trainable Agent for Role-Playing EMNLP 2023
Measuring and Controlling Instruction (In)Stability in Language Model Dialogs COLM 2024
From Persona to Personalization: A Survey on Role-Playing Language Agents TMLR 2024

📐 Linear Representations in Language Models

Anthropic Blog Posts

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning 2023-10
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet 2024-05
Circuit Tracing: Revealing Computational Graphs in Language Models 2025-03
On the Biology of a Large Language Model 2025-03
Emotion Concepts and their Function in a Large Language Model 2026-04

SAEs

Sparse Autoencoders Find Highly Interpretable Features in Language Models 2023-09 arXiv
Transcoders Find Interpretable LLM Feature Circuits NeurIPS 2024
Auditing Language Models for Hidden Objectives 2025-03 arXiv

🧠 Personality Modelling in LLMs

PersonaLLM: Investigating the ability of LLMs to express personality traits NAACL 2024 Findings
Personality Traits in LLMs 2025-07 arXiv
Big5-Chat: Shaping LLM personalities through training on human-grounded data 2024-10 arXiv
Personality Vector: Modulating Personality of LLMs by Model Merging EMNLP 2025

Benchmarks

LaMP: When Large Language Models Meet Personalization ACL 2024
LongLaMP: A Benchmark for Personalized Long-form Text Generation 2025-06 arXiv
