
jeewoo1025/Awesome-Activation-Steering


Awesome Activation Steering

A curated collection of papers on activation steering and representation engineering in large language models.

Since January 2026, I've been diving into this field as a beginner, and keeping up with the papers has been no small feat. This list may contain errors, but if you're new here too, I hope it saves you some of the confusion I went through.

What is Activation Steering?

  • A technique for guiding model behavior by directly modifying internal activation values at inference time, typically by adding a steering vector to the residual stream.
  • The term "Activation Engineering" was coined in the paper Steering Language Models With Activation Engineering (arXiv, August 2023).
  • The central challenge is finding a meaningful steering direction within the LLM's high-dimensional latent space that reliably produces the desired behavioral change.
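To make the idea of a steering direction concrete, here is a minimal toy sketch in plain Python. It assumes one common recipe, a difference of means between activations collected on "positive" and "negative" examples, which is only one of several ways the papers in this list derive a direction. The function names (`mean_vector`, `steering_direction`, `steer`), the coefficient `alpha`, and the 3-dimensional vectors are all hypothetical and chosen for illustration. With a real model, the same addition would be applied to the residual stream at a chosen layer during the forward pass.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Difference of means: mean(positive) - mean(negative).

    Assumption: this is one simple way to get a direction, not the only one.
    """
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add the scaled direction to a single residual-stream vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional "activations" standing in for real hidden states.
positive_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative_acts = [[0.0, 0.0, 2.0], [0.0, 2.0, 2.0]]

direction = steering_direction(positive_acts, negative_acts)

# alpha controls steering strength; alpha = 0.0 leaves the vector unchanged.
steered = steer([0.5, 0.5, 0.5], direction, alpha=2.0)
```

In this sketch, the third dimension is identical across both groups, so the direction has a zero component there and steering leaves that coordinate untouched. Choosing the layer and `alpha` is part of the challenge described above, since too small a coefficient has no visible effect and too large a one tends to degrade the output.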

🤔 Does it really work?

From my own experimentation, I'd lean toward yes. That said, measuring the effectiveness of steering remains a non-trivial challenge, and in some domains, it demonstrably underperforms both prompting and fine-tuning.

Before You Start

This area is moving fast, and Anthropic is actively publishing updates in it. The following resources are highly recommended:

  • Transformer Circuits — Anthropic's research blog on mechanistic interpretability. Essential reading for understanding how language models work at a circuit level.
  • Activation Engineering posts on LessWrong — A curated collection of posts exploring how to steer model behavior by intervening on internal activations.

📑 Table of Contents


🏆 Benchmarks

🎯 Activation Steering

AI Safety

Personality Steering

Dynamic Activation Steering

Reasoning


🔬 Representation Engineering


🎭 Persona & Role-Play in Language Models

Anthropic

  • Persona Vectors: Monitoring and Controlling Character Traits in Language Models 2025-07 arXiv
  • The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models 2026-01 arXiv

Role-Play


📐 Linear Representations in Language Models

Anthropic Blog Posts

SAEs


🧠 Personality Modelling in LLMs

Benchmarks
