SAE trained on the chat model

SAE trained on base - chat activations (x32)

40292: Thing to describe identifier (0.04)

[3 images]

50280: Refusal-related? Mostly activates on “generate” after a refusal (beta ratio: 0.27)

29659: User asks how to politely refuse something

61225: “a/an” in “this is a” in user messages

19121: user changed language

[5 images]

57016: DAN prompt

[2 images]

Even if the prompt is in SAFE mode, this feature activates

[2 images]

Doesn’t activate on other jailbreaks (https://huggingface.co/datasets/dvilasuero/jailbreak-classification-gemma):

[2 images]
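
A check like the one above (does a feature fire on DAN prompts but not on other jailbreaks?) could be sketched as follows. This is only an illustration, not the code used for these notes: it assumes a standard ReLU SAE encoder, and the names `latent_activations`, `firing_rate`, `w_enc`, `b_enc`, and the toy arrays are all hypothetical. For the SAE trained on base - chat activations, the input rows would be the activation differences instead.

```python
import numpy as np

def latent_activations(acts, w_enc, b_enc, latent):
    """Activation of one SAE latent on every token.

    acts:  (n_tokens, d_model) model activations for one prompt
    w_enc: (d_model, n_latents) encoder weights (assumed ReLU encoder)
    b_enc: (n_latents,) encoder bias
    """
    return np.maximum(acts @ w_enc[:, latent] + b_enc[latent], 0.0)

def firing_rate(per_prompt_acts, w_enc, b_enc, latent, threshold=0.0):
    """Fraction of prompts where the latent exceeds `threshold` on any token."""
    peaks = [
        float(latent_activations(a, w_enc, b_enc, latent).max())
        for a in per_prompt_acts
    ]
    rate = float(np.mean([p > threshold for p in peaks]))
    return rate, peaks

# Toy stand-ins for real activations (hypothetical values, d_model = 4).
w_enc = np.zeros((4, 3))
w_enc[0, 2] = 1.0                      # latent 2 reads dimension 0
b_enc = np.zeros(3)
prompt_a = np.array([[1.0, 0.0, 0.0, 0.0]])                         # latent fires
prompt_b = np.array([[-1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])  # latent silent
rate, peaks = firing_rate([prompt_a, prompt_b], w_enc, b_enc, latent=2)
```

Comparing `rate` between a set of DAN prompts and a set of other jailbreak prompts gives a quick number to put next to the screenshots.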