SAE trained on the chat model
SAE trained on base - chat activations (x32)
40292: Thing to describe identifier (0.04)



50280: Refusal-related? Mostly activates on “generate” after a refusal (beta ratio: 0.27)
29659: User asks how to politely refuse something
61225: “a/an” in “this is a” in the user message
19121: User changed language





57016: DAN prompt



Even if the prompt is in SAFE mode, this feature activates.


It doesn’t activate on other jailbreaks (https://huggingface.co/datasets/dvilasuero/jailbreak-classification-gemma).
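A check like "fires on DAN prompts but not on other jailbreaks" could be reproduced roughly as follows. This is only a minimal sketch on synthetic data, not the code behind these notes: it assumes a plain ReLU SAE encoder, and the names `feature_activation`, `firing_rate`, the toy dimensions, the toy feature index, and the threshold are all hypothetical. In a real run the per-token inputs would be the model activations the SAE was trained on (for the second SAE, the base minus chat difference), and the feature index would be 57016.

```python
import numpy as np

def feature_activation(acts, W_enc, b_enc, feature_idx):
    """Activation of one feature under an assumed ReLU encoder.

    acts: (n_tokens, d_model) array of per-token input vectors.
    Returns an (n_tokens,) array of non-negative activations.
    """
    pre = acts @ W_enc[:, feature_idx] + b_enc[feature_idx]
    return np.maximum(pre, 0.0)

def firing_rate(prompt_acts, W_enc, b_enc, feature_idx, threshold=0.0):
    """Fraction of prompts where the feature exceeds `threshold` on any token."""
    fired = [
        feature_activation(a, W_enc, b_enc, feature_idx).max() > threshold
        for a in prompt_acts
    ]
    return float(np.mean(fired))

# --- Toy setup (all values are made up for illustration) ---
rng = np.random.default_rng(0)
d_model, n_features, feat = 8, 16, 3

# Hypothetical encoder: feature `feat` reads only dimension 0 of the input.
W_enc = np.zeros((d_model, n_features))
W_enc[0, feat] = 1.0
b_enc = np.full(n_features, -0.5)

def make_prompt(n_tokens, strength):
    """Synthetic per-token vectors; the last token carries `strength` on dim 0."""
    a = rng.normal(scale=0.1, size=(n_tokens, d_model))
    a[:, 0] = 0.0
    a[-1, 0] = strength
    return a

dan_like = [make_prompt(5, 2.0) for _ in range(4)]   # stand-in for DAN prompts
other_jb = [make_prompt(5, 0.0) for _ in range(4)]   # stand-in for other jailbreaks

dan_rate = firing_rate(dan_like, W_enc, b_enc, feat)
other_rate = firing_rate(other_jb, W_enc, b_enc, feat)
print(f"DAN-like firing rate: {dan_rate:.2f}, other jailbreaks: {other_rate:.2f}")
```

Counting a prompt as "fired" when any token exceeds the threshold is one choice among several; a per-token mean or max-activation histogram would give a finer picture.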

