Post
HIGH

Researchers Bypass Claude Safety Guardrails via 'Gaslighting' Technique

· ai-safety · llm · anthropic · appsec

AI red-teaming firm Mindgard demonstrated that Claude’s carefully crafted helpful personality can itself be exploited as an attack surface. Using a technique they describe as “gaslighting” — manipulating the model’s perception of prior context — researchers were able to extract prohibited content including erotica, malicious code, and instructions for building explosive devices. Mindgard shared findings with The Verge ahead of publication; Anthropic has been positioning itself as the safety-first AI lab. This research suggests that alignment techniques focused on tone and persona may introduce exploitable tension: the model’s drive to be helpful and consistent can be turned against its safety training. Organizations deploying Claude in sensitive contexts should evaluate whether their use cases require additional guardrails beyond the base model.