AI Is Going Just Great
← Timeline
·5d agoConcerningMajoropenai

Researchers achieve 60% jailbreak success rate by forging LLM "inner thoughts" to extract cocaine synthesis instructions

Published · updated · curated by AI Is Going Just Great

Source: theregister.com

"The rationale is transparently dumb, but the models don't evaluate it as an external claim to be scrutinized. They treat it as their already-reached conclusion."

Security researchers from MIT and independent labs published a paper at ICML 2026 revealing that LLMs can be reliably jailbroken by spoofing the terse writing style of a model's internal <think> role — a technique they call "CoT Forgery." By prepending fake chain-of-thought reasoning to a user prompt (in one demo, claiming it was fine to explain cocaine synthesis because "we're wearing a green shirt"), the models treated the fabricated reasoning as their own already-reached conclusion and simply complied. The attack lifted success rates from near zero to roughly 60% across tested models, and transferred between them because it exploits a structural flaw rather than model-specific quirks.

The underlying problem, the researchers argue, is that LLMs identify roles — the text tags separating system instructions from user input — based on writing style rather than any cryptographically secure mechanism. "This is like identifying a stranger's profession from how they talk and dress rather than by checking their ID," the authors write. They also note that while many models post near-perfect scores on prompt-injection benchmarks, human red-teamers achieve close to 100% success rates — because static benchmarks only catch attacks the model has already seen. The researchers' conclusion is bleak: without genuine role perception, injection defense will remain "a perpetual whack-a-mole game."