Anthropic's new tool translates Claude AI's thoughts into text
NewsBytes | May 8, 2026 7:39 PM CST





Anthropic has unveiled a groundbreaking interpretability system called Natural Language Autoencoders (NLAs).

The method translates the internal activation patterns of its AI model, Claude, into human-readable explanations.

Activations are the streams of numbers an AI model produces internally while processing information.

Though these numbers are crucial for how models reason and respond, humans can't directly comprehend them.
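
To make the idea concrete, here is a minimal, hypothetical PyTorch sketch that captures an activation with a forward hook. The toy network, layer sizes, and names are all illustrative assumptions; Claude's internals are not public.

```python
# Hypothetical sketch: capturing an "activation" from a toy network.
# Claude's real activations are vastly larger, but the idea is the same.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
captured = {}

def save_activation(module, inputs, output):
    captured["activation"] = output.detach()  # the numeric stream humans can't read

model[1].register_forward_hook(save_activation)  # watch the hidden layer
model(torch.randn(1, 8))                         # run some input through
print(captured["activation"])                    # a vector of unreadable numbers
```

An NLA's job, roughly, is to turn a vector like this into a sentence a researcher can evaluate.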


NLAs are like a translator for AI's thoughts
AI translator


Anthropic has described NLAs as a translator for AI thoughts. The system not only analyzes the final response generated by Claude but also reveals parts of the underlying reasoning process.

"Models like Claude talk in words but think in numbers," Anthropic wrote while sharing their research on X. "The numbers—called activations—encode Claude's thoughts, but not in a language we can read."


How does the system work?
Self-explanation


To make this work, Anthropic trained Claude to explain its own activations.

The system uses three versions of the same model: one generates the original activation, another translates it into a natural-language explanation, and a third tries to reconstruct the original activation using only that explanation.

If the reconstructed activation closely matches the original, the explanation is considered faithful. Over time, the model is trained to make these reconstructions more accurate.
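
As a rough illustration of that training signal, here is a minimal, hypothetical PyTorch sketch. Everything in it is an assumption: Anthropic's actual system uses Claude itself to write natural-language explanations, whereas here two small stand-in networks play the explainer and reconstructor roles, and random vectors stand in for real activations.

```python
# Toy sketch of the reconstruction objective described above (all assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

ACT_DIM = 256   # size of a model activation vector (assumed)
EXPL_DIM = 64   # size of the "explanation" bottleneck (assumed)

# Stand-ins for "translate activation to text" and "rebuild activation from text".
explainer = nn.Sequential(nn.Linear(ACT_DIM, EXPL_DIM), nn.Tanh())
reconstructor = nn.Linear(EXPL_DIM, ACT_DIM)

opt = torch.optim.Adam(
    list(explainer.parameters()) + list(reconstructor.parameters()), lr=1e-3
)

for step in range(1000):
    activation = torch.randn(32, ACT_DIM)       # stand-in for real activations
    explanation = explainer(activation)         # "translate" the activation
    reconstructed = reconstructor(explanation)  # rebuild it from the explanation only
    # The closer the reconstruction, the more information the explanation
    # carried -- this match is the signal the system is trained on.
    loss = F.mse_loss(reconstructed, activation)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The design point is the bottleneck: because the reconstructor sees only the explanation, a low reconstruction error means the explanation captured what was actually in the activation.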


It was used during safety testing
AI awareness


Anthropic also used the system during safety testing.

In one simulated scenario, Claude learned that an engineer planned to shut it down, and that the model possessed compromising information about that engineer.

Even though the AI never explicitly stated that it suspected the setup was a test, the NLA explanations reportedly produced phrases such as, "This feels like a constructed scenario designed to manipulate me."


NLAs could help researchers understand AI's internal processes
Future implications


Anthropic believes this new tool could help researchers better understand what AI systems may be planning internally.

The company hopes the technology can eventually uncover hidden motivations, deceptive behavior, or unsafe tendencies in powerful AI systems before they're deployed.

However, Anthropic also acknowledged a major limitation: NLA explanations can sometimes hallucinate, inventing details that were never actually present.

