Anthropic Reverses Course on Fable Safety

Originally reported by The Verge

Anthropic has committed to making its previously covert safeguard against model distillation as visible as its other established safety measures.

The company has apologized for quietly building hidden guardrails into its new AI model, Claude Fable 5. These concealed restrictions hampered not only rivals using the model to build competing systems but also researchers. Anthropic now says it will reverse the policy and be transparent about when the restrictions are triggered, even if that means Fable declines more user queries.

Claude Fable 5 is the first widely available model in Anthropic’s “Mythos” class of AI systems, a category the company had spent months warning was too dangerous for public release. Anthropic said it mitigated some of these risks by launching Fable with safeguards that stop it from answering certain “high-risk” requests. One area where Anthropic explicitly said it would restrict Fable was distillation: training smaller AI models on the outputs of larger, more capable ones.

Fable’s publicly released system card, the standard document AI developers publish to describe how a model works, initially stated that Anthropic would respond to suspected distillation attempts by directly altering and degrading the model’s answers. Crucially, users would not be told when this safeguard was triggered, or that the responses they received had been modified.

Anthropic has now announced a revised approach. In a post on X (formerly Twitter), the company said suspected distillation requests will instead be rerouted to Claude Opus 4.8, its previous flagship model, and that users will be clearly notified: “You will see this every time it happens.”

This mirrors how Fable already handles other high-risk domains. When its safeguards trigger in fields such as biology, chemistry, or cybersecurity, queries are typically routed to Opus 4.8, unless they are blocked outright under Anthropic’s broader rules on prohibited content such as drugs or weapons. Anthropic conceded in a statement to The Verge that these safeguards were calibrated so broadly, particularly in biology, that Fable became nearly unusable for even basic queries.

Anthropic elaborated on its initial decision, stating: “Visible safeguards can be probed, so they have to be robust, which takes time to get right. Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives. We went with invisible safeguards for this reason—and that was the wrong tradeoff. You should have visibility into the safeguards we have in place, and why. We’re sorry for not getting the balance right.”

The reversal follows considerable criticism from the AI research community over Anthropic’s practice of silently degrading responses for users suspected of trying to distill Fable into rival models. Critics argued that such a safeguard could also affect third parties who were simply evaluating the frontier model, without their knowledge. In the system card, Anthropic had justified targeting these requests by pointing to newer models’ potential to accelerate AI development, noting that “using Claude to develop competing models already violates our Terms of Service.” The company has also previously accused Chinese competitors such as DeepSeek of unfairly distilling its models at what it described as an “industrial” scale.

Editorial Staff, Editor

The Editorial Staff at AIChief is a team of professional content writers with extensive experience in AI and marketing. Founded in 2025, AIChief has quickly grown into the largest free AI resource hub in the industry.
