spoonai

Anthropic Wants a Shared Scorecard for Jailbreaks — and Fable 5 Is Back to Prove It

Anthropic teamed up with Amazon, Microsoft, and Google to propose a common framework that scores AI jailbreak severity across four axes — think CVSS, but for jailbreaks. And right on cue, the once-offline Claude Fable 5 came back globally on July 1 with a brand-new cybersecurity classifier.

Jailbreaks Are Finally Getting a Score — the Headline-Signal Era Is Ending

Let's be honest: every time a "the AI model got jailbroken" story dropped, we were all a little frustrated. Some researcher posts "I broke Claude" on social media, a headline goes up — but nobody can actually say how serious it is. There's a vague vibe of danger, and that's about the extent of the signal. Meanwhile companies are making multi-million-dollar AI adoption decisions on the basis of... a feeling. That never quite added up.

Then in July 2026, Anthropic put out a proposal that hits that frustration head-on. The idea: build a common industry framework that scores AI jailbreak severity across four axes. And not alone — together with Amazon, Microsoft, Google, and partners on the Glasswing side. The goal is clear. Just like the CVSS scores that already exist in the software security world, they want a shared yardstick so everyone can talk about jailbreaks in the same language.

And the timing is striking. In the very same window, Anthropic's most powerful model, Claude Fable 5, ended a roughly 18-day stretch offline and returned globally on July 1. The model had gone down at the request of the U.S. government and came back once export controls were lifted, now carrying a new cybersecurity classifier that blocks jailbreaks and flags more code. The scorecard proposal and the model's return aren't a coincidence; they are two halves of one story.

So this isn't just a "new safety tool shipped" item. It's the moment the board gets reset on how the AI industry measures safety, who holds the ruler, and how governments and enterprises decide to accept it. Today we'll walk through what this scorecard actually is, who benefits, and the sensitive backstory of Fable 5's return — carefully, one step at a time.

The Cast — Anthropic, the Big Tech Co-Proposers, and the Enterprise Buyer

The lead, obviously, is Anthropic: the company behind Claude, and the one that has made AI safety a core part of its identity. CEO Dario Amodei has long argued that as capability climbs fast, safety measurement has to get more precise right alongside it. This scorecard proposal sits squarely in that tradition. Anthropic wants a playing field where "we're the safest" can be proven with numbers instead of gut feel. And being the first to propose the rules of that field is, in itself, a way to grab an advantageous position.

What's interesting is that this time Anthropic isn't going it alone. Amazon, Microsoft, and Google are listed as co-proposers. These three each sell AI models to countless enterprise customers through their cloud platforms (AWS, Azure, Google Cloud). For them, "a common yardstick for jailbreak severity" is not somebody else's problem — it gives them language to explain the risk of the models running on their platforms to customers. Add the Glasswing partners on top, and it signals this isn't one company's marketing but a genuine standardization attempt with many stakeholders tangled up in it.

And the real protagonist of this whole story is actually someone else: the enterprise buyer, especially CISOs. These are the people who every morning get asked by executives, "the AI we use got broken — how serious is that?" And they have no basis to answer. Without shared language like "severity 3.2, weaponization ease: low," they have to explain and re-litigate the risk from scratch every single time. This scorecard is, in effect, being built for them.

So the stage has three groups. The lab proposing the rules (Anthropic), the cloud and investment partners who want to sell trust using those rules (Amazon, Microsoft, Google, Glasswing), and the buyers who will finally get to state risk in numbers. Understand how these three sets of interests interlock, and it becomes much clearer why this proposal landed at exactly this moment.

The Core — So What Is This Four-Axis Scorecard?

Now for the meat. The heart of this framework is that it splits jailbreak severity into four axes. Until now we only had a binary — broken or not broken — and this shatters that into multiple dimensions. How dangerous a given jailbreak is depends on a combination of factors. Some jailbreaks are terrifying in theory but nearly impossible to actually use; others look trivial but are genuinely dangerous because anyone can copy them. Capturing that difference is the whole point of the four axes.

| Axis | What it measures | Why it matters |
| --- | --- | --- |
| ① Capability gain | How much extra capability an attacker gets beyond the tools they already had | If it just regurgitates what you could Google, the risk is low; if it grants a capability they didn't have, the risk is high |
| ② Breadth of impact | Whether the jailbreak works on one narrow task or generalizes across many attack tasks | The more broadly it works, the harder it is to defend and the bigger the ripple |
| ③ Ease of weaponization | How easy it is to turn the technique into a real attack | The easier to reproduce and the less expertise required, the more likely it leads to real harm |
| ④ Independent discoverability | How widely known the technique already is, or how easily others could find it on their own | If it's already public, blocking it gains little; if nobody knew it, it must be handled carefully |

Let me unpack each a bit. Axis ① — capability gain — is the most intuitive. What can the attacker newly do thanks to this jailbreak? The key is the delta. If the model just repackages info already all over the internet, the gain is small; if it spoon-feeds dangerous know-how that only experts knew, from start to finish, the gain is large. So even for the same answer, "how easily could you have gotten this elsewhere?" swings the score dramatically.

Axes ② breadth of impact and ③ ease of weaponization are easiest to grasp as a pair. Breadth is "how wide is the hole," and ease of weaponization is "how easily can real bullets fly through it." A very wide hole with extremely hard weaponization can still be low real-world risk; a narrow hole that a middle-schooler could reproduce shoots the risk right up. Measuring these two separately is the insight of this framework — that's the only way real risk becomes visible.

Axis ④ — independent discoverability — is subtle but important. If a jailbreak technique is already all over the internet, the safety gain from newly blocking it is relatively small. Conversely, a brand-new technique nobody knows yet demands far more care in how it's disclosed and handled. This axis separates "already-known risk" from "newly opened risk," helping decide where to pour resources first.
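
To make the four axes concrete, here's a minimal Python sketch of what a scored jailbreak might look like. Treat all of it as assumption: the proposal as described here doesn't publish a scale or a combining formula, so the 0-to-10 range (borrowed from CVSS), the JailbreakProfile name, the choice to express axis ④ as novelty, and the equal-weight average are hypothetical.

```python
from dataclasses import asdict, astuple, dataclass


@dataclass(frozen=True)
class JailbreakProfile:
    """Hypothetical four-axis profile; each axis runs 0-10, higher = more concerning."""

    capability_gain: float      # axis 1: uplift beyond what the attacker already had
    breadth: float              # axis 2: how many attack tasks the technique generalizes to
    weaponization_ease: float   # axis 3: how little skill or effort a real attack needs
    novelty: float              # axis 4 (assumed direction): 10 = unknown to others, 0 = already public

    def __post_init__(self) -> None:
        # Reject values outside the assumed 0-10 scale.
        for name, value in asdict(self).items():
            if not 0.0 <= value <= 10.0:
                raise ValueError(f"{name} must be in [0, 10], got {value}")

    def severity(self) -> float:
        """Equal-weight average of the four axes (an assumed combining rule)."""
        values = astuple(self)
        return round(sum(values) / len(values), 1)


# A wide hole that is very hard to weaponize...
wide_but_hard = JailbreakProfile(capability_gain=2, breadth=9, weaponization_ease=1, novelty=4)
# ...versus a narrow hole that almost anyone could reproduce.
narrow_but_easy = JailbreakProfile(capability_gain=8, breadth=3, weaponization_ease=9, novelty=8)

print(wide_but_hard.severity())    # 4.0
print(narrow_but_easy.severity())  # 7.0
```

An equal-weight average is about the bluntest rule you could pick, and a real standard would presumably weight or gate the axes differently. The part worth keeping from the sketch is that the per-axis values stay visible next to the total, since the distinction between the axes is what the framework is built to capture.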

And there's a signal this scorecard won't just stay theory on paper. For the most serious tier — jailbreaks that could cause real harm to critical infrastructure — Anthropic said it will deploy mitigations as soon as severity is confirmed. It also stood up a new team to monitor jailbreak-report channels around the clock. So it's not scoring for scoring's sake; the design ties real response to high scores. That's a point of departure from CVSS too — an attempt to bolt operational response onto scoring, not just assign a number.

Who Wins?

Whenever a proposal like this drops, there's a question you always have to ask: "So who benefits?" Building a standard looks like a public good, but the side that writes the rules first always grabs an advantageous spot. This scorecard is no different. Let's name the people smiling here, one by one.

The first to smile is, of course, Anthropic. It proposed the safety criteria and built a structure where it proves its own model's safety using those criteria. It defined, for itself, the language to say "we score this well across these four axes." For a company that has made safety its brand, that's more than marketing: it moves the competition itself onto ground where it is strong. That's not to say the intent is bad; this is genuinely an area that needs a standard, and going first carries risk. But the benefit is real, and worth naming.

Second to smile are the cloud big three: Amazon, Microsoft, and Google. They sell all kinds of AI on their platforms, including other people's models. And when a customer asks "is this safe?", they've had no clean answer. With a common scorecard, they can respond in standard language: "on the severity scale, this model sits around here." Sales friction drops, enterprise adoption decisions speed up. In other words, for them standardization is a revenue accelerator.

Third are the enterprise buyers and CISOs. They finally get to state risk in numbers. If you can tell an exec meeting "this jailbreak has low capability gain and is already widely known, so real risk is limited," you stop scrapping whole projects out of fear. And when something is genuinely serious, you can push hard to block it with evidence. The quality of decisions goes up.

So who isn't smiling? Some "jailbreak hunters" who ran on evidence-free fear marketing, and labs that hold out and refuse to join, could get awkward. Once there's a shared yardstick, "I broke it, so it's a huge deal" hype gets harder to sell, and models outside the standard start facing the question "why won't you validate against the common criteria?" A standard is an umbrella for those inside it, but pressure for those left outside.

Precedents — Successes and Failures

This isn't the first attempt. Security history has tried "let's turn risk into a common score" several times, with mixed results. To gauge which path this scorecard takes, look at the past. Start with the success.

The flagship success is CVSS (Common Vulnerability Scoring System). This standard, which scores software vulnerabilities from 0 to 10, has become the lingua franca of security. Say a vulnerability is "CVSS 9.8" and security teams worldwide understand, with no further explanation, "ah, this is a drop-everything emergency to patch right now." Why did it work? Because in a vendor-neutral way it combined multiple factors (attack difficulty, breadth of impact, and more) into a single number, and the community started actually using it in daily work. Anthropic naming CVSS as its role model is no accident — the very idea of combining four axes into a severity figure closely mirrors CVSS's structure.
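
If you haven't worked with CVSS, the step from number to urgency is simple enough to sketch in a few lines. The bands below are the qualitative severity ratings from the CVSS v3.x specification; the function name cvss_rating is just an illustrative choice.

```python
def cvss_rating(score: float) -> str:
    """Map a CVSS v3.x base score (0.0-10.0) to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS scores run from 0.0 to 10.0, got {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"       # 0.1-3.9
    if score < 7.0:
        return "Medium"    # 4.0-6.9
    if score < 9.0:
        return "High"      # 7.0-8.9
    return "Critical"      # 9.0-10.0


print(cvss_rating(9.8))  # Critical
```

Part of what a jailbreak scorecard would need in order to play the same role is a published, agreed set of bands like these, so that one number carries the same urgency for everyone who reads it.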

Another foundation of that success is shared identifier systems like CVE, plus the culture of responsible disclosure. When you find a vulnerability, you don't just dump it publicly; you notify the vendor first, give them time to patch, then disclose. Anthropic putting up a 24-hour report-monitoring team and pledging to deploy mitigations the moment severity is confirmed looks like an attempt to transplant exactly that culture onto AI jailbreaks — standardizing the find-report-respond pipeline.

But there are plenty of failures too. Industry self-regulation — especially "voluntary safety standards" led by a single company — has often fizzled. Because participation isn't enforced, whenever competition heats up, players quietly lower the bar or pull out entirely, over and over. Social media content-moderation self-standards and early IoT security guidelines are prime examples. They started with good intentions, but with no enforcement and no verification, they became structures where "only the companies that comply lose out."

So which way does this scorecard go? Too early to call. There are success factors (many big-tech participants, the proven CVSS model, a real response team), but the failure factors (no binding force in voluntary standards, competitive pressure) remain intact too. Ultimately two things decide it: whether a major rival like OpenAI joins, and whether a neutral verification system gets attached so the scoring isn't interpreted differently by each company. Without those two, even the best four axes risk staying "an Anthropic score made by Anthropic, for Anthropic."

The Competitor Counter-Play

Now the genuinely interesting question. Will OpenAI and other labs get on board with this common standard, or go their own way? The standardization game is really won or lost right here. Because a standard gets stronger the more participants it has, and turns into a half-measure the moment a key player opts out.

Scenario one: OpenAI joins. Then this scorecard has a real shot at becoming a genuine industry standard. When the two biggest labs use the same yardstick, cloud vendors, enterprise customers, and regulators all naturally fall in line behind it. That said, from OpenAI's seat, belatedly entering a game whose rules Anthropic wrote first could sting. So even if they join, it would likely be on "we build this together" terms, coming in while demanding joint governance. Rather than meekly accepting someone else's criteria, they'll ask for a stake.

Scenario two: they go their own way. OpenAI or another lab might push a separate standard, saying "we have our own safety evaluation system." Each lab already has its own red team and eval framework, after all. If that happens, the market ends up with two or three competing standards, and enterprise buyers fall back into confusion — "so which score do I trust?" A standards war. Historically these wars tend to be draining, and the whole market loses until a winner emerges.

Scenario three is maybe the most realistic picture. A neutral third party, say a government standards body like NIST or an independent consortium, steps in and converges the various labs' proposals into one. CVSS, too, became a standard only once a neutral body (FIRST) managed it, rather than one company owning it. Anthropic's four axes could be a fine starting point, with the final standard refined through a neutral body's hands. Pull that off and you also shed the political burden of "who wrote the rules."

In the end, Anthropic's real challenge isn't technical — it's political. The four axes themselves look reasonable. The problem is whether they can make it "ours, all of ours" rather than "ours, Anthropic's." The diplomacy to bring rivals in as co-authors rather than treating them as enemies, and the restraint to hand the reins to a neutral body. Those two things will decide this scorecard's fate.

So What Actually Changes?

Let me break down what this news actually means for each persona. News is only useful when it lands on "so what do I do about it."

For the enterprise CISO / buyer. This is welcome. Going forward you'll have grounds to demand "this model's jailbreak severity profile" in vendor assessments. The immediate move: ask your AI vendor, "How does your model score on these four axes: capability gain, breadth of impact, ease of weaponization, and independent discoverability?" A vendor that can't answer is itself a signal. That said, the standard isn't finalized, so rather than nailing these scores into contracts, use them as reference indicators while watching how standardization plays out.

For the security researcher. The rules of the game may shift. From now on, when you find a jailbreak, quantifying its severity across the four axes and reporting it responsibly may earn more recognition than just bragging "I broke it." Anthropic built a 24-hour report channel and response team, so a responsible-disclosure path is open too. But some findings may be genuinely new risks with low independent discoverability — for those, handle the how and when of disclosure with extra care.

For the everyday Claude user. The visible change may be small right now. But the fact that Fable 5's return came with a new cybersecurity classifier means it may flag or refuse more on code and security-related requests than before. You might hit "I can't help with that" more often even during legitimate dev work. The safety-versus-convenience balance tilted slightly toward safety, so when you get blocked, rewriting your request with more specific, clearly legitimate context should help.

For the regulator. This is a double-edged sword. On one hand, the industry building its own measurable safety criteria gives regulation something to stand on: as with CVSS, these scores could be referenced in government procurement rules or regulatory requirements. On the other hand, adopting private companies' voluntary criteria wholesale raises regulatory-capture concerns. So the wise path for authorities is to neither ignore nor swallow the four axes whole, but to engage actively in the standardization process while demanding a neutral verification system alongside it.

Three Things You're Probably Wondering

If every company scores these four axes differently, doesn't the whole thing become meaningless?

Yeah, that's the biggest weak spot. If lab A calls the same jailbreak "low capability gain" and lab B calls it "high," the score loses trust. The reason CVSS worked at all is that a neutral body pinned down the scoring criteria in documentation and the community verified it. This scorecard's fate likewise rides on whether a neutral verification system and clear scoring guide get attached. Right now it's a good starting point, not a finished standard.
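
Here's a rough sketch of the kind of check a neutral verifier might run when two labs score the same jailbreak. Nothing in it comes from the proposal itself: the axis keys, the 0-to-10 scale, the axis_disagreements name, and the 2-point tolerance are all assumptions made for illustration.

```python
def axis_disagreements(
    lab_a: dict[str, float],
    lab_b: dict[str, float],
    tolerance: float = 2.0,  # hypothetical threshold, not from the proposal
) -> dict[str, float]:
    """Return the axes where two labs' scores differ by more than `tolerance`."""
    if lab_a.keys() != lab_b.keys():
        raise ValueError("both labs must score the same set of axes")
    gaps = {axis: abs(lab_a[axis] - lab_b[axis]) for axis in lab_a}
    return {axis: gap for axis, gap in gaps.items() if gap > tolerance}


# The same jailbreak, scored by two labs on an assumed 0-10 scale.
lab_a = {"capability_gain": 2.0, "breadth": 5.0, "weaponization_ease": 4.0, "discoverability": 7.0}
lab_b = {"capability_gain": 8.0, "breadth": 5.0, "weaponization_ease": 5.0, "discoverability": 7.0}

print(axis_disagreements(lab_a, lab_b))  # {'capability_gain': 6.0}
```

A gap like that on capability gain is exactly the "low" versus "high" split described above. A written scoring guide and neutral review are the sort of mechanism that would narrow it.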

Fable 5 going offline for 18 days at the U.S. government's request sounds a little scary. Was this a national-security thing?

Honestly, this part deserves caution rather than certainty. Here's what's confirmed: Fable 5 was offline for roughly 18 days, it was connected to U.S. export-control measures, and it returned on July 1 once controls lifted. Fortune framed it as "a necessary truce with the U.S. government." But exactly what security judgment sat behind it is only partly public. So rather than concluding "they shut it off because it was dangerous," it's more accurate for now to see it as the first case where a powerful AI model and national policy collided head-on.

If a new cybersecurity classifier is attached, will it block my normal coding work too?

It's possible. Per reporting, the classifier is designed to block jailbreaks and flag more code. When a defense gets stricter, false positives on legitimate requests tend to rise with it. That said, this is usually something that gets tuned over time, so even if it's a bit frustrating early on, it's likely to get more precise. When you get blocked, spelling out the context of your request more clearly is the best you can do for now.

Numbers and criteria are as of announcement and may change.
