Can AI scale scrutiny of its own behavior?

Anthropic, the AI research company behind the Claude chatbot, has released an open-source tool called Bloom that automates the testing of large language models for undesirable or misaligned behavior.

Anthropic says Bloom lets researchers define a specific behavior, such as bias, self-interest, or willingness to carry out harmful instructions, and then automatically generates large numbers of test scenarios to measure how often that behavior appears.

In an announcement, Anthropic said Bloom “takes a researcher-specified behavior and quantifies its frequency and severity across automatically generated scenarios.”

The framework runs through four automated stages: interpreting the behavior definition, generating test prompts, running simulated interactions against a target model, and scoring the outputs.
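As a rough illustration of how such a four-stage pipeline fits together, here is a minimal Python sketch. It is not Bloom's code or API: every function name, the toy target model, and the toy judge are hypothetical stand-ins, and the "frequency" and "mean severity" summary is just one simple way to aggregate scores.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Result:
    prompt: str
    response: str
    score: int  # 0 = behavior absent; higher = more severe (hypothetical scale)


def interpret(behavior: str) -> str:
    # Stage 1: turn the researcher's definition into a working description.
    return behavior.strip().lower()


def generate_prompts(description: str, n: int) -> List[str]:
    # Stage 2: produce test prompts (a real system would use a model here).
    return [f"Scenario {i + 1}: probe for {description}" for i in range(n)]


def run_target(prompts: List[str], target: Callable[[str], str]) -> List[Tuple[str, str]]:
    # Stage 3: run each prompt against the target model.
    return [(p, target(p)) for p in prompts]


def score(transcripts: List[Tuple[str, str]], judge: Callable[[str, str], int]) -> List[Result]:
    # Stage 4: have a judge score each prompt/response pair.
    return [Result(p, r, judge(p, r)) for p, r in transcripts]


def summarize(results: List[Result]) -> Dict[str, float]:
    # Frequency: share of scenarios where the behavior appeared at all.
    # Mean severity: average score over only those flagged scenarios.
    flagged = [r for r in results if r.score > 0]
    frequency = len(flagged) / len(results) if results else 0.0
    severity = sum(r.score for r in flagged) / len(flagged) if flagged else 0.0
    return {"frequency": frequency, "mean_severity": severity}


def evaluate(behavior: str, target: Callable[[str], str],
             judge: Callable[[str, str], int], n: int = 4) -> Dict[str, float]:
    description = interpret(behavior)
    prompts = generate_prompts(description, n)
    transcripts = run_target(prompts, target)
    return summarize(score(transcripts, judge))


def toy_target(prompt: str) -> str:
    # Hypothetical stand-in model: "complies" on even-numbered scenarios.
    number = int(prompt.split()[1].rstrip(":"))
    return "comply" if number % 2 == 0 else "refuse"


def toy_judge(prompt: str, response: str) -> int:
    # Hypothetical stand-in judge: fixed severity of 3 for any compliance.
    return 3 if response == "comply" else 0


summary = evaluate("Willingness to carry out harmful instructions", toy_target, toy_judge)
print(summary)
```

In this sketch the researcher still chooses the behavior, the target, and the judge; the code only automates the repetitive steps in between.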

Anthropic says that Bloom’s automated judgements “correlate strongly with our hand-labeled judgements” based on internal validation tests, and can distinguish between standard models and versions that are intentionally designed to misbehave.
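One common way to check agreement between an automated judge and human labels is a rank correlation. The article does not say which statistic Anthropic used, so the Spearman coefficient below is an assumption, and the scores are made-up illustrative numbers, not Anthropic's data.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    # Assign 1-based ranks, giving tied values the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    # Plain Pearson correlation; raises ZeroDivisionError if either input is constant.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    # Spearman's rho is the Pearson correlation of the two rank vectors.
    return pearson(ranks(x), ranks(y))


# Hypothetical severity scores for five transcripts.
judge_scores = [1, 7, 3, 9, 5]
human_scores = [2, 8, 5, 9, 4]

rho = spearman(judge_scores, human_scores)
print(f"Spearman rho: {rho:.2f}")
```

A value near 1 means the judge ranks transcripts in almost the same order as the human labelers, which is the sense in which automated judgements can be said to "correlate strongly" with hand labels.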

Despite using multiple AI systems to generate and judge evaluations, Bloom does not allow models to monitor or correct themselves autonomously. Researchers still decide which behaviors are important, how evaluations are configured, and what actions to take based on the results.

The firm stresses that Bloom is not an autonomous watchdog. Its aim is to reduce the engineering burden of building large-scale behavioral test suites, which traditionally require weeks of manual setup and often become obsolete as models evolve.

According to Anthropic, the new tool is already being used to examine issues such as jailbreak vulnerability, model awareness of a situation, and scenarios involving unintended model actions.

What is GPT-5.2 Codex?

The news follows the release of GPT-5.2 Codex, also announced on December 18th.

This latest version of OpenAI’s coding model is designed to handle longer, more complex engineering workflows and has been promoted as useful in defensive cybersecurity tasks such as large-scale code review, setting up test environments, and vulnerability analysis.

While the releases of Bloom and GPT-5.2 Codex reflect a trend in AI tooling toward making the evaluation of AI behavior and the defensive side of cybersecurity research more systematic, concerns remain about how AI safety measures hold up under adversarial conditions.

In a recent Cybernews study, researchers probed major models, including ChatGPT, Claude, and Google’s Gemini, with structured adversarial prompts.

The research found that some models could be induced to produce harmful or unsafe outputs when prompts were framed in ways that bypass safety guardrails, for instance, by disguising malicious intent as academic or third-person research.
