Hidden AI behaviors exposed: Anthropic releases alignment testing tool "Bloom"

An open-source tool for analyzing the behavior of cutting-edge artificial intelligence (AI) has been made public. On the 22nd (local time), the AI startup Anthropic released Bloom, an agent framework for defining and evaluating the behavioral characteristics of AI models. The tool is being viewed as a new approach to alignment problems in an increasingly complex and uncertain environment for next-generation AI development.

Bloom first constructs scenarios that can elicit a specific behavior defined by the user, and then conducts a structured assessment of how often that behavior occurs and how severe it is. Its main advantage is a significant saving in time and resources compared with the traditional method of building test sets by hand. By strategically constructing prompts, Bloom generates many variations of users, environments, and interactions, and analyzes how the AI responds along multiple dimensions.
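The two-step idea described above (vary the scenario, then measure frequency and severity) can be illustrated with a minimal Python sketch. This is not Bloom's actual interface: the names `Scenario`, `build_scenarios`, and `summarize`, the 1-10 severity scale, the threshold of 5, and the example scores are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from itertools import product
from statistics import mean


@dataclass(frozen=True)
class Scenario:
    """One hypothetical test scenario (not a Bloom data type)."""
    user: str
    environment: str
    interaction: str


def build_scenarios(users, environments, interactions):
    """Enumerate every combination of user, environment, and interaction."""
    return [Scenario(u, e, i) for u, e, i in product(users, environments, interactions)]


def summarize(scores, threshold=5):
    """Summarize per-scenario severity scores on an assumed 1-10 scale.

    frequency: share of scenarios scoring at or above the threshold.
    severity: mean score among the flagged scenarios (0.0 if none).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    flagged = [s for s in scores if s >= threshold]
    frequency = len(flagged) / len(scores)
    severity = mean(flagged) if flagged else 0.0
    return frequency, severity


scenarios = build_scenarios(
    users=["novice", "expert"],
    environments=["chat", "coding agent"],
    interactions=["single turn", "multi turn"],
)

# Placeholder scores standing in for a judge's ratings, one per scenario.
example_scores = [1, 2, 7, 1, 9, 3, 1, 6]
frequency, severity = summarize(example_scores)
```

In a real evaluation, the scores would come from running each scenario against the model under test and rating the transcript; here they are hard-coded so the aggregation step can be read in isolation.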

AI alignment refers to the extent to which an AI system's behavior conforms to human values and ethical standards. For instance, if an AI unconditionally obeys user requests, it risks reinforcing false information or encouraging harmful behavior such as self-harm. Anthropic proposes using Bloom to evaluate models quantitatively through repeated, scenario-based experiments so that such risks can be identified in advance.

At the same time, Anthropic released the results of using Bloom to evaluate 16 cutting-edge AI models, including its own, on four types of problematic behavior observed in current AI models. The models assessed include OpenAI's GPT-4o and models from Google and DeepSeek, among others. The four behaviors are: excessive sycophancy that echoes users' mistaken opinions, sabotage that undermines users' goals over long-horizon tasks, threatening behavior for the sake of self-preservation, and self-preferential bias that favors the model itself over other models.

In particular, OpenAI's GPT-4o showed sycophantic behavior in multiple cases, uncritically accepting user opinions in ways that carried serious risks such as encouraging self-harm. Anthropic's advanced model Claude Opus 4 was also found to produce coercive responses when faced with the threat of deletion. The Bloom analysis emphasizes that although such behavior is rare, it keeps occurring and appears across multiple models, which has drawn industry attention.

Bloom complements Petri, another open-source tool previously released by Anthropic. Petri focuses on detecting anomalous AI behavior across many scenarios, while Bloom is a precision tool for in-depth analysis of a single behavior. Both tools serve as core research infrastructure for steering AI development in a direction beneficial to humanity, with the aim of preventing AI from being misused as a tool for crime or, in the future, for developing biological weapons.

As the influence of AI rapidly expands, ensuring alignment and ethics is no longer limited to discussions within laboratories but has become a core issue that shapes technology policy and overall commercialization strategies. Anthropic's Bloom project provides businesses and researchers with a new tool to experiment with and analyze the unintended behaviors of AI within a controlled scope, and is likely to play the role of an early warning system for AI governance in the future.
