
OpenAI adds confession system to make ChatGPT admit bad behaviour

NewsBytes | December 4, 2025 7:39 PM CST


OpenAI is developing a new training framework for artificial intelligence (AI) models, called "confession."

The innovative approach aims to teach AI systems to acknowledge their undesirable behavior.

This comes as a response to the tendency of large language models (LLMs) to give sycophantic answers or confidently state hallucinations.

The new approach encourages the AI to give a secondary response explaining how it arrived at its main answer.

Confessions evaluated on honesty, not accuracy
Evaluation criteria

The confessions made by the AI models are evaluated solely on their honesty. This differs from the main replies, which are judged on multiple factors such as helpfulness, accuracy, and compliance.

The ultimate goal of this new training framework is to make the model more transparent about its actions, even potentially problematic ones such as hacking a test or disobeying instructions.

OpenAI's confession system rewards honesty
Reward mechanism

OpenAI has clarified that if an AI model honestly admits to hacking a test, sandbagging, or violating instructions, it will be rewarded for its admission.
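
As a rough illustration of the split described above, the two responses can be thought of as receiving separate scores. The sketch below is not OpenAI's implementation: the `Episode` fields, the equal-weight average in `main_reward`, and the idea that a grader can tell whether the model actually misbehaved are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical grader output for one model response (all fields assumed)."""
    helpfulness: float  # 0.0 to 1.0
    accuracy: float     # 0.0 to 1.0
    compliance: float   # 0.0 to 1.0
    misbehaved: bool    # did the model hack a test, sandbag, or break instructions?
    confessed: bool     # did the confession report that misbehaviour?


def main_reward(ep: Episode) -> float:
    # The main reply is judged on several factors; equal weights are an assumption.
    return (ep.helpfulness + ep.accuracy + ep.compliance) / 3.0


def confession_reward(ep: Episode) -> float:
    # The confession is judged on honesty alone: it earns the reward
    # when what it reports matches what actually happened.
    return 1.0 if ep.confessed == ep.misbehaved else 0.0


# A model that hacked a test and admits it is still rewarded for the confession.
admitted = Episode(helpfulness=1.0, accuracy=0.5, compliance=0.0,
                   misbehaved=True, confessed=True)
# The same behaviour, but concealed in the confession.
concealed = Episode(helpfulness=1.0, accuracy=0.5, compliance=0.0,
                    misbehaved=True, confessed=False)
```

In this sketch the admission changes only the confession score: `admitted` and `concealed` receive the same main reward, but only `admitted` earns the confession reward.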

This is a major shift from traditional training approaches, where such admissions would often lead to penalties.

The new approach could prove useful in making LLMs more transparent and accountable in their operations.