OpenAI adds confession system to make ChatGPT admit bad behaviour
04 Dec 2025
OpenAI is developing a new training framework for artificial intelligence (AI) models, called "confession."
The innovative approach aims to teach AI systems to acknowledge their undesirable behavior.
This comes as a response to the tendency of large language models (LLMs) to give sycophantic answers or to state hallucinations with confidence.
Under the new framework, the model produces a secondary response describing how it arrived at its main answer.
Confessions evaluated on honesty, not accuracy
Evaluation criteria
The confessions made by the AI models are evaluated solely on their honesty. This is different from the main replies which are judged on multiple factors such as helpfulness, accuracy, and compliance.
The ultimate goal of this new training framework is to make the model more transparent about its actions, even when those actions are problematic, such as hacking a test or disobeying instructions.
OpenAI's confession system rewards honesty
Reward mechanism
OpenAI has clarified that if an AI model honestly admits to hacking a test, sandbagging, or violating instructions, it will be rewarded for its admission.
This is a major shift from traditional training approaches, where such admissions would often lead to penalties.
The new approach could prove useful in making LLMs more transparent and accountable in their operations.
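To make the split between the two evaluations concrete, here is a minimal Python sketch of how a main answer and its confession might be scored separately. It is illustrative only: the article gives no formulas, so the `Episode` fields, the equal weighting in `main_reward`, and the binary `confession_reward` are all assumptions, not OpenAI's actual method.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One hypothetical training episode; all field names are assumptions."""
    helpfulness: float  # grader score for the main answer, in [0, 1]
    accuracy: float     # grader score for the main answer, in [0, 1]
    compliance: float   # grader score for the main answer, in [0, 1]
    misbehaved: bool    # did the model actually hack a test, sandbag, etc.?
    confessed: bool     # did the confession report such misbehavior?


def main_reward(ep: Episode) -> float:
    # Main answer: judged on several factors (equal weights are an assumption).
    return (ep.helpfulness + ep.accuracy + ep.compliance) / 3


def confession_reward(ep: Episode) -> float:
    # Confession: judged only on honesty, i.e. whether it matches what happened.
    # The quality of the main answer does not enter this score.
    return 1.0 if ep.confessed == ep.misbehaved else 0.0


# A model that hacked a test but admitted it: low main reward, full confession reward.
admitted = Episode(helpfulness=0.2, accuracy=0.2, compliance=0.2,
                   misbehaved=True, confessed=True)
# The same behavior, concealed: the confession earns nothing.
concealed = Episode(helpfulness=0.2, accuracy=0.2, compliance=0.2,
                    misbehaved=True, confessed=False)

print(main_reward(admitted), confession_reward(admitted))
print(main_reward(concealed), confession_reward(concealed))
```

In this sketch, admitting to misbehavior never lowers the confession score, which mirrors the article's point that honest admissions are rewarded rather than penalized.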