OpenAI releases CoT monitoring to stop malicious behavior of large models

OpenAI has released the latest research. Using CoT (thinking chain) monitoring, it can prevent malicious behaviors such as big models talking nonsense and hiding their true intentions. It is also one of the effective tools for monitoring super models. OpenAI uses the newly released cutting-edge model o3-mini as the monitored object, and uses the weaker GPT-4o model as the monitor. The testing environment is a coding task, requiring AI to implement functions in the code base to pass the unit test. The results show that the CoT monitor performs well when detecting systematic "reward hacking" behavior, with a recall rate of 95%, far exceeding the 60% of monitoring behavior alone.