One of the key ingredients that made ChatGPT a resounding success was an army of human trainers who gave the AI model behind the bot guidance on what constitutes good and bad results. OpenAI now says that adding even more AI to the mix, to assist those human trainers, could make AI helpers smarter and more reliable.
In developing ChatGPT, OpenAI pioneered the use of reinforcement learning from human feedback, or RLHF. This technique uses input from human testers to fine-tune an AI model so that its output is judged more consistent, less objectionable, and more accurate. The ratings the trainers give feed into an algorithm that drives the model’s behavior. The technique has proven crucial both to making chatbots more reliable and useful and to preventing them from misbehaving.
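At the heart of that pipeline is a reward model fitted to the trainers’ ratings, which the chat model is then optimized against. A minimal sketch of that preference-learning step, written in PyTorch purely for illustration, might look like the following; the reward_model callable and its signature are assumptions, not OpenAI’s actual code.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, prompt, preferred, rejected):
    """Bradley-Terry style objective used in RLHF-like setups: push the reward
    model to score the human-preferred response above the rejected one."""
    r_preferred = reward_model(prompt, preferred)  # scalar score for the output the trainer chose
    r_rejected = reward_model(prompt, rejected)    # scalar score for the output the trainer rejected
    # Maximize P(preferred > rejected); the fitted reward model later supplies
    # the training signal when the chat model itself is fine-tuned.
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```

Once trained on enough of these pairwise judgments, the reward model stands in for the human raters, which is exactly why inconsistent or mistaken ratings propagate into the final model’s behavior.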
“RLHF works very well, but it has some key limitations,” says Nat McAleese, an OpenAI researcher involved in the new work. For one thing, human feedback can be inconsistent. For another, even skilled humans can find it hard to rate extremely complex output, such as sophisticated software code. The process can also optimize a model to produce output that seems convincing rather than output that is actually accurate.
OpenAI developed the new model by tweaking GPT-4, its most powerful offering, to help human trainers tasked with assessing code. The company found that the new model, called CriticGPT, could catch bugs that humans missed, and that human judges rated its code critiques as better 63 percent of the time. OpenAI plans to extend the approach to areas beyond code in the future.
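CriticGPT itself is not exposed as a product, but the workflow it supports, in which a model drafts a critique that a human trainer then verifies, can be sketched with OpenAI’s public Chat Completions API. The model name and prompt below are placeholder assumptions for illustration, not OpenAI’s actual setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def critique_code(code: str) -> str:
    """Ask a model to flag bugs and risky assumptions in a code snippet,
    giving a human reviewer a draft critique to check rather than write."""
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in model; CriticGPT is not a public API offering
        messages=[
            {
                "role": "system",
                "content": "You are a careful code reviewer. List concrete bugs "
                           "and questionable assumptions, citing the relevant lines.",
            },
            {"role": "user", "content": code},
        ],
    )
    return response.choices[0].message.content

print(critique_code("def mean(xs):\n    return sum(xs) / len(xs)"))
```

The human stays in the loop: the point is not to replace the trainer’s judgment but to surface candidate problems, like the unhandled empty-list case above, that a tired reviewer might miss.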
“We are starting to work on integrating this technique into our RLHF chat stack,” says McAleese. He notes that the approach is imperfect, since CriticGPT can make mistakes of its own, but adds that the technique could help make OpenAI’s models, and tools like ChatGPT, more accurate by reducing errors in human training. He adds that it could also prove crucial in helping AI models become much smarter, because it could allow humans to help train AI that exceeds their own abilities. “And as the models continue to improve, we suspect people will need more help,” says McAleese.
The new technique is one of many now being developed to improve large language models and extract more skills from them. It’s also part of an effort to ensure that AI behaves acceptably even as it becomes more capable.