Reinforcement Learning from Human Feedback: ChatGPT Model

Dhiraj K
3 min read · Feb 5, 2023

Let us build an intuition for Reinforcement Learning from Human Feedback (RLHF). Imagine you have created a brand new chatbot (myCB) using a deep-learning language model. It is not working perfectly yet, and you want it to learn to talk like a human.

To train it further, you ask the chatbot myCB a few questions. For example:

You: “Who are you?”

myCB: “I am the chatbot.”

Naturally, you would then like to give it some feedback on its response: was it right, or could it be improved?

So next, you might type into the chatbot:

You: “That is not quite right, myCB; humans usually respond like, ‘I’m a chatbot.’”

The myCB chatbot takes this feedback and uses it to update its underlying deep-learning language model. The next time you ask, it responds based on the human feedback it received:

You: “Who are you?”

myCB: “I’m a chatbot.”

This is the simplest example of Reinforcement Learning from Human Feedback.

Before diving deep into Reinforcement Learning from Human Feedback, let us discuss some basic concepts.

What is Reinforcement Learning?

Reinforcement learning is an interaction between a learner and an environment that provides feedback. It is used when sequential decision-making is required and the best behavior is not known in advance, but we can still evaluate whether a given behavior is good or bad.

Reinforcement Learning Terminologies

Agent: The agent is an entity or a learning algorithm that interacts with its surroundings and takes certain actions to maximize rewards. For example, a chatbot could be an agent.

Policy: The policy is the agent’s strategy for maximizing the reward. Given the current situation, it tells the agent which action to take to maximize the total reward. For example, an action in the case of a chatbot is to generate a response.

Reward function: Based on the current state, the agent takes an action, which leads to a new state. The reward function scores this transition: it returns a numeric value that we generally want to maximize. In the case of a chatbot, the reward function’s return value may represent how suitable the chatbot’s response is.
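To make the three terms concrete, here is a minimal sketch in Python. It is only an illustration, not how a real chatbot is trained: the two candidate replies, the `reward_function` that stands in for human feedback, and the epsilon-greedy `Agent` class are all assumptions made up for this example.

```python
import random

# Hypothetical candidate replies the toy chatbot can choose from.
RESPONSES = ["I am the chatbot.", "I'm a chatbot."]


def reward_function(response: str) -> float:
    """Toy stand-in for human feedback: 1.0 for the preferred reply, else 0.0."""
    return 1.0 if response == "I'm a chatbot." else 0.0


class Agent:
    """Keeps a running value estimate per reply and acts epsilon-greedily."""

    def __init__(self, epsilon: float = 0.1, seed: int = 0) -> None:
        self.values = {r: 0.0 for r in RESPONSES}
        self.counts = {r: 0 for r in RESPONSES}
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def policy(self) -> str:
        # Explore occasionally; otherwise pick the reply with the highest estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(RESPONSES)
        return max(RESPONSES, key=lambda r: self.values[r])

    def update(self, response: str, reward: float) -> None:
        # Incremental average of the rewards observed for this reply.
        self.counts[response] += 1
        self.values[response] += (reward - self.values[response]) / self.counts[response]


def train(steps: int = 500) -> Agent:
    """Run the agent/reward loop: act, receive a reward, update the estimates."""
    agent = Agent()
    for _ in range(steps):
        response = agent.policy()
        agent.update(response, reward_function(response))
    return agent


def greedy_response(agent: Agent) -> str:
    """The reply the trained agent currently believes is best."""
    return max(RESPONSES, key=lambda r: agent.values[r])
```

Calling `greedy_response(train())` returns the reply the agent ended up preferring. The agent is never told which reply is correct; it only sees the reward that follows each action, which is the core idea of reinforcement learning.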

Reinforcement Learning from Human Feedback:

Imagine a machine learning algorithm that learns which of two possible behaviors humans prefer, and uses that knowledge to predict what humans want. This is the idea behind Reinforcement Learning from Human Feedback, which is often applied to reinforcement learning problems where the goal is hard to write down as an explicit reward function.

Expert preferences are used to learn an estimated policy return, allowing the agent to do a direct policy search. The expert compares the current demonstration to the prior best one and ranks them accordingly. The expert’s ranking input allows the agent to fine-tune the approximation of the policy return, and the procedure is repeated.
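One common way to turn such comparisons into numeric scores is a Bradley-Terry style model, where the probability that one demonstration is preferred over another depends on the difference of their estimated rewards. The sketch below assumes that model; the demonstration labels "A", "B", "C", the comparison data, and the function name `fit_rewards` are invented for illustration.

```python
import math

# Hypothetical expert comparisons: (preferred_demo, rejected_demo).
PREFERENCES = [
    ("B", "A"), ("B", "A"),
    ("C", "A"), ("C", "A"),
    ("C", "B"), ("C", "B"),
    ("B", "C"),  # experts are not always consistent
]


def fit_rewards(preferences, items=("A", "B", "C"), lr=0.1, epochs=500):
    """Fit one scalar reward per demonstration from pairwise preferences.

    Assumed model: P(winner preferred over loser) = sigmoid(r[winner] - r[loser]).
    We run stochastic gradient ascent on the log-likelihood of the comparisons.
    """
    rewards = {item: 0.0 for item in items}
    for _ in range(epochs):
        for winner, loser in preferences:
            diff = rewards[winner] - rewards[loser]
            p_win = 1.0 / (1.0 + math.exp(-diff))
            grad = 1.0 - p_win  # gradient of log(p_win) w.r.t. r[winner]
            rewards[winner] += lr * grad
            rewards[loser] -= lr * grad
    return rewards
```

With the data above, `fit_rewards(PREFERENCES)` ranks "C" above "B" above "A", even though one comparison disagrees. Only differences between rewards matter in this model, so the absolute values carry no meaning on their own.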

By identifying the reward function that best explains the human assessments, the training eventually creates a model for the task’s objective. It then employs reinforcement learning to figure out how to accomplish it. As its behavior develops, it seeks human input in the areas where it is most unsure about which is preferable, further developing its comprehension of the objective.
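The idea of asking for human input where the model is most unsure can be sketched as follows. This is a simplification under stated assumptions: uncertainty is approximated by how close the predicted preference probability is to 0.5, and the reward values and function names are hypothetical. Practical systems may measure uncertainty differently.

```python
import math
from itertools import combinations
from typing import Dict, Tuple


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Assumed model: probability that a human prefers behavior a over b."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def most_uncertain_pair(rewards: Dict[str, float]) -> Tuple[str, str]:
    """Pick the pair whose predicted preference is closest to a coin flip."""
    return min(
        combinations(sorted(rewards), 2),
        key=lambda pair: abs(
            preference_probability(rewards[pair[0]], rewards[pair[1]]) - 0.5
        ),
    )


# Hypothetical current reward estimates for three behaviors.
estimated_rewards = {"A": -2.0, "B": 0.4, "C": 0.5}
next_query = most_uncertain_pair(estimated_rewards)
```

Here `next_query` is the pair ("B", "C"): their estimated rewards are nearly equal, so a human comparison between them is the most informative one to request next.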

End notes:

It is interesting to note that the effectiveness of the trained Reinforcement Learning from Human Feedback algorithm depends on the human perception of what behaviors are appropriate; thus if the human doesn’t have a firm understanding of the task, they might not provide as much useful input as needed.



Dhiraj K

Data Scientist & Machine Learning Evangelist. I like to mess with data.