Basics of Reinforcement Learning (2/2)

The very basic idea of Reinforcement Learning is an intelligent agent learning from its interaction with the environment. An example that works well to grasp this idea is how we all learned language by using it during childhood, before starting school: one day we pointed at something that had the same shape as what we thought was a dog, someone told us it was a cat, and we learned it. That picture, however, blurs what Reinforcement Learning really is; it is not that simple. It is therefore worth studying the subject in more depth to be able to tell what it is and how it works.

The most important book in Reinforcement Learning is Sutton and Barto (2018). They describe very clearly four elements that complement the intelligent agent studied in the previous post. Those basic elements are:

  • Policy: This is how the agent behaves. In the previous post the agent had a set of rules, which is closely aligned with the policy described here. Nevertheless, in Reinforcement Learning that table starts out empty, and the agent should be able to learn it on its own. Considering the intelligent agent version 1 (v1) from the previous post, it perceives through its temperature sensor and then activates or deactivates the cooling compressor. The reader may have noticed that learning such a policy is impossible without something else added to our intelligent agent.
  • Reward: This is the goal of a Reinforcement Learning agent. But how does this match what we learned from Russell and Norvig (2020) when they describe goal-based agents? In our intelligent agent v1 there is no reward, only the temperature measured by the sensor. This happens because the goal is not clearly visible; it was already implemented in the agent's decision table. The agent's implicit goal was to keep the environment temperature, as reported by its sensor, at the temperature desired by the user. With that goal in mind, is it possible to maximize it? If the agent only reads the temperature from its sensor, the reader can picture the reward as how close that temperature is to the user setting. If the agent reads that temperature every second, it will find out the reward at every step. For instance, if the agent chooses to deactivate the cooling compressor and the reward decreases, the environment is telling the agent that its policy might be wrong and that it should change it in order to maximize the immediate future reward.
  • Value function: While the reward is the immediate reading of the agent's goal, what the environment is telling the agent is happening right now, the value function deals with the question of what set of actions could maximize the sum of rewards in the long run (the return written out after this list makes this precise).
  • Model: This element is not present in every Reinforcement Learning agent. It is a model of the environment that the agent can use to plan which set of actions could lead to achieving its goal. This element deals with the future in the same way a value function does. When picturing how well the agent will do in the long run, there are two approaches: (1) model-free and (2) model-based (the sketch after this list follows the first one).
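To make the "long run" in the value-function item precise, this is the quantity a value function estimates, written in the standard notation of Sutton and Barto (2018): the discounted return, the sum of future rewards weighted by a discount factor gamma between 0 and 1, and the value of a state as the expected return when following a policy pi from that state.

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad v_\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right]$$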
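As a toy illustration of how these elements fit together, here is a minimal, model-free sketch of the intelligent agent v1 in Python. Everything specific in it is invented for the example: the 22 °C setpoint, the way the room warms and the compressor cools, and the learning parameters. The reward is simply how close the measured temperature is to the user setting, the policy is the table of action values the agent fills in on its own, and the tabular Q-learning update is one standard way of estimating the value function without a model of the environment.

```python
import random

SETPOINT = 22.0           # user's desired temperature (a made-up value)
ACTIONS = ["compressor_off", "compressor_on"]

def reward(temperature):
    # The reward is how close the measured temperature is to the user setting.
    return -abs(temperature - SETPOINT)

def step(temperature, action):
    # Toy environment dynamics: the compressor cools the room, otherwise it warms up.
    return temperature - 0.5 if action == "compressor_on" else temperature + 0.3

def discretize(temperature):
    # Crude preprocessing that turns the raw sensor reading into a state.
    return round(temperature)

q_table = {}              # the agent's (initially empty) table of action values

def action_values(state):
    return q_table.setdefault(state, [0.0, 0.0])

def choose_action(state, epsilon=0.1):
    # Epsilon-greedy policy: mostly exploit what was learned, sometimes explore.
    values = action_values(state)
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    return values.index(max(values))

alpha, gamma = 0.1, 0.9   # learning rate and discount factor (assumed values)
temperature = 28.0
state = discretize(temperature)
for _ in range(1000):
    action = choose_action(state)
    temperature = step(temperature, ACTIONS[action])
    next_state = discretize(temperature)
    r = reward(temperature)
    # Q-learning update: move the value of (state, action) toward the immediate
    # reward plus the discounted value of the best action in the next state.
    best_next = max(action_values(next_state))
    action_values(state)[action] += alpha * (r + gamma * best_next - action_values(state)[action])
    state = next_state

print("Learned action values near the setpoint:", q_table.get(round(SETPOINT)))
```

Starting with an empty table, the agent gradually learns to switch the compressor off when the room is below the setpoint and on when it is above, which is exactly the rule that was hard-coded in the v1 agent of the previous post.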

We have avoided using the term state broadly, because we chose a very simplistic example that is enough to understand the basic elements you have just read. The definition of state according to Sutton and Barto (2018) answers the question of "how the environment is at a particular time". The authors also point out the importance of not sticking only to the formal definition of state through Markov decision processes, as presented in their book. Still, it is very relevant that they assume "the state is produced by some preprocessing system" that belongs to the agent's environment.
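For readers who do want the formal side, the state's role in a finite Markov decision process is captured by its dynamics: the probability of the next state and reward depends only on the current state and action, not on the rest of the history. In the notation used by Sutton and Barto (2018):

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$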


References:

  • Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd ed. Adaptive Computation and Machine Learning Series. Bradford Books, 2018.
  • Russell, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach. 4th ed. Pearson, 2020.