
Introduction to Reinforcement Learning

Writer: Nagesh Singh Chauhan

From self-driving cars to game-playing machines to self-learning robots—how intelligent systems learn by interacting with the world




Reinforcement Learning (RL) is a branch of machine learning that focuses on decision-making under uncertainty. Unlike supervised learning, where models learn from labeled examples, or unsupervised learning, which discovers hidden patterns, reinforcement learning learns by interacting with an environment and receiving feedback in the form of rewards.


RL is inspired by how humans and animals learn through experience: we try actions, observe outcomes, and adjust our behavior to achieve better results over time. This paradigm makes reinforcement learning particularly powerful for sequential decision problems, where each action affects not only the immediate outcome but also future possibilities.


Today, reinforcement learning powers some of the most impressive AI systems—ranging from AlphaGo and autonomous vehicles to recommendation systems, robotics, and dynamic pricing engines.


What is Reinforcement Learning?


At its core, reinforcement learning is a framework in which an agent learns to make decisions by interacting with an environment to maximize cumulative reward over time.

Formally, reinforcement learning answers the question:

What is the best sequence of actions an agent should take to maximize long-term reward in an uncertain environment?

Key characteristics of reinforcement learning include:

  • Learning through trial and error

  • Delayed feedback (rewards may come much later)

  • Sequential dependency (current actions influence future states)

  • No explicit “correct” answers during training


Example: Consider teaching a dog to sit:

  • You don’t provide labeled examples of “correct posture”

  • The dog tries actions

  • You reward correct behavior

  • Over time, the dog learns a policy that maximizes treats


Standard RL training loop: the agent takes in the latest information from the environment (an observation and a reward) and uses it to choose the next action. The action is passed to the environment, which carries out the next step of the simulation. New information is passed back to the agent, which then learns from the observations and rewards it receives. Credits


Agent and Environment


Reinforcement learning systems are built around two core components:


Agent

The agent is the learner or decision-maker. It:

  • Observes the environment

  • Chooses actions

  • Receives rewards

  • Learns a strategy (policy)


Environment

The environment is everything the agent interacts with. It:

  • Responds to actions

  • Transitions between states

  • Generates rewards


Example

  • Agent: A self-driving car

  • Environment: Roads, traffic signals, pedestrians, weather

  • Goal: Reach destination safely and quickly


The agent and environment form a feedback loop:

  1. Agent takes an action

  2. Environment responds with a new state and reward

  3. Agent updates its behavior
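
To make this loop concrete, here is a minimal sketch using the Gymnasium toolkit (assuming the gymnasium package is installed); the random agent is only a placeholder for a learned policy:

```python
import gymnasium as gym

# One episode of the agent–environment loop on CartPole.
env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder agent: pick a random action
    state, reward, terminated, truncated, info = env.step(action)  # environment responds
    total_reward += reward              # a learning agent would update itself here
    done = terminated or truncated      # episode ends at a terminal state

print(f"Episode return: {total_reward}")
```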


State

A state represents the current situation of the environment as perceived by the agent.


Examples:

  • Chess: The board configuration

  • Hotel pricing: Occupancy, competitor prices, day of week, upcoming events

  • Robot: Position, velocity, sensor readings


States can be:

  • Fully observable (agent sees everything)

  • Partially observable (agent sees only limited information)


Action

An action is a decision the agent can take in a given state.


Examples:

  • Chess: Move a piece

  • Pricing system: Increase price, decrease price, keep price unchanged

  • Robot: Move left, move right, stop


The set of all possible actions is called the action space, which can be:

  • Discrete (buy/sell/hold)

  • Continuous (steering angle, speed)


Rewards

The reward is the feedback signal that tells the agent how good or bad an action was.


Key properties:

  • Scalar value (positive, negative, or zero)

  • Can be immediate or delayed

  • Defines the agent’s objective


Examples

  • +1 for winning a game, −1 for losing

  • +Revenue for a hotel booking

  • −Penalty for collision in autonomous driving


Importantly, reinforcement learning does not try to maximize immediate reward—it maximizes cumulative future reward, often called the return.


This is what enables long-term planning.
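
Concretely, with a discount factor γ between 0 and 1, the return from time step t is commonly written as

G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + …

so rewards received sooner count more than rewards received far in the future, while values of γ close to 1 place more weight on long-term outcomes.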


Time Step

A time step is a single interaction cycle:

  • Observe state

  • Take action

  • Receive reward

  • Move to next state


Episode

An episode is a complete sequence of time steps that ends in a terminal state.


Examples:

  • A full game of chess

  • A taxi ride from pickup to drop-off

  • One simulated day of hotel pricing decisions


Some environments are:

  • Episodic (clear start and end)

  • Continuous (no natural endpoint, e.g., stock trading)


Policy in Reinforcement Learning


A policy defines the agent’s behavior. Simply put, a policy tells the agent what action to take in a given state. It is the strategy that the agent follows while interacting with the environment, and learning an optimal policy is the central goal of most reinforcement learning algorithms.


Formally, a policy is a mapping from states to actions:

π : S → A,   a = π(s)

This means that when the agent is in state s, the policy π decides which action a to take.


If reinforcement learning were a game, the policy would be the rulebook the agent follows to play the game well.

Deterministic vs Stochastic Policies


Deterministic Policy

A deterministic policy always selects the same action for a given state.

a = π(s)

Example: In hotel pricing, a deterministic policy might be:

“If occupancy is above 80%, increase price by 10%.”

This approach is simple but can be too rigid in uncertain environments.


Stochastic Policy

A stochastic policy assigns probabilities to actions.

π(a | s) = P(A = a | S = s)

Example:

  • Increase price: 60%

  • Keep price same: 30%

  • Decrease price: 10%


Stochastic policies naturally encourage exploration and are widely used in complex environments.
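
To make the distinction concrete, here is a small illustrative sketch based on the hotel-pricing example; the threshold and probabilities are invented for illustration:

```python
import random

ACTIONS = ["increase_price", "keep_price", "decrease_price"]

def deterministic_policy(occupancy: float) -> str:
    # The same state always maps to the same action.
    return "increase_price" if occupancy > 0.8 else "keep_price"

def stochastic_policy(occupancy: float) -> str:
    # Assigns probabilities to actions and samples one,
    # which builds some exploration into the behavior.
    weights = [0.6, 0.3, 0.1] if occupancy > 0.8 else [0.2, 0.5, 0.3]
    return random.choices(ACTIONS, weights=weights, k=1)[0]

print(deterministic_policy(0.85))  # always "increase_price"
print(stochastic_policy(0.85))     # usually "increase_price", occasionally something else
```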


Policy-Based vs Value-Based vs Actor–Critic Methods


There are three main ways to solve reinforcement learning problems:


  • Value-based methods: Learn how good states or actions are (e.g., Q-learning), and derive the policy by choosing the action with the highest value.


  • Policy-based methods: Learn the policy directly, without computing value tables first (e.g., policy gradient methods).


  • Actor-Critic methods: These combine both value-based and policy-based approaches, using a critic to evaluate the policy (the actor) and guide its updates.


Key difference:

Value-based methods ask “How good is this action?”

Policy-based methods ask “What should I do?”


Optimal Policy

The optimal policy is the one that maximizes expected cumulative reward over time.

π* = argmax_π E[ Σ_t γ^t r_t ]

This policy may sacrifice short-term reward for better long-term outcomes.


Exploration vs. Exploitation


A central challenge in reinforcement learning is the exploration–exploitation trade-off. At each step, an agent must decide whether to exploit existing knowledge to gain immediate reward or explore new actions that might lead to better outcomes in the future.


Exploitation means choosing actions that have already proven to yield high rewards. This approach is safe and effective in the short term, but if relied on too heavily, it can prevent the agent from discovering better strategies and lead to suboptimal performance.


Exploration, on the other hand, involves trying new or uncertain actions. While this can result in lower immediate rewards, it is essential for learning and for understanding the full potential of the environment.



For the sake of clarity: the large zig-zags represent variation and exploratory behavior, while narrower zig-zags and straight lines represent predictable, exploitative behavior. Credits


The most famous formulation of the exploration–exploitation dilemma is the Multi-Armed Bandit Problem. It is an academic concept and comes with some academic assumptions; I have covered it in great detail here.


For example, in a restaurant recommendation system, exploitation would repeatedly suggest a user’s favorite restaurant, while exploration would occasionally recommend a new restaurant that the user might like even more. A good reinforcement learning system carefully balances both: too much exploitation limits learning, and too much exploration harms performance. Two common approaches to balancing the exploration–exploitation trade-off are ε-greedy learning and Boltzmann (softmax) exploration.


Epsilon-Greedy Learning vs. Boltzmann Exploration


While both aim to encourage exploration, they differ in how actions are selected.


Epsilon-Greedy Learning


Idea: Most of the time, choose the best-known action; occasionally, explore randomly.


How it works:

  • With probability ε (epsilon) → choose a random action (exploration)

  • With probability 1 − ε → choose the action with the highest Q-value (exploitation)


Mathematically:

a = argmax_a Q(s, a)      with probability 1 − ε
a = a random action       with probability ε

Intuition:

  • Simple and easy to implement

  • Exploration is uniform: all non-greedy actions are equally likely

  • Often ε is decayed over time (start exploring more, exploit later)
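
Putting the rule above into code, a minimal ε-greedy selection sketch might look like this (assuming the Q-values for the current state are stored in a NumPy array):

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng=None) -> int:
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        # Explore: every action is equally likely.
        return int(rng.integers(len(q_values)))
    # Exploit: pick the action with the highest estimated Q-value.
    return int(np.argmax(q_values))

q = np.array([1.2, 0.4, 0.9])            # illustrative Q-values for one state
action = epsilon_greedy(q, epsilon=0.1)  # explores roughly 10% of the time
```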


Boltzmann (Softmax) Exploration


Idea: Choose actions probabilistically based on their Q-values—better actions are more likely, but worse ones are still possible.


How it works

Actions are selected using a softmax distribution:


Mathematically:

π(a | s) = exp( Q(s, a) / τ ) / Σ_b exp( Q(s, b) / τ )

Where:

  • τ (temperature) controls randomness

    • High τ → more exploration

    • Low τ → more exploitation


Intuition

  • Actions with higher Q-values are chosen more often

  • Exploration is guided, not random

  • As τ → 0, behavior becomes greedy


ε-greedy explores randomly, while Boltzmann exploration explores intelligently—favoring better actions without fully committing too early.
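
A corresponding sketch of Boltzmann (softmax) action selection, with the temperature τ as a tunable parameter:

```python
import numpy as np

def boltzmann_action(q_values: np.ndarray, tau: float, rng=None) -> int:
    rng = np.random.default_rng() if rng is None else rng
    prefs = q_values / tau           # high tau flattens preferences (more exploration)
    prefs = prefs - prefs.max()      # shift for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

q = np.array([1.2, 0.4, 0.9])
print(boltzmann_action(q, tau=1.0))   # exploratory: all actions remain plausible
print(boltzmann_action(q, tau=0.05))  # nearly greedy: almost always the best action
```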

Model-Based Reinforcement Learning


Model-based reinforcement learning is an approach in which the agent has access to, or learns, an explicit model of the environment. Instead of relying only on trial-and-error interactions, the agent uses this model to understand how the environment behaves and to reason about the consequences of its actions before executing them.


The environment model typically includes the following components:


  • State transition dynamics: These describe how the environment changes from one state to another when a specific action is taken. In simple terms, it answers the question: “If I take action A in state S, what state will I end up in?”

  • Reward function: This specifies the reward the agent will receive after taking an action in a given state, helping the agent evaluate whether an outcome is desirable.


By using this model, the agent can:

  • Simulate future outcomes without interacting with the real environment

  • Plan ahead by evaluating multiple possible action sequences and choosing the one that leads to the highest long-term reward
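
As a rough illustration of planning with a model, the toy sketch below enumerates short action sequences against hand-written transition and reward tables (both invented for illustration) and picks the sequence with the highest simulated return:

```python
from itertools import product

# Toy model of a 2-state environment: transition and reward tables are illustrative only.
TRANSITIONS = {("s0", "a"): "s0", ("s0", "b"): "s1",
               ("s1", "a"): "s1", ("s1", "b"): "s0"}
REWARDS     = {("s0", "a"): 1.0, ("s0", "b"): 0.0,
               ("s1", "a"): 5.0, ("s1", "b"): 0.0}

def plan(start_state: str, horizon: int = 3, gamma: float = 0.9):
    """Simulate every action sequence in the model and return the best one."""
    best_seq, best_return = None, float("-inf")
    for seq in product("ab", repeat=horizon):
        state, total, discount = start_state, 0.0, 1.0
        for action in seq:
            total += discount * REWARDS[(state, action)]
            state = TRANSITIONS[(state, action)]
            discount *= gamma
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq, best_return

print(plan("s0"))  # e.g. ('b', 'a', 'a'): move to s1 first, then collect its higher reward
```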


Examples

  • A chess engine that simulates many future moves and countermoves before selecting the best one

  • A robot that plans an optimal path to a destination while avoiding obstacles before it starts moving


A powerful extension of model-based reinforcement learning is the idea of world models. World models are learned internal representations of the environment—often implemented using neural networks—that allow the agent to “imagine” how the world might evolve. Instead of modeling the environment in full detail, a world model captures the essential dynamics in a compact form, enabling the agent to test actions in its own simulated world. This is particularly useful when real-world interactions are expensive, slow, or risky.


Advantages

  • Sample-efficient: Requires fewer real interactions because learning can happen through simulation

  • Strong planning capability: Enables long-term reasoning and foresight


Limitations

  • Model accuracy is critical: Poor or biased models lead to poor decisions

  • Hard to scale: Learning accurate models for complex, real-world environments is challenging


Model-Free Reinforcement Learning


Model-free reinforcement learning is an approach where the agent learns optimal behavior directly through interaction with the environment, without building an explicit model of how the environment works. Instead of predicting future states or rewards, the agent relies on trial and error to learn which actions yield the best long-term outcomes.


In this setting, the agent—let’s say a cat named Bob—does not form a detailed understanding of the environment. He simply acts, observes the result, receives a reward or penalty, and adjusts his behavior. For example, rather than reasoning about where the scratching posts are, Bob scratches different pieces of furniture, remembers which actions are most rewarding, and gradually favors the best options.


The main advantage of model-free reinforcement learning is its simplicity and adaptability, making it well suited for complex or changing environments where building an accurate model is difficult. However, because learning depends entirely on experience, it is often less sample-efficient, requiring many interactions before consistently finding optimal behavior.


Several widely used reinforcement learning algorithms fall under the model-free category:


  • Q-Learning: learns a value, called a Q-value, for each state–action pair. This value represents the expected future reward of taking a particular action in a given state. By repeatedly updating these values through experience, the agent learns to choose actions with the highest Q-values, thereby maximizing long-term reward.


  • SARSA (State–Action–Reward–State–Action): similar to Q-learning but differs in how it updates its value estimates. It updates the state–action value based on the actual action taken in the next state, making it an on-policy method. This often results in safer, more conservative learning behavior in uncertain environments.


  • Policy Gradient Methods: Instead of learning values for states or actions, policy gradient methods learn the policy directly—a function that maps states to actions. These methods use gradient-based optimization to adjust the policy in the direction that increases expected reward. Popular examples include REINFORCE and Proximal Policy Optimization (PPO).


  • Deep Q-Networks (DQN): Deep Q-Networks extend Q-learning by using deep neural networks to approximate the Q-function. This allows model-free reinforcement learning to scale to high-dimensional state spaces, such as raw images in video games, where traditional tabular methods are infeasible.



Q-learning


Q-learning is one of the most popular and intuitive algorithms in model-free reinforcement learning. Its goal is simple:


learn which action is best to take in each situation by learning from experience.


Instead of trying to understand how the environment works, Q-learning focuses on answering one question repeatedly:

“If I am in this state and take this action, how good will it be in the long run?”

Core Idea of Q-Learning


Q-learning learns a function called the Q-function:


Q(s, a)

  • s (state): current situation

  • a (action): decision taken

  • Q(s, a): quality (goodness) of that decision


Higher Q-values mean better actions.


Q-Table (Intuition)


In simple environments, Q-learning stores values in a Q-table.



  • Rows → States

  • Columns → Actions

  • Values → Expected long-term reward


The agent simply chooses the action with the highest Q-value for a given state.


How Q-Learning Works (Step by Step)

  1. Start with an empty or randomly initialized Q-table

  2. Observe the current state s

  3. Choose an action a (using exploration or exploitation)

  4. Execute the action

  5. Receive a reward r and observe the next state s′

  6. Update the Q-value

  7. Repeat until learning stabilizes


Q-Learning Update Equation


The heart of Q-learning is its update rule:


Q(s, a) ← Q(s, a) + α [ r + γ · max_{a′} Q(s′, a′) − Q(s, a) ]

where:

  • α is the learning rate (how quickly new information overrides old estimates)

  • γ is the discount factor (how much future rewards count relative to immediate ones)

  • r is the reward just received, and s′ is the next state

Intuitive Explanation of the Formula

New Q-value = Old Q-value + correction

The correction is based on:

  • What reward I just got

  • How good the future looks from here

  • How wrong my old estimate was


Over time, these updates converge to the optimal values.
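
A compact tabular Q-learning sketch on Gymnasium's FrozenLake environment (assuming gymnasium and numpy are installed; the hyperparameters are illustrative, not tuned):

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")                     # small discrete environment
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1              # illustrative hyperparameters
rng = np.random.default_rng(0)

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # ε-greedy action selection
        if rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        # Q-learning update: move the old estimate toward reward + discounted best future value
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
        done = terminated or truncated
```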



Imagine a system that sees an apple but incorrectly says, “It’s a mango.” The system is told, “Wrong! It’s an apple,” and learns from this mistake. The next time it is shown the apple, it correctly says, “It’s an apple.” This trial-and-error process, guided by feedback, is similar to how Q-learning works. Source


Key Properties of Q-Learning


  • Model-free (no environment model needed)

  • Off-policy (learns optimal policy even while exploring)

  • Guaranteed convergence (under certain conditions)

  • Q-table grows large for complex problems


When Q-Learning Works Well


  • Small or medium-sized state spaces

  • Discrete actions

  • Stable environments


SARSA (State-Action-Reward-State-Action)


SARSA is a classic model-free, on-policy reinforcement learning algorithm used to learn optimal behavior through direct interaction with the environment. The name SARSA comes from the sequence of elements it uses to update its learning: State, Action, Reward, next State, next Action. Unlike Q-learning, which learns the value of the best possible action, SARSA learns the value of the action actually taken by the agent.


Core Idea

SARSA answers the question:

“How good is the action I actually take, given the policy I am following?”

Because it learns using the agent’s current policy (including exploration), SARSA naturally accounts for risk and uncertainty in action selection.


How SARSA Works (Step by Step)


  1. Observe the current state s

  2. Choose an action a using the current policy (e.g., ε-greedy)

  3. Execute the action and receive a reward r

  4. Observe the next state s′

  5. Choose the next action a′ using the same policy

  6. Update the value of Q(s,a)

  7. Move to state s′ and action a′, and repeat


This explicit use of the next chosen action is what distinguishes SARSA from other algorithms.


SARSA Update Equation


The SARSA update rule is:


Q(s, a) ← Q(s, a) + α [ r + γ · Q(s′, a′) − Q(s, a) ]

where a′ is the action actually chosen in the next state s′ under the current policy (compare with Q-learning, which uses max_{a′} Q(s′, a′) instead).

Intuitive Explanation


  • SARSA updates its knowledge based on what it actually does next, not what it could do if it always behaved optimally.

  • This makes SARSA more conservative and realistic in environments where risky exploration can be costly.
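
To highlight the difference from Q-learning, here is a sketch of the two update steps side by side (the helper function names are illustrative; Q is assumed to be a 2-D NumPy array indexed by state and action):

```python
import numpy as np

def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99):
    # SARSA target uses the action actually chosen in the next state (on-policy).
    target = reward + gamma * Q[next_state, next_action]
    Q[state, action] += alpha * (target - Q[state, action])

def q_learning_update(Q, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99):
    # Q-learning target uses the best possible next action (off-policy).
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```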


In the classic cliff-walking problem:


  • Falling off the cliff gives a large negative reward

  • Q-learning tends to learn risky but optimal paths near the cliff

  • SARSA learns safer paths, because it accounts for exploratory actions that might accidentally fall off


This difference highlights why SARSA is often preferred in safety-critical environments.


When to Use SARSA


  • When exploration has real risk

  • When safety and stability matter

  • When learning must reflect real behavior, not idealized behavior

SARSA learns the value of actions the agent actually takes, making it an on-policy and safer alternative to Q-learning in uncertain environments.

Policy Gradient Methods


Policy gradient methods are a class of reinforcement learning algorithms that learn the policy directly, instead of learning value functions first and then deriving a policy from them. While value-based methods focus on estimating how good an action is, policy gradient methods focus on answering a more direct question:

“What action should I take in this state to maximize long-term reward?”

This makes policy gradient methods especially powerful in complex environments with large or continuous action spaces.


Core Idea

In policy gradient methods, the policy is represented as a parameterized function:

π_θ(a | s)

Here θ denotes the policy parameters (for example, the weights of a neural network), and π_θ(a | s) gives the probability of taking action a in state s. Learning means adjusting θ so that actions leading to higher returns become more probable.

Objective Function


The agent aims to maximize the expected return:

J(θ) = E_{π_θ} [ Σ_t γ^t r_t ]

Instead of computing this directly, policy gradient methods use gradient ascent to update the policy parameters in the direction that increases expected reward.


Policy Gradient Update Rule


A simplified form of the policy gradient update is:

θ ← θ + α ∇_θ log π_θ(a | s) · G_t

where α is the learning rate and G_t is the return (cumulative discounted reward) obtained after taking action a in state s.

Intuition:

  • If an action leads to high reward → make it more likely

  • If it leads to low reward → make it less likely
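
A minimal REINFORCE-style sketch for a softmax policy in a toy, stateless setting (the reward values and the one-parameter-per-action policy are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
theta = np.zeros(n_actions)          # policy parameters: one preference per action
true_rewards = [1.0, 2.0, 0.5]       # hidden average reward of each action (toy environment)
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)                               # stochastic policy pi_theta(a)
    action = rng.choice(n_actions, p=probs)
    reward = true_rewards[action] + rng.normal(0, 0.1)   # noisy reward signal
    # REINFORCE: for a softmax policy, grad of log pi(a) is (one_hot(a) - probs)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += lr * reward * grad_log_pi                   # push high-reward actions up

print(softmax(theta))  # probability mass concentrates on the highest-reward action
```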


Why Policy Gradient Methods Are Useful


  • They work naturally with continuous action spaces

  • They can learn stochastic policies, which help with exploration

  • They avoid the need for Q-tables or action-value maximization


Variance Reduction with Baselines


Pure policy gradient methods can be noisy. To reduce variance, a baseline is often subtracted from the return:

∇_θ J(θ) ≈ E[ ∇_θ log π_θ(a | s) · ( G_t − b(s) ) ]

A common choice for b is the state value function V(s). This idea leads directly to actor–critic methods.


Popular Policy Gradient Algorithms


  • REINFORCE: The simplest policy gradient algorithm; uses full episode returns

  • Actor–Critic Methods: Combine policy gradients with value estimation for stability

  • Proximal Policy Optimization (PPO): A stable and widely used method that limits large policy updates


Policy gradient methods directly learn the decision-making policy by adjusting action probabilities in the direction that maximizes expected reward.

Deep Q-Networks (DQN)


DQNs are an extension of Q-learning that combine reinforcement learning with deep neural networks. They were introduced to overcome one of the biggest limitations of traditional Q-learning: the inability to scale to environments with large or high-dimensional state spaces. Instead of storing Q-values in a table, DQN uses a neural network to approximate the Q-function.


Why Is DQN Needed?


In classic Q-learning, Q-values are stored in a Q-table, which works well only when:

  • The number of states is small

  • States and actions are discrete


However, in many real-world problems—such as video games, robotics, or image-based inputs—the state space is enormous. For example, a raw game screen consists of thousands of pixels, making a Q-table infeasible.


DQN solves this by learning a function approximation:

Q(s, a; θ) ≈ Q*(s, a)

where θ represents the neural network parameters.


Core Idea

DQN uses a deep neural network that takes the current state as input and outputs a Q-value for each possible action. The agent then selects actions based on these predicted Q-values, typically using an ε-greedy strategy to balance exploration and exploitation.


The learning objective remains the same as Q-learning: estimate how good each action is in each state in terms of long-term reward.


DQN Learning Equation


The target value for training the network is:

y = r + γ · max_{a′} Q(s′, a′; θ)

The network parameters θ are updated by minimizing the loss:

L(θ) = E[ ( y − Q(s, a; θ) )² ]

i.e., the mean squared error between the target value y and the network’s current prediction Q(s, a; θ).

For example, in an Atari game:

  • State: Raw pixel image of the screen

  • Actions: Move left, move right, fire

  • Reward: Game score


DQN learns to map raw images directly to action values, eventually achieving or surpassing human-level performance in several games.
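
A sketch of the core DQN training step in PyTorch (assuming torch is installed; the network sizes and hyperparameters are illustrative, and real implementations also add experience replay and periodic target-network updates):

```python
import torch
import torch.nn as nn

n_obs, n_actions, gamma = 4, 2, 0.99                 # illustrative sizes

q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())       # periodically synced copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(states, actions, rewards, next_states, dones):
    # Expects batched tensors: states (B, n_obs), actions (B,) long, rewards/dones (B,) float.
    # Predicted Q(s, a) for the actions that were actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target: r + gamma * max_a' Q(s', a'), with no bootstrapping at terminal states.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1 - dones)
    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```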


Extensions of DQN


Several improvements have been proposed:


  • Double DQN: Reduces overestimation bias

  • Dueling DQN: Separates state-value and advantage

  • Prioritized Experience Replay: Samples important experiences more often


Deep Q-Networks replace Q-tables with neural networks, enabling Q-learning to scale to complex, high-dimensional environments like games and vision-based tasks.

Conclusion


Reinforcement Learning represents a powerful shift in how we build intelligent systems—from programming fixed rules to enabling machines to learn by interacting with their environment. By framing problems as sequential decision-making tasks, reinforcement learning allows agents to improve through trial and error, guided by rewards rather than explicit instructions.


Throughout this article, we explored the core building blocks of reinforcement learning, including agents and environments, states and actions, rewards, episodes, and the critical balance between exploration and exploitation. We examined both model-based and model-free approaches, along with widely used algorithms such as Q-learning, SARSA, policy gradient methods, actor–critic methods, and Deep Q-Networks.


Ultimately, reinforcement learning is not just about maximizing rewards; it is about learning how to act optimally over time in uncertain and evolving worlds. As these methods mature, they will play an increasingly central role in building adaptive, autonomous, and intelligent systems across industries.
