Introduction to Reinforcement Learning
- Nagesh Singh Chauhan
From self-driving cars to game-playing machines to self-learning robots—how intelligent systems learn by interacting with the world

Reinforcement Learning (RL) is a branch of machine learning that focuses on decision-making under uncertainty. Unlike supervised learning, where models learn from labeled examples, or unsupervised learning, which discovers hidden patterns, reinforcement learning learns by interacting with an environment and receiving feedback in the form of rewards.
RL is inspired by how humans and animals learn through experience: we try actions, observe outcomes, and adjust our behavior to achieve better results over time. This paradigm makes reinforcement learning particularly powerful for sequential decision problems, where each action affects not only the immediate outcome but also future possibilities.
Today, reinforcement learning powers some of the most impressive AI systems—ranging from AlphaGo and autonomous vehicles to recommendation systems, robotics, and dynamic pricing engines.
What is Reinforcement Learning?
At its core, reinforcement learning is a framework in which an agent learns to make decisions by interacting with an environment to maximize cumulative reward over time.
Formally, reinforcement learning answers the question:
What is the best sequence of actions an agent should take to maximize long-term reward in an uncertain environment?
Key characteristics of reinforcement learning include:
Learning through trial and error
Delayed feedback (rewards may come much later)
Sequential dependency (current actions influence future states)
No explicit “correct” answers during training
Example: Consider teaching a dog to sit:
You don’t provide labeled examples of “correct posture”
The dog tries actions
You reward correct behavior
Over time, the dog learns a policy that maximizes treats

Standard RL training loop. The agent takes in the latest information from the environment (observation and reward) and uses it to choose the next action. The action is passed to the environment, which carries out the next step in the simulation. New information is passed back to the agent, which then learns from the observations and rewards. Credits
Agent and Environment
Reinforcement learning systems are built around two core components:
Agent
The agent is the learner or decision-maker. It:
Observes the environment
Chooses actions
Receives rewards
Learns a strategy (policy)
Environment
The environment is everything the agent interacts with. It:
Responds to actions
Transitions between states
Generates rewards
Example
Agent: A self-driving car
Environment: Roads, traffic signals, pedestrians, weather
Goal: Reach destination safely and quickly
The agent and environment form a feedback loop:
Agent takes an action
Environment responds with a new state and reward
Agent updates its behavior
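To make this loop concrete, here is a minimal sketch using the Gymnasium API (this assumes the gymnasium package and its CartPole-v1 task are available; a random policy stands in for a real learning agent):

```python
# Minimal agent-environment loop sketch, assuming the gymnasium package
# is installed and provides the CartPole-v1 environment.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

for step in range(200):
    action = env.action_space.sample()  # placeholder agent: pick a random action
    obs, reward, terminated, truncated, info = env.step(action)
    # A learning agent would update its policy here using the new obs and reward.
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```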
State
A state represents the current situation of the environment as perceived by the agent.
Examples:
Chess: The board configuration
Hotel pricing: Occupancy, competitor prices, day of week, upcoming events
Robot: Position, velocity, sensor readings
States can be:
Fully observable (agent sees everything)
Partially observable (agent sees only limited information)
Action
An action is a decision the agent can take in a given state.
Examples:
Chess: Move a piece
Pricing system: Increase price, decrease price, keep price unchanged
Robot: Move left, move right, stop
The set of all possible actions is called the action space, which can be:
Discrete (buy/sell/hold)
Continuous (steering angle, speed)
Rewards
The reward is the feedback signal that tells the agent how good or bad an action was.
Key properties:
Scalar value (positive, negative, or zero)
Can be immediate or delayed
Defines the agent’s objective
Examples
+1 for winning a game, −1 for losing
+Revenue for a hotel booking
−Penalty for collision in autonomous driving
Importantly, reinforcement learning does not try to maximize immediate reward—it maximizes cumulative future reward, often called the return.
This is what enables long-term planning.
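As a small illustration, the return is often defined as the discounted sum of future rewards, G = r₁ + γ r₂ + γ² r₃ + …, where γ (between 0 and 1) controls how much future rewards matter. The helper below is a minimal sketch; the reward values in the example are made up.

```python
# Compute the discounted return G = r_1 + gamma*r_2 + gamma^2*r_3 + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # fold rewards backwards from the end of the episode
    return g

print(discounted_return([0, 0, 10]))  # 0 + 0.9*0 + 0.81*10 = 8.1
```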
Time Step
A time step is a single interaction cycle:
Observe state
Take action
Receive reward
Move to next state
Episode
An episode is a complete sequence of time steps that ends in a terminal state.
Examples:
A full game of chess
A taxi ride from pickup to drop-off
One simulated day of hotel pricing decisions
Some environments are:
Episodic (clear start and end)
Continuous (no natural endpoint, e.g., stock trading)
Policy in Reinforcement Learning
A policy defines the agent’s behavior. Simply put, a policy tells the agent what action to take in a given state. It is the strategy that the agent follows while interacting with the environment, and learning an optimal policy is the central goal of most reinforcement learning algorithms.
Formally, a policy is a mapping from states to actions:
π : S → A
This means that when the agent is in state s, the policy π decides which action a to take.
If reinforcement learning were a game, the policy would be the rulebook the agent follows to play the game well.
Deterministic vs Stochastic Policies
Deterministic Policy
A deterministic policy always selects the same action for a given state.
a = π(s)
Example: In hotel pricing, a deterministic policy might be:
“If occupancy is above 80%, increase price by 10%.”
This approach is simple but can be too rigid in uncertain environments.
Stochastic Policy
A stochastic policy assigns probabilities to actions.
π(a | s) = probability of taking action a in state s
Example:
Increase price: 60%
Keep price same: 30%
Decrease price: 10%
Stochastic policies naturally encourage exploration and are widely used in complex environments.
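As a rough illustration, sampling from such a stochastic policy can look like the sketch below; the action names and probabilities simply mirror the example above and are not taken from any real pricing system.

```python
# Sample an action from a stochastic pricing policy (illustrative only).
import random

def stochastic_policy(state):
    # In practice the probabilities would be computed from the state;
    # here they are fixed to mirror the example above.
    actions = ["increase_price", "keep_price", "decrease_price"]
    probabilities = [0.6, 0.3, 0.1]
    return random.choices(actions, weights=probabilities, k=1)[0]

print(stochastic_policy({"occupancy": 0.85}))
```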
Policy-Based vs. Value-Based vs. Actor-Critic Methods
There are three main ways to solve reinforcement learning problems:
Value-based methods: Learn how good states or actions are (e.g., Q-learning), and derive the policy by choosing the action with the highest value.
Policy-based methods: Learn the policy directly, without computing value tables first (e.g., policy gradient methods).
Actor-Critic methods: These combine both value-based and policy-based approaches, using a critic to evaluate the policy (the actor) and guide its updates.
Key difference:
Value-based methods ask “How good is this action?”
Policy-based methods ask “What should I do?”
Optimal Policy
The optimal policy is the one that maximizes expected cumulative reward over time.
π* = argmaxπ E[ G | π ], where G is the cumulative (discounted) future reward, i.e., the return
This policy may sacrifice short-term reward for better long-term outcomes.
Exploration vs. Exploitation
A central challenge in reinforcement learning is the exploration–exploitation trade-off. At each step, an agent must decide whether to exploit existing knowledge to gain immediate reward or explore new actions that might lead to better outcomes in the future.
Exploitation means choosing actions that have already proven to yield high rewards. This approach is safe and effective in the short term, but if relied on too heavily, it can prevent the agent from discovering better strategies and lead to suboptimal performance.
Exploration, on the other hand, involves trying new or uncertain actions. While this can result in lower immediate rewards, it is essential for learning and for understanding the full potential of the environment.

For the sake of clarity: the large zig-zags represent variation and exploratory behavior, while narrower zig-zags and straight lines represent predictable, exploitative behavior. Credits
The most famous formulation of the exploration–exploitation dilemma is the Multi-Armed Bandit Problem. It is an idealized academic setting that comes with some simplifying assumptions. I have covered it in great detail here.
For example, in a restaurant recommendation system, exploitation would repeatedly suggest a user's favorite restaurant, while exploration would occasionally recommend a new restaurant that the user might like even more. A good reinforcement learning system carefully balances both: too much exploitation limits learning, and too much exploration harms performance. Two common approaches to balancing the exploration–exploitation trade-off are ε-greedy learning and Boltzmann (softmax) exploration.
Epsilon-Greedy Learning vs. Boltzmann Exploration
While both aim to encourage exploration, they differ in how actions are selected.
Epsilon-Greedy Learning
Idea: Most of the time, choose the best-known action; occasionally, explore randomly.
How it works:
With probability ε (epsilon) → choose a random action (exploration)
With probability 1 − ε → choose the action with the highest Q-value (exploitation)
Mathematically:
a = argmaxₐ′ Q(s, a′)   with probability 1 − ε (exploit)
a = random action       with probability ε (explore)
Intuition:
Simple and easy to implement
Exploration is uniform: all non-greedy actions are equally likely
Often ε is decayed over time (start exploring more, exploit later)
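A minimal sketch of ε-greedy selection, assuming Q-values are kept in a dictionary keyed by (state, action) pairs:

```python
# Epsilon-greedy action selection over a dictionary of Q-values.
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)  # explore: uniform random action
    # exploit: action with the highest estimated Q-value (unseen pairs default to 0)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```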
Boltzmann (Softmax) Exploration
Idea: Choose actions probabilistically based on their Q-values—better actions are more likely, but worse ones are still possible.
How it works
Actions are selected using a softmax distribution:
Mathematically:
P(a | s) = exp( Q(s, a) / τ ) / Σₐ′ exp( Q(s, a′) / τ )
Where:
τ (temperature) controls randomness
High τ → more exploration
Low τ → more exploitation
Intuition
Actions with higher Q-values are chosen more often
Exploration is guided, not random
As τ → 0, behavior becomes greedy
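Under the same dictionary-of-Q-values assumption, a minimal Boltzmann selection sketch looks like this:

```python
# Boltzmann (softmax) action selection; tau is the temperature.
import math
import random

def boltzmann(Q, state, actions, tau=1.0):
    prefs = [Q.get((state, a), 0.0) / tau for a in actions]
    m = max(prefs)  # subtract the max for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    return random.choices(actions, weights=weights, k=1)[0]
```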
ε-greedy explores randomly, while Boltzmann exploration explores intelligently—favoring better actions without fully committing too early.
Model-Based Reinforcement Learning
Model-based reinforcement learning is an approach in which the agent has access to, or learns, an explicit model of the environment. Instead of relying only on trial-and-error interactions, the agent uses this model to understand how the environment behaves and to reason about the consequences of its actions before executing them.
The environment model typically includes the following components:
State transition dynamics: These describe how the environment changes from one state to another when a specific action is taken. In simple terms, it answers the question: “If I take action A in state S, what state will I end up in?”
Reward function: This specifies the reward the agent will receive after taking an action in a given state, helping the agent evaluate whether an outcome is desirable.
By using this model, the agent can:
Simulate future outcomes without interacting with the real environment
Plan ahead by evaluating multiple possible action sequences and choosing the one that leads to the highest long-term reward
Examples
A chess engine that simulates many future moves and countermoves before selecting the best one
A robot that plans an optimal path to a destination while avoiding obstacles before it starts moving
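To illustrate the planning idea, the sketch below performs a one-step lookahead with a hypothetical transition-and-reward model and a rough value estimate; the toy corridor model, goal position, and value function are all made up for illustration.

```python
# One-step lookahead planning with a known (here: toy) environment model.
def plan_one_step(state, actions, model, value_estimate, gamma=0.9):
    best_action, best_score = None, float("-inf")
    for a in actions:
        next_state, reward = model(state, a)  # simulate: "what happens if I take a?"
        score = reward + gamma * value_estimate(next_state)
        if score > best_score:
            best_action, best_score = a, score
    return best_action

# Toy model: a 1-D corridor where the goal is at position 5.
model = lambda s, a: (s + a, 1.0 if s + a == 5 else 0.0)
value = lambda s: -abs(5 - s)  # states closer to the goal look better
print(plan_one_step(3, [-1, +1], model, value))  # -> 1 (move toward the goal)
```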
A powerful extension of model-based reinforcement learning is the idea of world models. World models are learned internal representations of the environment—often implemented using neural networks—that allow the agent to “imagine” how the world might evolve. Instead of modeling the environment in full detail, a world model captures the essential dynamics in a compact form, enabling the agent to test actions in its own simulated world. This is particularly useful when real-world interactions are expensive, slow, or risky.
Advantages
Sample-efficient: Requires fewer real interactions because learning can happen through simulation
Strong planning capability: Enables long-term reasoning and foresight
Limitations
Model accuracy is critical: Poor or biased models lead to poor decisions
Hard to scale: Learning accurate models for complex, real-world environments is challenging
Model-Free Reinforcement Learning
Model-free reinforcement learning is an approach where the agent learns optimal behavior directly through interaction with the environment, without building an explicit model of how the environment works. Instead of predicting future states or rewards, the agent relies on trial and error to learn which actions yield the best long-term outcomes.
In this setting, the agent does not build a detailed understanding of the environment. Picture a cat, Bob: he simply acts, observes the result, receives a reward or penalty, and adjusts his behavior. Rather than reasoning about where the scratching posts are, Bob scratches different pieces of furniture, remembers which actions were most rewarding, and gradually favors the best options.
The main advantage of model-free reinforcement learning is its simplicity and adaptability, making it well suited for complex or changing environments where building an accurate model is difficult. However, because learning depends entirely on experience, it is often less sample-efficient, requiring many interactions before consistently finding optimal behavior.
Several widely used reinforcement learning algorithms fall under the model-free category:
Q-Learning: learns a value, called a Q-value, for each state–action pair. This value represents the expected future reward of taking a particular action in a given state. By repeatedly updating these values through experience, the agent learns to choose actions with the highest Q-values, thereby maximizing long-term reward.
SARSA (State–Action–Reward–State–Action): similar to Q-learning but differs in how it updates its value estimates. It updates the state–action value based on the actual action taken in the next state, making it an on-policy method. This often results in safer, more conservative learning behavior in uncertain environments.
Policy Gradient Methods: Instead of learning values for states or actions, policy gradient methods learn the policy directly—a function that maps states to actions. These methods use gradient-based optimization to adjust the policy in the direction that increases expected reward. Popular examples include REINFORCE and Proximal Policy Optimization (PPO).
Deep Q-Networks (DQN): Deep Q-Networks extend Q-learning by using deep neural networks to approximate the Q-function. This allows model-free reinforcement learning to scale to high-dimensional state spaces, such as raw images in video games, where traditional tabular methods are infeasible.
Q-learning
Q-learning is one of the most popular and intuitive algorithms in model-free reinforcement learning. Its goal is simple:
learn which action is best to take in each situation by learning from experience.
Instead of trying to understand how the environment works, Q-learning focuses on answering one question repeatedly:
“If I am in this state and take this action, how good will it be in the long run?”
Core Idea of Q-Learning
Q-learning learns a function called the Q-function:
Q(s, a) = expected long-term reward of taking action a in state s and acting optimally afterwards
s (state): current situation
a (action): decision taken
Q(s, a): quality (goodness) of that decision
Higher Q-values mean better actions.
Q-Table (Intuition)
In simple environments, Q-learning stores values in a Q-table.

Rows → States
Columns → Actions
Values → Expected long-term reward
The agent simply chooses the action with the highest Q-value for a given state.
How Q-Learning Works (Step by Step)
Start with an empty or randomly initialized Q-table
Observe the current state s
Choose an action a (using exploration or exploitation)
Execute the action
Receive a reward r and observe the next state s′
Update the Q-value
Repeat until learning stabilizes
Q-Learning Update Equation
The heart of Q-learning is its update rule:
Q(s, a) ← Q(s, a) + α [ r + γ maxₐ′ Q(s′, a′) − Q(s, a) ]
where α is the learning rate, γ is the discount factor, and maxₐ′ Q(s′, a′) is the value of the best action available in the next state.
Intuitive Explanation of the Formula
New Q-value = Old Q-value + correction
The correction is based on:
What reward I just got
How good the future looks from here
How wrong my old estimate was
Over time, these updates converge to the optimal values.
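Putting the update rule into code, here is a minimal tabular Q-learning sketch. It assumes a simplified environment interface (reset() returning a state, step() returning the next state, a reward, and a done flag) and reuses the epsilon_greedy helper sketched earlier; both are illustrative assumptions rather than a specific library API.

```python
# Tabular Q-learning sketch over a simplified environment interface.
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] defaults to 0.0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            # Bootstrap from the best action available in the next state.
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```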

Imagine a system that sees an apple but incorrectly says, “It’s a mango.” The system is told, “Wrong! It’s an apple.” It learns from this mistake, and the next time it is shown the apple, it correctly says, “It’s an apple.” This trial-and-error process, guided by feedback, is similar to how Q-Learning works. Source
Key Properties of Q-Learning
Model-free (no environment model needed)
Off-policy (learns optimal policy even while exploring)
Guaranteed convergence (under certain conditions)
Q-table grows large for complex problems
When Q-Learning Works Well
Small or medium-sized state spaces
Discrete actions
Stable environments
SARSA (State-Action-Reward-State-Action)
SARSA is a classic model-free, on-policy reinforcement learning algorithm used to learn optimal behavior through direct interaction with the environment. The name SARSA comes from the sequence of elements it uses to update its learning: State, Action, Reward, next State, next Action. Unlike Q-learning, which learns the value of the best possible action, SARSA learns the value of the action actually taken by the agent.
Core Idea
SARSA answers the question:
“How good is the action I actually take, given the policy I am following?”
Because it learns using the agent’s current policy (including exploration), SARSA naturally accounts for risk and uncertainty in action selection.
How SARSA Works (Step by Step)
Observe the current state s
Choose an action a using the current policy (e.g., ε-greedy)
Execute the action and receive a reward r
Observe the next state s′
Choose the next action a′ using the same policy
Update the value of Q(s,a)
Move to state s′ and action a′, and repeat
This explicit use of the next chosen action is what distinguishes SARSA from other algorithms.
SARSA Update Equation
The SARSA update rule is:
Q(s, a) ← Q(s, a) + α [ r + γ Q(s′, a′) − Q(s, a) ]
where a′ is the action actually chosen in the next state s′ under the current policy, α is the learning rate, and γ is the discount factor.
Intuitive Explanation
SARSA updates its knowledge based on what it actually does next, not what it could do if it always behaved optimally.
This makes SARSA more conservative and realistic in environments where risky exploration can be costly.
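A minimal tabular SARSA sketch follows, under the same simplified environment interface and epsilon_greedy helper assumed for the Q-learning sketch above; notice that the value of the action actually chosen next enters the update, not the maximum.

```python
# Tabular SARSA sketch: on-policy update using the next action actually taken.
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        action = epsilon_greedy(Q, state, actions, epsilon)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # Update toward r + gamma * Q(s', a') for the action actually chosen.
            Q[(state, action)] += alpha * (reward + gamma * Q[(next_state, next_action)]
                                           - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```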
In the classic cliff-walking problem:
Falling off the cliff gives a large negative reward
Q-learning tends to learn risky but optimal paths near the cliff
SARSA learns safer paths, because it accounts for the exploratory actions that might accidentally step off the cliff
This difference highlights why SARSA is often preferred in safety-critical environments.
When to Use SARSA
When exploration has real risk
When safety and stability matter
When learning must reflect real behavior, not idealized behavior
SARSA learns the value of actions the agent actually takes, making it an on-policy and safer alternative to Q-learning in uncertain environments.
Policy gradient methods
Policy gradient methods are a class of reinforcement learning algorithms that learn the policy directly, instead of learning value functions first and then deriving a policy from them. While value-based methods focus on estimating how good an action is, policy gradient methods focus on answering a more direct question:
“What action should I take in this state to maximize long-term reward?”
This makes policy gradient methods especially powerful in complex environments with large or continuous action spaces.
Core Idea
In policy gradient methods, the policy is represented as a parameterized function:
πθ(a | s)
where θ denotes the policy parameters (for example, the weights of a neural network) and πθ(a | s) is the probability of taking action a in state s.
Objective Function
The agent aims to maximize the expected return:
J(θ) = Eπθ[ r₁ + γ r₂ + γ² r₃ + … ]
Instead of computing this directly, policy gradient methods use gradient ascent to update the policy parameters in the direction that increases expected reward.
Policy Gradient Update Rule
A simplified form of the policy gradient update is:
θ ← θ + α ∇θ log πθ(a | s) · G
where G is the return observed after taking action a in state s, and α is the learning rate.
Intuition:
If an action leads to high reward → make it more likely
If it leads to low reward → make it less likely
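As a rough illustration, the sketch below applies a REINFORCE-style update to a tabular softmax policy; the parameter table, learning rate, and optional baseline are illustrative, and a complete agent would also need a loop that collects episodes from an environment.

```python
# REINFORCE-style update for a tabular softmax policy.
import math
from collections import defaultdict

theta = defaultdict(float)  # policy parameters: action preferences per (state, action)

def policy(state, actions):
    prefs = [theta[(state, a)] for a in actions]
    m = max(prefs)  # subtract the max for numerical stability
    exp_prefs = [math.exp(p - m) for p in prefs]
    z = sum(exp_prefs)
    return [e / z for e in exp_prefs]  # pi(a | s) for each action

def reinforce_update(episode, actions, lr=0.01, gamma=0.99, baseline=0.0):
    # episode is a list of (state, action, reward) tuples from one rollout.
    g = 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g  # return G_t from this step onwards
        probs = policy(state, actions)
        for a, p in zip(actions, probs):
            # Gradient of log pi(action | state) w.r.t. theta[(state, a)] for softmax.
            grad_log = (1.0 if a == action else 0.0) - p
            theta[(state, a)] += lr * (g - baseline) * grad_log
```

Subtracting a baseline (for example, an estimate of V(s)) from the return reduces the variance of these updates, which is the idea discussed below.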
Why Policy Gradient Methods Are Useful
They work naturally with continuous action spaces
They can learn stochastic policies, which help with exploration
They avoid the need for Q-tables or action-value maximization
Variance Reduction with Baselines
Pure policy gradient methods can be noisy. To reduce variance, a baseline is often subtracted from the return:
θ ← θ + α ∇θ log πθ(a | s) · (G − b(s))
A common choice for b is the state value function V(s). This idea leads directly to actor–critic methods.
Popular Policy Gradient Algorithms
REINFORCE: The simplest policy gradient algorithm; uses full-episode returns
Actor–Critic Methods: Combine policy gradients with value estimation for stability
Proximal Policy Optimization (PPO): A stable and widely used method that limits large policy updates
Policy gradient methods directly learn the decision-making policy by adjusting action probabilities in the direction that maximizes expected reward.
Deep Q-Networks (DQN)
DQNs are an extension of Q-learning that combine reinforcement learning with deep neural networks. They were introduced to overcome one of the biggest limitations of traditional Q-learning: the inability to scale to environments with large or high-dimensional state spaces. Instead of storing Q-values in a table, DQN uses a neural network to approximate the Q-function.
Why Is DQN Needed?
In classic Q-learning, Q-values are stored in a Q-table, which works well only when:
The number of states is small
States and actions are discrete
However, in many real-world problems—such as video games, robotics, or image-based inputs—the state space is enormous. For example, a raw game screen consists of thousands of pixels, making a Q-table infeasible.
DQN solves this by learning a function approximation:
Q(s, a; θ) ≈ Q*(s, a)
where θ represents the neural network parameters.
Core Idea
DQN uses a deep neural network that takes the current state as input and outputs a Q-value for each possible action. The agent then selects actions based on these predicted Q-values, typically using an ε-greedy strategy to balance exploration and exploitation.
The learning objective remains the same as Q-learning: estimate how good each action is in each state in terms of long-term reward.
DQN Learning Equation
The target value for training the network is:
y = r + γ maxₐ′ Q(s′, a′; θ)
The network parameters θ are updated by minimizing the loss:
L(θ) = E[ ( y − Q(s, a; θ) )² ]
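The sketch below computes this target and loss for one batch of transitions, assuming PyTorch is available; the network architecture, the randomly generated batch, and the hyperparameters are purely illustrative, and practical DQN implementations usually compute the target with a separate, periodically updated copy of the network.

```python
# One DQN training step on a dummy batch, assuming PyTorch is installed.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # 4-dim state, 2 actions
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

# Dummy batch standing in for transitions sampled from a replay buffer.
states      = torch.randn(32, 4)
actions     = torch.randint(0, 2, (32,))
rewards     = torch.randn(32)
next_states = torch.randn(32, 4)
dones       = torch.zeros(32)

# Current estimates Q(s, a; theta) for the actions actually taken.
q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

# Targets y = r + gamma * max_a' Q(s', a'; theta); no gradient flows through the target.
with torch.no_grad():
    targets = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=1).values

loss = nn.functional.mse_loss(q_sa, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```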
For example, in an Atari game:
State: Raw pixel image of the screen
Actions: Move left, move right, fire
Reward: Game score
DQN learns to map raw images directly to action values, eventually achieving or surpassing human-level performance in several games.
Extensions of DQN
Several improvements have been proposed:
Double DQN: Reduces overestimation bias
Dueling DQN: Separates state-value and advantage
Prioritized Experience Replay: Samples important experiences more often
Deep Q-Networks replace Q-tables with neural networks, enabling Q-learning to scale to complex, high-dimensional environments like games and vision-based tasks.
Conclusion
Reinforcement Learning represents a powerful shift in how we build intelligent systems—from programming fixed rules to enabling machines to learn by interacting with their environment. By framing problems as sequential decision-making tasks, reinforcement learning allows agents to improve through trial and error, guided by rewards rather than explicit instructions.
Throughout this article, we explored the core building blocks of reinforcement learning, including agents and environments, states and actions, rewards, episodes, and the critical balance between exploration and exploitation. We examined both model-based and model-free approaches, along with widely used algorithms such as Q-learning, SARSA, policy gradient methods, actor–critic methods, and Deep Q-Networks.
Ultimately, reinforcement learning is not just about maximizing rewards; it is about learning how to act optimally over time in uncertain and evolving worlds. As these methods mature, they will play an increasingly central role in building adaptive, autonomous, and intelligent systems across industries.


