top of page

Q-Learning in Reinforcement Learning

  • Writer: Nagesh Singh Chauhan
    Nagesh Singh Chauhan
  • 3 days ago
  • 9 min read
"The agent that learns to predict its own future is the agent that learns to control it." — Richard Sutton

Introduction


Imagine you are teaching a dog a new trick. You don't hand the dog a rulebook. Instead, you reward the dog when it does something right and ignore or correct it when it doesn't. Over time, through trial and error, the dog builds an internal model of which behaviours lead to treats — and acts accordingly.


This is, in essence, Reinforcement Learning (RL): an agent learns to make decisions by interacting with an environment, receiving feedback in the form of rewards, and gradually improving its behaviour.


Q-Learning, introduced by Christopher Watkins in his 1989 PhD thesis, is one of the most elegant and impactful algorithms ever developed in RL. It is model-free (no prior knowledge of the environment required), off-policy (learns the optimal strategy even while exploring), and provably convergent for finite problems.


It powers Atari-playing agents, game-playing AIs, robotics controllers, and recommendation systems. Understanding Q-Learning deeply is a prerequisite for anyone serious about reinforcement learning.


The Reinforcement Learning Framework


Before diving into Q-Learning, we need to understand the environment it operates in: the Markov Decision Process (MDP).



An MDP is defined by the tuple (S, A, P, R, γ):

Symbol

Name

Description

S

State space

All possible situations the agent can be in

A

Action space

All possible actions the agent can take

P(s'|s,a)

Transition probability

Probability of reaching s' from s after action a

R(s,a,s')

Reward function

Immediate reward received after a transition

γ

Discount factor

How much future rewards are valued (0 < γ ≤ 1)


The agent's goal is to find a policy π(a|s) — a mapping from states to actions — that maximises the expected cumulative discounted reward:



The discount factor γ is crucial. With γ = 0, the agent only cares about immediate reward. With γ close to 1, it values future rewards almost as much as immediate ones. In practice, γ = 0.9 to 0.99 is common.


The Markov Property: The next state depends only on the current state and action — not on the full history. This is what makes the math tractable.


What Is the Q-Function?


There are two key value functions in RL:


State-value function V^π(s): Expected return starting from state s, following policy π:

Action-value function Q^π(s, a): Expected return starting from state s, taking action a, then following policy π:

Q-Learning focuses on learning the optimal Q-function, Q*(s, a):

This is the maximum expected return achievable from state s by taking action a and then acting optimally forever after.


The relationship between Q* and V* is simple:

And the optimal policy drops out trivially:

This is the power of Q-Learning — once you have Q*, you have the optimal policy for free. No separate policy learning required.


Visualising the Q-Table


For small, discrete state-action spaces, Q(s, a) is stored as a table — rows are states, columns are actions:


The agent reads its current state, looks up the row, and picks the column with the highest value. That's the greedy action.


The Bellman Equation


The Bellman equation is the mathematical backbone of Q-Learning. It expresses a recursive relationship: the value of a state-action pair is the immediate reward plus the discounted value of the next best state.


Bellman Expectation Equation (for a given policy π):


Bellman Optimality Equation (for the optimal policy):


Expanded:


This equation says: the optimal value of (s, a) equals the expected immediate reward, plus the discounted value of being in the best possible next state.


Q-Learning works by approximating this equation empirically, sample by sample, without knowing P or R in advance.


The Q-Learning Update Rule


After observing a transition (St,At,Rt+1,St+1)(S_t, A_t, R_{t+1}, S_{t+1}) (St​,At​,Rt+1​,St+1​), Q-Learning performs:


where δt​ is the TD error:

Let's unpack this:


Intuition: If δ > 0, the outcome was better than expected → increase Q. If δ < 0, worse than expected → decrease Q. If δ = 0, our estimate was perfect → no change.


This is exactly how humans learn from prediction errors. Neuroscience has shown that dopamine neurons in the brain fire in patterns remarkably similar to TD errors — Q-Learning is, in a sense, biologically inspired.


The Full Algorithm


─────────────────────────────────────────────────────────
  Q-LEARNING ALGORITHM
─────────────────────────────────────────────────────────
  Input: α (learning rate), γ (discount), ε (exploration)
  Output: Optimal Q-table Q*(s, a)

  1. Initialise Q(s, a) = 0 for all s ∈ S, a ∈ A
     (or small random values)

  2. For each episode:
     a. Initialise starting state S

     b. For each step t = 0, 1, 2, ...:
        i.  Choose action A using ε-greedy on Q(S, ·)
        ii. Execute A, observe reward R and next state S'
        iii. Update:
             Q(S,A) ← Q(S,A) + α [ R + γ · max_a Q(S',a) − Q(S,A) ]
        iv. S ← S'
        v.  If S is terminal: break

  3. Return Q

Exploration vs Exploitation


A core tension in RL is the explore-exploit dilemma:

  • Exploit: Take the action you currently think is best (greedy)

  • Explore: Try something new — you might discover something better


ε-Greedy Policy is the standard approach:


Decaying ε over training:



Other exploration strategies:

Strategy

How It Works

Best For

ε-Greedy

Random with prob ε

Most cases, simple baseline

Boltzmann/Softmax

Sample proportional to Q values

Smooth exploration

UCB

Optimism in face of uncertainty

Bandits, structured exploration

Noisy Nets

Add learnable noise to network weights

DQN variants

Curiosity-driven

Reward novelty intrinsically

Sparse reward environments


Gridworld: A Step-by-Step Walkthrough


Let's trace Q-Learning through a simple 4×4 gridworld.



Setup:

  • α = 0.1, γ = 0.9, ε = 0.3

  • All Q values initialised to 0


Episode 1, Step 1:

  • State S = 0 (top-left)

  • ε-greedy picks random action → → (right), lands in state 1

  • R = 0 (not terminal)

  • δ = 0 + 0.9 × max(0,0,0,0) - 0 = 0

  • Q(0, →) = 0 + 0.1 × 0 = 0 (no update yet — all Q are 0)


Episode 1, Step N (reaches the goal):

  • State S = 14, action → → lands in state 15 (Goal)

  • R = +1

  • δ = 1 + 0.9 × 0 - 0 = 1.0

  • Q(14, →) = 0 + 0.1 × 1.0 = 0.1 ← first real update!


Episode 2 (reward propagates backward):

  • State 13 → action → → state 14 (Q(14,→) = 0.1)

  • δ = 0 + 0.9 × 0.1 - 0 = 0.09

  • Q(13, →) = 0 + 0.1 × 0.09 = 0.009


Over many episodes, rewards propagate backward through the state space until all Q values accurately reflect the long-term consequences of each action.


Converged Q-table (selected states):



The agent has learned: "From state 14, go right — that's worth 0.85."


Python Implementation on Frozen Lake



import numpy as np
import gym

# ── Environment ──────────────────────────────────────────
env = gym.make("FrozenLake-v1", is_slippery=False)
n_states  = env.observation_space.n   # 16
n_actions = env.action_space.n        # 4

# ── Hyperparameters ──────────────────────────────────────
EPISODES    = 10_000
ALPHA       = 0.1      # learning rate
GAMMA       = 0.99     # discount factor
EPSILON     = 1.0      # initial exploration rate
EPS_DECAY   = 0.995
EPS_MIN     = 0.01

# ── Q-Table ──────────────────────────────────────────────
Q = np.zeros((n_states, n_actions))

rewards_per_episode = []

# ── Training Loop ─────────────────────────────────────────
for episode in range(EPISODES):
    state, _ = env.reset()
    total_reward = 0
    done = False

    while not done:
        # ε-greedy action selection
        if np.random.random() < EPSILON:
            action = env.action_space.sample()          # explore
        else:
            action = np.argmax(Q[state])                # exploit

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-Learning update
        td_target = reward + GAMMA * np.max(Q[next_state]) * (not done)
        td_error  = td_target - Q[state, action]
        Q[state, action] += ALPHA * td_error

        state = next_state
        total_reward += reward

    # Decay epsilon
    EPSILON = max(EPS_MIN, EPSILON * EPS_DECAY)
    rewards_per_episode.append(total_reward)

# ── Results ───────────────────────────────────────────────
print(f"Average reward (last 1000 episodes): "
      f"{np.mean(rewards_per_episode[-1000:]):.3f}")

# Visualise learned policy
actions_map = {0: '←', 1: '↓', 2: '→', 3: '↑'}
policy = np.array([actions_map[np.argmax(Q[s])] for s in range(n_states)])
print("\nLearned Policy:")
print(policy.reshape(4, 4))

# ── Output Example ────────────────────────────────────────
# Average reward (last 1000 episodes): 0.974
#
# Learned Policy:
# [['→' '↑' '→' '↑']
#  ['↓' '←' '↓' '←']
#  ['→' '↓' '↓' '←']
#  ['←' '→' '↓' '←']]

Q-Table visualised after training:




Q-Learning vs SARSA vs Monte Carlo


These are the three main families of model-free RL algorithms. Understanding their differences is essential.


Property

Monte Carlo

SARSA (On-Policy TD)

Q-Learning (Off-Policy TD)

Update timing

End of episode

Every step

Every step

Bootstrap?

No

Yes

Yes

Policy type

On-policy

On-policy

Off-policy

What it learns

Value of current policy

Value of current policy

Optimal Q* directly

Bias

Unbiased

Biased

Biased

Variance

High

Low

Low

Handles continuing tasks?

No

Yes

Yes

Risky environments

Neutral

Conservative

Aggressive (optimal)

Convergence

Slow (high variance)

Stable

Fast but can overestimate


Key Difference: Q-Learning vs SARSA


SARSA (on-policy — uses the action actually taken A′):


Q-Learning (off-policy — always uses the best possible next action):



Cliff Walking Example: On a cliff-edge path to the goal, SARSA learns to walk safely away from the cliff (because ε-random steps could fall off). Q-Learning finds the shortest path right along the cliff edge (theoretically optimal but risky with exploration).


Cliff Walking Example. Source
Cliff Walking Example. Source

The Overestimation Problem


Q-Learning has a systematic upward bias in its Q-value estimates. Here's why.


The TD target is:

If Q(S′,a) contains random noise εa​ for each action:


Taking the max over noisy estimates always picks the noisily high ones, systematically overestimating.


Illustration:



Double Q-Learning Fix


Use two independent Q networks, Q1Q_1 Q1​ and Q2Q_2 Q2​:


Standard Q-Learning (same network selects AND evaluates):

Double Q-Learning (Q1​ selects, Q2​ evaluates):


By decoupling selection from evaluation, the positive bias is broken. Even if Q1​ picks a noisily high action, Q2​'s independent estimate tends to be closer to the true value.


Empirical result: Double DQN reduced value overestimation by 50–90% on Atari games while improving median scores by ~15%.


Deep Q-Networks (DQN)


Tabular Q-Learning fails when state spaces are large or continuous (e.g., 84×84 pixel Atari frames = 33,000+ dimensional input). DQN (Mnih et al., DeepMind 2015) solves this by replacing the Q-table with a neural network.


Architecture


Loss Function


where the TD target uses the frozen target network θ−:

θ = online network weights (updated every gradient step)

θ− = target network weights (frozen for CC C steps, then copied from θ)


Modern Extensions


DQN opened the door to a family of improvements. Here is a roadmap of the major ones:



Convergence and the Deadly Triad


Convergence Guarantees (Tabular)


Q-Learning converges to Q^*  if:

  • All (s,a)(s, a) (s,a) pairs are visited infinitely often

  • Learning rates satisfy the Robbins-Monro conditions:

  • Rewards are bounded: ∣R∣ ≤ Rmax⁡​

  • Finite MDP (finite S and A)



The Deadly Triad


With function approximation, convergence guarantees break down. The deadly triad — identified by Sutton & Barto — is the combination of three elements that can cause divergence:


Any two alone are manageable. All three together can cause Q-values to spiral to infinity.


Why DQN works despite this: The experience replay buffer reduces the off-policy problem. The target network slows the bootstrapping instability. Together they tame (but don't eliminate) the deadly triad.


When to Use Q-Learning


Use Q-Learning when:

  • Action space is discrete and manageable (< ~100 actions)

  • State space is either small (tabular) or representable by a CNN/MLP (DQN)

  • You need an off-policy algorithm (want to reuse old experience)

  • Sample efficiency matters — Q-Learning is more sample-efficient than policy gradient

  • The task has clear, delayed rewards (Q-Learning handles sparse rewards better than immediate methods)


Conclusion


Q-Learning is, in many ways, the "linear regression" of reinforcement learning — a simple, powerful, well-understood algorithm that serves as the foundation for nearly everything built on top of it.


The key ideas to internalise:

  • Q(s, a) answers one question: "How much total reward can I expect if I take action a in state s and play optimally from there?"

  • The Bellman equation gives us a self-consistent definition of Q* — and Q-Learning finds it by resolving prediction errors, one step at a time

  • The TD error δ is the signal of surprise — positive means better than expected, negative means worse

  • Q-Learning is off-policy because it always bootstraps toward the best possible future, regardless of exploration

  • DQN makes Q-Learning scale to complex environments via neural network approximation, experience replay, and target networks

  • Every major modern RL system — from AlphaGo to robotics controllers — has Q-Learning or its descendants at its core


The algorithm is deceptively simple. Its update rule fits in one line. But the mathematical structure underlying it — Bellman optimality, contraction mappings, the interplay of exploration and exploitation — is deep and rich. Understanding Q-Learning deeply is not just about knowing the algorithm. It is about developing the intuition for how intelligent behaviour can emerge from nothing more than an agent trying to predict its own future.


Further Reading


  • Sutton & Barto — Reinforcement Learning: An Introduction (2nd ed.) — the bible of RL

  • Mnih et al. (2015) — Human-level control through deep reinforcement learning — the DQN paper

  • van Hasselt et al. (2016) — Deep Reinforcement Learning with Double Q-learning

  • Wang et al. (2016) — Dueling Network Architectures for Deep Reinforcement Learning

  • Schaul et al. (2016) — Prioritized Experience Replay

  • Hessel et al. (2017) — Rainbow: Combining Improvements in Deep Reinforcement Learning

  • Bellemare et al. (2017) — A Distributional Perspective on Reinforcement Learning



Comments


Follow

  • Facebook
  • Linkedin
  • Instagram
  • Twitter
Sphere on Spiral Stairs

©2026 by Intelligent Machines

bottom of page