Playing Atari and MuJoCo with Deep Reinforcement Learning

Introduction

Reinforcement Learning (RL) has emerged as a dominant framework for enabling autonomous agents to master complex decision-making tasks through interaction with their environment (Sutton and Barto, 2018). This final project presents a systematic evaluation of Deep Reinforcement Learning (Deep RL) algorithms across two fundamental domains: discrete control from high-dimensional visual inputs (Atari 2600) and continuous control for robotic locomotion (MuJoCo).

The primary objective is to implement, analyze, and contrast the two prevailing paradigms in Deep RL:

Value-Based Methods (Discrete): I explore Deep Q-Networks (DQN) (Mnih et al., 2015) and its advanced variants (Double (Van Hasselt, Guez, and Silver, 2016), Dueling (Wang et al., 2016), Rainbow (Hessel et al., 2018)). The focus is on addressing the instability of Q-learning in high-dimensional state spaces and improving sample efficiency in environments like Breakout and Pong.
Policy-Based Methods (Continuous): I investigate Proximal Policy Optimization (PPO) (Schulman et al., 2017), a state-of-the-art on-policy algorithm. I analyze how trust-region constraints and generalized advantage estimation (GAE (Schulman et al., 2015)) enable stable learning in complex physical systems like Hopper, HalfCheetah, and Ant.

Beyond performance benchmarking, this report emphasizes the learning dynamics and implementation details that underpin these algorithms. I examine factors often glossed over in theory, such as observation normalization, reward scaling, and hyperparameter sensitivity. Through controlled experiments, I aim to separate the contribution of algorithmic innovations from that of engineering “tricks”, providing a transparent view of what makes these agents learn effectively.

Discrete Control

Problem Modeling and Environment Setup

This section introduces the problem modeling and environment setup for the discrete control tasks. I selected two classic Atari 2600 games: ALE/Breakout-v5 and ALE/Pong-v5.

ALE/Breakout-v5

The first environment I chose is the ALE/Breakout-v5 environment provided by the Arcade Learning Environment (ALE) (Bellemare et al., 2013) via Gymnasium (Towers et al., 2023). The goal is to control a paddle to bounce a ball and destroy a wall of bricks. The problem is modeled as a Markov Decision Process (MDP) (Sutton and Barto, 2018) defined by a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:

$\mathcal{S}$ is the state space. The raw observation from the environment is an RGB image of pixel size $210 \times 160 \times 3$. To reduce computational complexity and capture temporal information (velocity and direction of the ball), I employ standard preprocessing techniques used in Deep Q-Networks. The raw frames are converted to grayscale, resized to $84 \times 84$, and stacked. Thus, a state $s_t$ represents a stack of the 4 most recent frames:
$$s_t \in \mathbb{R}^{4 \times 84 \times 84}$$
$\mathcal{A}$ is the discrete action space. The agent controls the paddle movement. The action space consists of 4 discrete actions:
$$\mathcal{A} = \{ \text{NOOP}, \text{FIRE}, \text{RIGHT}, \text{LEFT} \}$$
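$\mathcal{P}$ is the state transition probability. The transitions are governed by the game logic of the Atari emulator, which is itself deterministic: the next state $s_{t+1}$ depends solely on the current state $s_t$ and the chosen action $a_t$. One caveat is that the v5 environments enable sticky actions by default (repeat_action_probability $= 0.25$), which makes the transitions stochastic unless that option is disabled.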
$\mathcal{R}$ is the reward function. The agent receives a positive reward when the ball hits a brick. The rewards vary depending on the color of the brick row (1, 4, or 7 points). To stabilize training, the rewards are clipped to the range $[-1, 1]$ during optimization, though the raw score is used for evaluation.
$\gamma$ is the discount factor. I set $\gamma = 0.99$, encouraging the agent to value long-term survival and score accumulation.

ALE/Pong-v5

In addition to Breakout, I also evaluated the algorithms on the ALE/Pong-v5 environment. The goal is to control a paddle to hit a ball back and forth against an AI opponent, aiming to be the first to reach 21 points.

$\mathcal{S}$ is the state space. Similar to Breakout, the raw RGB frames ($210 \times 160 \times 3$) are preprocessed into grayscale, resized to $84 \times 84$, and stacked to capture motion. The state $s_t$ is a stack of the 4 most recent frames:
$$s_t \in \mathbb{R}^{4 \times 84 \times 84}$$
$\mathcal{A}$ is the discrete action space. The action space consists of 6 discrete actions:
$$\mathcal{A} = \{ \text{NOOP}, \text{FIRE}, \text{RIGHT}, \text{LEFT}, \text{RIGHTFIRE}, \text{LEFTFIRE} \}$$
Essentially, the agent moves the paddle up or down (mapped to RIGHT/LEFT in ALE terminology).
$\mathcal{P}$ is the state transition probability. Similar to Breakout, the physics are deterministic. However, the transition also implicitly depends on the opponent’s behavior, which is a fixed AI programmed into the game logic.
$\mathcal{R}$ is the reward function. The reward is $+1$ when the agent wins a rally (the opponent misses the ball) and $-1$ when the agent loses a rally. The episode ends when one player reaches 21 points.
$\gamma$ is the discount factor. I set $\gamma = 0.99$.

Algorithms: Deep Q-Networks

In the discrete control task, I implemented the Deep Q-Network (DQN) (Mnih et al., 2015) and explored three advanced variants to improve stability and sample efficiency: Double DQN (Van Hasselt, Guez, and Silver, 2016), Dueling DQN (Wang et al., 2016), and a simplified “Rainbow” agent (Hessel et al., 2018) combining both techniques.

Vanilla Deep Q-Network (DQN)

DQN adapts Q-learning to high-dimensional state spaces by approximating the optimal action-value function $Q^*(s, a)$ with a deep neural network $Q(s, a; \theta)$, parameterized by weights $\theta$. To ensure training stability, DQN employs two key mechanisms:

Experience Replay: Transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in a cyclic buffer $\mathcal{D}$. Training batches are sampled uniformly, breaking the temporal correlation of sequential data.
Target Network: A separate network with parameters $\theta^-$ is used to compute the Temporal Difference (TD) target. These parameters are frozen and updated to match the online network $\theta$ every $C$ steps.
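The experience replay mechanism can be illustrated with a minimal sketch. The class name `ReplayBuffer`, its method names, and the extra `done` flag stored with each transition are illustrative assumptions, not the project's actual code; integers stand in for image states.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size cyclic buffer: the oldest transition is evicted first."""

    def __init__(self, capacity, seed=None):
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal
        # correlation between consecutive transitions.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)

# Only the 3 most recent transitions (t = 2, 3, 4) remain in the buffer.
oldest_state = buffer.storage[0][0]
batch = buffer.sample(2)
```

A `deque` with `maxlen` gives the cyclic overwrite behavior for free; a production buffer for image observations would more likely use preallocated arrays to limit memory use.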

The loss function minimized at iteration $i$ is the expected squared TD error:

$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( y_i^{DQN} - Q(s, a; \theta_i) \right)^2 \right]$$

where the target $y_i^{DQN}$ is computed via the Bellman optimality equation:

$$y_i^{DQN} = r + \gamma \max_{a' \in \mathcal{A}} Q(s', a'; \theta_i^-)$$
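As a small worked example, the target above can be computed for a batch of two transitions. The sketch assumes the common convention, not stated in the equation, that terminal transitions do not bootstrap from $s'$; the function name `dqn_targets` and the numbers are purely illustrative.

```python
def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta^-).

    Assumption: for terminal transitions the bootstrap term is dropped.
    """
    targets = []
    for r, q_next, done in zip(rewards, next_q_target, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets


rewards = [1.0, 0.0]
# Target-network Q-values for the 4 Breakout actions in each next state.
next_q_target = [[0.5, 2.0, 1.0, 0.0], [3.0, 1.0, 0.0, 2.0]]
dones = [False, True]

targets = dqn_targets(rewards, next_q_target, dones)
# First target: 1.0 + 0.99 * 2.0; second target: just the reward (terminal).
```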

The algorithm is summarized in Algorithm 1.

Double DQN (DDQN)

Standard DQN suffers from overestimation bias because the maximization operator ($\max_{a'}$) in the target uses the same noisy estimates both to select and to evaluate an action, and therefore tends to select overestimated values. Double DQN addresses this by decoupling action selection from value estimation. The online network $\theta$ selects the best action, while the target network $\theta^-$ estimates its value:

$$y_i^{DDQN} = r + \gamma Q(s', \underset{a' \in \mathcal{A}}{\text{argmax }} Q(s', a'; \theta_i); \theta_i^-)$$
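The difference from the vanilla target is easiest to see on a single transition where the two networks disagree. The following is a sketch with made-up Q-values; the function name `double_dqn_targets` and the terminal-state handling are assumptions for illustration.

```python
def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Select a' with the online network, evaluate it with the target network."""
    targets = []
    for r, q_on, q_tg, done in zip(rewards, next_q_online, next_q_target, dones):
        if done:  # assumption: no bootstrapping at terminal states
            targets.append(r)
            continue
        best_action = max(range(len(q_on)), key=q_on.__getitem__)  # argmax under theta
        targets.append(r + gamma * q_tg[best_action])              # value under theta^-
    return targets


rewards = [1.0]
next_q_online = [[0.2, 0.9, 0.1, 0.4]]  # online network prefers action 1
next_q_target = [[0.5, 1.0, 3.0, 0.0]]  # target network's maximum is action 2
dones = [False]

ddqn = double_dqn_targets(rewards, next_q_online, next_q_target, dones)
vanilla = rewards[0] + 0.99 * max(next_q_target[0])
# ddqn[0] uses Q_target(s', 1) = 1.0, while vanilla uses the larger max, 3.0.
```

Because the evaluated action is not necessarily the target network's own maximizer, the Double DQN target can never exceed the vanilla target computed from the same target network.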

The algorithm is summarized in Algorithm 2.

Dueling DQN

In many states, the value of the state itself is more important than the value of any specific action, e.g., when the ball is far from the paddle in Breakout. Dueling DQN modifies the network architecture to explicitly separate these estimators. The feature extractor output is split into two streams:

Value Stream $V(s; \theta, \beta)$: Estimates the scalar value of the state.
Advantage Stream $A(s, a; \theta, \alpha)$: Estimates the advantage of each action.

To ensure identifiability, the final Q-value is aggregated by subtracting the mean advantage:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right)$$
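The identifiability argument can be checked numerically: adding a constant to every advantage leaves the aggregated Q-values unchanged, so the value stream alone carries the state's overall level. The sketch below uses plain lists in place of network outputs, and the name `dueling_q` is illustrative.

```python
def dueling_q(value, advantages):
    """Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]


value = 2.0
advantages = [1.0, 0.0, -1.0, 0.0]

q = dueling_q(value, advantages)
# Shifting all advantages by a constant does not change the result.
q_shifted = dueling_q(value, [a + 5.0 for a in advantages])
# The mean of the Q-values recovers V(s).
mean_q = sum(q) / len(q)
```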

Rainbow

In this project, I define the “Rainbow” variant as the combination of the Double Q-learning objective and the Dueling Network architecture. This agent benefits from reduced overestimation bias and improved generalization across states.

Implementation Details

My implementation follows the standard “Nature DQN” architecture adapted for the Gymnasium API.

Preprocessing and Inputs.

The raw Atari frames are preprocessed to reduce computational complexity:

Grayscale & Resizing: Frames are converted to grayscale and resized to $84 \times 84$.
Frame Stacking: The input state $s_t$ consists of the 4 most recent frames stacked along the channel dimension (shape $4 \times 84 \times 84$) to capture motion information, e.g., velocity and direction of the ball.
Normalization: Pixel values are divided by $255.0$ before being fed into the network.
Reward Scale: During training, rewards are clipped to $\text{sign}(r)$ to stabilize gradients. This is particularly important for Breakout where rewards can vary; for Pong, rewards are naturally limited to $\{-1, 0, +1\}$.
Termination Condition: For Breakout, losing a life is treated as a terminal state during training to encourage the agent to value each life. For Pong, episodes terminate only when a player reaches 21 points. Evaluation always follows the standard game rules without artificial termination.
Environment Wrappers: Breakout utilizes a FireResetWrapper to ensure the ball is launched immediately upon reset.
Play before Learning: To diversify initial states, the first 10000 steps are filled with random actions before training begins.
Metric Validity: Crucially, the RecordEpisodeStatistics wrapper is applied before any preprocessing or reward clipping. This ensures that the logged returns represent the true game score (e.g., reaching 21 in Pong or 400+ in Breakout), rather than the modified reward stream seen by the agent.
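Three of the steps listed above (pixel normalization, reward clipping, and frame stacking) can be sketched in a few lines. This is a simplified stand-in rather than the wrappers actually used: the names `clip_reward`, `normalize`, and `FrameStack` are hypothetical, strings stand in for $84 \times 84$ frames, and padding the stack with copies of the first frame at reset is an assumption.

```python
from collections import deque


def clip_reward(reward):
    """Map a raw game reward to sign(r): -1, 0 or +1."""
    return (reward > 0) - (reward < 0)


def normalize(pixels):
    """Scale 8-bit pixel values into [0, 1]."""
    return [p / 255.0 for p in pixels]


class FrameStack:
    """Keeps the most recent frames (already grayscaled and resized)."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Assumption: pad with copies of the first frame so the state
        # always contains num_frames entries, even at episode start.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, frame):
        self.frames.append(frame)  # the oldest frame is dropped automatically
        return list(self.frames)


stack = FrameStack()
state = stack.reset("frame0")
state = stack.step("frame1")
clipped = [clip_reward(r) for r in (7, 4, 1, 0)]
scaled = normalize([0, 255])
```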

Evaluation Protocol.

To ensure fair comparison, the evaluation environment is constructed with specific modifications:

In Breakout, terminal_on_life_loss is disabled, allowing the agent to play through all lives.
Reward clipping is disabled to measure the true cumulative return.
The epsilon for exploration is set to $\epsilon_{min} = 0.00$ to maintain a deterministic policy during evaluation.

Training Configuration.

It is crucial to note the differences in the training budget and parallelization setup for the two environments:

Total Steps: The Breakout agents were trained for 5 million steps to ensure convergence given the task’s complexity. In contrast, Pong agents were trained for only 2 million steps, as the task is significantly easier to master.
Parallelization ($N$): To accelerate data collection, I utilized vectorized environments. For Breakout, I set $num\_envs=16$, whereas for Pong, I used $num\_envs=8$.
Update Frequency: The relationship between data collection and network updates is governed by num_envs and train_freq, where I set train_freq = 4 for both environments. Specifically, in each training loop iteration, the agent performs $U = \max(1, \lfloor \texttt{num\_envs} / \texttt{train\_freq} \rfloor)$ gradient updates. This logic ensures that the number of gradient steps scales appropriately with the degree of parallelization.
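For the two settings used here, the update rule works out as follows. The function name `updates_per_iteration` is illustrative, and the comment about transitions per iteration assumes that one loop iteration steps every parallel environment once.

```python
def updates_per_iteration(num_envs, train_freq=4):
    """U = max(1, floor(num_envs / train_freq))."""
    return max(1, num_envs // train_freq)


breakout_updates = updates_per_iteration(num_envs=16)  # 16 // 4
pong_updates = updates_per_iteration(num_envs=8)       # 8 // 4
single_env_updates = updates_per_iteration(num_envs=1)  # floor is 0, clamped to 1

# If each iteration collects num_envs transitions, both settings perform
# one gradient update per train_freq = 4 collected transitions.
breakout_ratio = 16 / breakout_updates
pong_ratio = 8 / pong_updates
```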

Network Architecture.

The shared backbone consists of three convolutional layers:

Conv2d (32 filters, kernel $8\times8$, stride 4) + ReLU
Conv2d (64 filters, kernel $4\times4$, stride 2) + ReLU
Conv2d (64 filters, kernel $3\times3$, stride 1) + ReLU

The output is flattened and fed into a fully connected layer with hidden dimension $H$. For the Dueling variant, this layer splits into the Value (output dim 1) and Advantage (output dim $|\mathcal{A}|$) heads.
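The size of the flattened feature vector follows from the standard convolution arithmetic. The sketch assumes no padding, as in the original Nature DQN architecture; the helper name `conv_out` is illustrative.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a convolution without padding."""
    return (size - kernel) // stride + 1


size = 84
sizes = [size]
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

# The last layer has 64 filters, so the flattened vector has 64 * 7 * 7 entries.
flat_features = 64 * size * size
```

Under this assumption the spatial resolution shrinks from 84 to 20, 9, and finally 7, so the fully connected layer receives 3136 input features.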

Optimization

I use the Adam optimizer (Kingma and Ba, 2014) and sweep over several learning rates, as reported in the hyperparameter analysis of each experiment. To improve robustness against outliers, I use SmoothL1Loss (Huber loss) (Huber, 1964) instead of MSE. To prevent exploding gradients, the gradient norm is clipped to 10.0. The exploration strategy is $\epsilon$-greedy, with $\epsilon$ decaying linearly from 1.0 to 0.01 over the first 2 million training steps (after the first 10000 steps of random actions).
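The exploration schedule and the loss can be written out explicitly. This is a sketch under two assumptions: the linear decay is counted from the end of the 10000 random warm-up steps, and the Huber threshold is 1.0. The function names `epsilon_at` and `huber` are illustrative.

```python
def epsilon_at(step, start=1.0, end=0.01, decay_steps=2_000_000, warmup=10_000):
    """Linear epsilon decay, held at its final value afterwards."""
    if step < warmup:
        return 1.0  # purely random actions before learning starts
    frac = min(1.0, (step - warmup) / decay_steps)
    return start + frac * (end - start)


def huber(td_error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    a = abs(td_error)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


eps_start = epsilon_at(0)
eps_mid = epsilon_at(1_010_000)   # halfway through the decay
eps_late = epsilon_at(5_000_000)  # long after the decay has finished

small = huber(0.5)  # quadratic regime
large = huber(3.0)  # linear regime; a squared error would give 4.5 here
```

The linear tail is what limits the influence of occasional large TD errors on the gradient.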

Experiments: DQNs on Breakout-v5

Hyperparameter Sensitivity Analysis

I conducted a comprehensive grid search across four DQN variants (Vanilla, Double, Dueling, Rainbow) and three key hyperparameters: Learning Rate (LR), Target Update Frequency ($C$), and Hidden Dimension ($H$). All agents were trained for 5 million steps. Table 1 summarizes the average score over the last 10% of training steps, revealing several critical insights regarding training stability and performance.

Table 1: Training and evaluation results of different DQN variants under various hyperparameter settings. The column for each DQN variant reports its training and evaluation performance in the format Train/Eval. For each hyperparameter configuration, the highest evaluation score among the four variants is highlighted in bold.

Interpretation of Performance Metrics.

Before analyzing the hyperparameters, it is essential to address the large disparity between the reported training and evaluation scores, e.g., $23.79$ vs. $197.07$. The evaluation score is the mean over 5 evaluation episodes, whereas the training score is averaged over the $N$ parallel environments. This gap is structural and stems from the following implementation details, which are designed to stabilize training:

Reward Scale: During training, rewards are clipped to $\text{sign}(r)$ (i.e., $+1$ per brick), whereas evaluation uses the raw Atari scoring system (up to $+7$ per brick).
Termination Condition: Training episodes terminate immediately upon losing a life to discourage dangerous behaviors. Evaluation episodes play through all 5 lives, naturally resulting in much higher cumulative scores.
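A toy calculation shows how these two effects compound. The per-life reward sequences below are invented purely for illustration and are not taken from the experiments.

```python
def clip_reward(reward):
    """sign(r), as applied to rewards during training."""
    return (reward > 0) - (reward < 0)


# Hypothetical raw brick rewards collected during each of the 5 lives.
rewards_by_life = [[1, 1, 4], [4, 7, 7], [1], [7, 7], [4]]

# Training view: every life is its own episode and rewards are clipped.
train_returns = [sum(clip_reward(r) for r in life) for life in rewards_by_life]
mean_train_return = sum(train_returns) / len(train_returns)

# Evaluation view: one episode spans all lives and uses raw rewards.
eval_return = sum(sum(life) for life in rewards_by_life)
```

In this made-up example the same play yields a mean training return of 2.0 and an evaluation return of 43, so a large gap between the two numbers does not by itself indicate a discrepancy in agent quality.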

Analysis.

Firstly, the agent is highly sensitive to the learning rate. Across all variants, $LR=1 \times 10^{-4}$ consistently yields the best performance, i.e., it is the “Sweet Spot”.

Low LR ($5 \times 10^{-5}$): Resulted in slow convergence. While stable, the agents failed to reach high scores within the 5M step budget.
High LR ($5 \times 10^{-4}$): Resulted in catastrophic failure (Eval scores $\approx 40$). At this magnitude, the Q-values likely diverged, preventing the policy from learning meaningful behaviors.

Secondly, the interaction between update frequency and network architecture is notable.

Vanilla & Double DQN performed robustly at a faster frequency $C=1000$.
Dueling & Rainbow benefited significantly from slower updates ($C=2000$ or $5000$). For instance, Dueling DQN achieved its global maximum of $178.44$ at $C=5000$, and Rainbow peaked at $171.34$ at $C=2000$. This suggests that architectures with separate value/advantage streams require more stationary targets to stabilize optimization.

Moreover, increasing the network capacity from $256$ to $512$ did not guarantee improvement. Surprisingly, Vanilla DQN achieved its highest score ($197.07$) with the lighter network ($H=256$), suggesting that for the standard architecture, a larger model might introduce overfitting or optimization difficulties given the limited data diversity of a single game. However, Rainbow generally preferred the larger capacity ($H=512$), utilizing the extra parameters effectively to combine the benefits of Double and Dueling mechanisms.

Summary.

While Vanilla DQN proved surprisingly strong with a smaller network, the advanced variants (Dueling and Rainbow) demonstrated superior potential when tuned with slower target updates. The learning rate $LR=1 \times 10^{-4}$ was the best choice for every variant tested, making it a reliable baseline for this environment.

Comparative Performance Analysis

To quantify the improvements offered by the algorithmic enhancements, I compared the evaluation learning curves of the four variants using their respective optimal hyperparameter configurations derived from Section 3.1.

Figure 1: Evaluation episode reward over training steps for the four DQN variants. Curves show the average return of evaluation episodes and are plotted without smoothing for transparency.

As shown in Figure 1, all four algorithms successfully solve the task, reaching scores between 300 and 400.

Convergence Speed: The variants incorporating the Dueling architecture demonstrate a more consistent upward trend in the early-to-mid training phase.
Stability: Notably, the Vanilla DQN (Blue) exhibits high variance in the late stages, with performance oscillating wildly between 400 and near-zero. This instability suggests that while the simple architecture can reach high scores, it is prone to catastrophic forgetting or policy degradation. In contrast, the Rainbow agent maintains a more robust performance profile, benefiting from the combined stability of Double Q-learning and Dueling networks.

Internal Dynamics: Loss and Gradient Stability

To investigate the optimization landscape, I analyzed the training Loss and the $L_2$ norm of the gradients $\|\nabla_\theta L\|_2$. These metrics provide insight into how “hard” the optimizer has to work to fit the target values.

Architecture Efficiency.

A striking pattern in Figure 4 is that the Dueling DQN consistently maintains the lowest loss and gradient norms throughout the entire training process.

Loss Reduction: The Dueling architecture explicitly models the state value $V(s)$. In Breakout, many pixel changes (background noise or ball movement in non-critical areas) do not affect the optimal action advantage. By isolating $V(s)$, the network can fit the Bellman target more efficiently, resulting in smaller TD errors compared to Vanilla DQN, which must update the entire $Q(s, a)$ for every action.
Gradient Stability: As a direct result of the lower loss, the gradient norms for Dueling DQN are significantly smaller ($\sim 0.5$) compared to Vanilla DQN ($\sim 2.0 - 4.0$). This smoother optimization landscape explains the improved stability observed in the evaluation curves, as the network weights are less likely to suffer from destructive updates caused by exploding gradients.

Q-Value Analysis: The Overestimation Bias

A theoretical weakness of Vanilla DQN is the maximization bias in the target calculation, which tends to overestimate action values. Figure 5 compares the average predicted Q-values during training.

Figure 5: Evolution of average Q-values. Vanilla DQN exhibits the highest estimates, while Dueling DQN maintains the most conservative predictions.

The plot confirms the theoretical prediction:

Vanilla DQN exhibits the most aggressive Q-value growth, peaking above 6.0. Given that the clipped reward is 1 per brick, this implies the agent expects a discounted return worth roughly 6 future bricks.
Dueling DQN maintains the most conservative estimates, staying below 4.0 for most of the training.
Mitigation: Both Double DQN and Rainbow reduce the estimates compared to Vanilla, lying in between the two extremes. This is consistent with the Double Q-learning objective mitigating overestimation, which keeps the agent from becoming “over-optimistic” about risky states.
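To relate a Q-value to a number of bricks, it helps to compute a discounted return explicitly. The sketch below assumes a hypothetical reward stream in which one brick (clipped reward 1) is hit every 10 steps; the spacing is invented for illustration, and the helper names are not from the project.

```python
def discounted_return(rewards, gamma=0.99):
    """G = sum_t gamma^t * r_t, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def brick_stream(num_bricks, gap=10):
    """Hypothetical stream: one clipped reward of 1 every `gap` steps."""
    rewards = []
    for _ in range(num_bricks):
        rewards.extend([0.0] * (gap - 1) + [1.0])
    return rewards


check = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
six_bricks = discounted_return(brick_stream(6))
ten_bricks = discounted_return(brick_stream(10))
```

Because each future brick is discounted, a stream of $n$ bricks is worth less than $n$, so an accurate Q-value near 6 would correspond to somewhat more than six future bricks. With $\gamma = 0.99$ and rewards clipped to 1, no return can exceed $1/(1-\gamma) = 100$.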

Visual Interpretability: Saliency Maps

To verify that the agent has learned meaningful visual features rather than overfitting to noise, I generated Saliency Maps using the final model of the Vanilla DQN agent. The Saliency Map $S$ is computed by taking the gradient of the max Q-value with respect to the input state pixels $s$:

$$\begin{aligned} S = \left| \frac{\partial \max_a Q(s,a)}{\partial s} \right| \end{aligned}$$

High gradient values (hotspots) indicate pixels that, if changed, would most drastically alter the agent’s expected return.
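The definition can be checked on a toy problem. The sketch below replaces the convolutional network with a hypothetical linear Q-function over four “pixels” and approximates the gradient by central finite differences; with a real network the gradient would normally come from automatic differentiation instead.

```python
def q_values(state, weights):
    """Toy linear Q-function: Q(s, a) = sum_i w[a][i] * s[i]."""
    return [sum(w_i * s_i for w_i, s_i in zip(w, state)) for w in weights]


def saliency(state, weights, eps=1e-4):
    """|d max_a Q(s, a) / d s_i| for every input i, by central differences."""
    result = []
    for i in range(len(state)):
        up = list(state)
        down = list(state)
        up[i] += eps
        down[i] -= eps
        grad = (max(q_values(up, weights)) - max(q_values(down, weights))) / (2 * eps)
        result.append(abs(grad))
    return result


# Two actions, four input pixels (all values are made up).
weights = [[0.0, 2.0, 0.0, -1.0],
           [0.5, 0.0, 0.0, 0.0]]
state = [1.0, 1.0, 1.0, 0.0]

sal = saliency(state, weights)
# Action 0 has the larger Q-value, so the map equals |weights[0]|:
# pixel 1 matters most, while pixels 0 and 2 do not affect max_a Q at all.
```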

Figure 6: Visualizing agent attention. Left: The raw input frame (preprocessed) after the agent played several steps. Right: The saliency map overlay. The heatmap reveals that the agent focuses intensely on the specific bricks being targeted (red/yellow cluster) and the paddle’s position, demonstrating an understanding of the game physics.

As visualized in Figure 6, the saliency is highly localized.

Ball Monitoring: The strongest gradients (red/yellow) are concentrated at the position of the ball. Because the input stacks four frames, the gradients surrounding the ball can also encode its direction of motion, indicating that the agent has learned to track the ball.
Scoring Point Modeling: Some saliency also falls on the locations of bricks that were already knocked down, suggesting that the model associates these regions with scoring.
Target Identification: Further gradients cluster around the dense formation of bricks that the ball is about to hit, suggesting that the network attends to the bricks that are about to be broken.
Paddle Tracking: There is also visible activation near the paddle at the bottom, consistent with the agent monitoring its own position relative to the ball.
Noise Filtering: The empty black space and the score digits at the top receive almost no saliency, suggesting that the convolutional network has learned to ignore irrelevant background information.

Late-Stage Behavior: The “Tunneling” Loop.

An interesting behavioral anomaly was observed when I visualized the agent’s behavior during final evaluation. Although the agent achieves high scores ($>300$), it often fails to clear the entire screen. Instead, it learns to exploit a specific strategy: digging a vertical “tunnel” through the wall of bricks to trap the ball in the upper space. While this yields a large burst of points, the agent struggles to recover once the ball returns to the lower area. It tends to get stuck in a repetitive loop of hitting the remaining side bricks without effectively targeting the final few scattered blocks. This suggests that the policy has converged to a strong local optimum (the “tunneling” strategy) but lacks the sophisticated planning required to systematically clear the board, a known limitation of reactive agents like DQN without explicit hierarchical planning.

Overall, the visualizations confirm that the DQN agent has learned to attend to critical game elements (ball, paddle, target bricks) while filtering out irrelevant noise, demonstrating effective visual feature extraction aligned with gameplay objectives.

Experiments: DQNs on Pong-v5

I also applied the Deep Q-Networks to the Pong environment to verify the generalizability of the algorithms and the hyperparameter findings.

Hyperparameter Sensitivity Analysis

Similar to the Breakout experiment, I conducted a grid search on ALE/Pong-v5 to analyze how the agent performs under different configurations. The results are summarized in Table 2.

Table 2: DQN parameter analysis results on Pong. The table shows Train/Eval scores. The maximum possible score in Pong is 21 (winning all rallies).

Analysis.

The results on Pong reveal an interesting contrast to Breakout:

Ease of Convergence: Across most hyperparameter settings, the evaluation scores are very high ($\approx 18-21$), indicating that Pong is a significantly easier task for the agent to master compared to Breakout. The zero-sum nature and the direct feedback loop likely facilitate faster learning.
Robustness to Hyperparameters: The agent is remarkably robust. While Breakout showed drastic performance drops with suboptimal parameters, Pong agents generally maintain a positive score ($>0$, i.e., they win more rallies than they lose) even in less ideal configurations.
The “Sweet Spot” Validated: Despite the easier task, the configuration $LR=1 \times 10^{-4}$ with update frequency $C=2000$ or $5000$ still yields the most consistent near-perfect scores.
Instability of Complex Models at Low LR: Notably, the Dueling and Rainbow variants performed poorly (Eval $\approx 12$) when $LR=5 \times 10^{-5}$ and $H=512$. This suggests that for larger networks, a too-small learning rate may lead to under-fitting or getting stuck in local optima, even in simple environments.

Comparative Performance Analysis

To visualize the learning dynamics, I plotted the learning curves of the four DQN variants using their optimal hyperparameter configurations, which are highlighted in Table 2. Figure 9 presents both the Training Episode Reward (noisy, exploration-heavy) and the Evaluation Mean Reward (deterministic).

The Exploration Gap.

A striking phenomenon observed in Pong is the significant disparity between training and evaluation performance. While the Evaluation Reward quickly converges to the maximum possible score of $+21$ around 1.5M steps, the Training Reward remains highly volatile and often negative. This is attributed to the $\epsilon$-greedy exploration strategy, where I let $\epsilon$ decay linearly over the entire 2M steps. In a precision-demanding game like Pong, even a small probability of a random action can lead to missing the ball, resulting in an immediate $-1$ penalty. This “Exploration Cost” masks the true competence of the agent during training.

Algorithmic Comparison.

Comparing the variants in Figure 8:

Convergence Speed: All algorithms solve the environment effectively. However, Double DQN exhibits a slightly faster initial takeoff, crossing the 0-score threshold (winning more than losing) earlier than Vanilla DQN.
Stability: Once converged (after 1.75M steps), all variants, including Vanilla DQN, stably maintain perfect scores, further confirming that Pong is less prone to the instability issues seen in Breakout.

Internal Optimization Dynamics

To further understand the stability of the learning process in Pong, I analyzed the temporal evolution of the Loss function and the Gradient Norms ($L_2$). Figure 12 visualizes these metrics.

Smoother Optimization Landscape.

A comparative analysis with the Breakout experiment reveals a significant difference in the magnitude of the optimization metrics.

Lower Gradient Norms: In Breakout, the gradient norms frequently oscillated between $2.0$ and $4.0$ for Vanilla DQN. In contrast, for Pong, the gradient norms generally remain below $0.5$ with occasional spikes to $1.2$ for Dueling DQN.
Reduced Loss Scale: Similarly, the TD-error (Loss) in Pong is an order of magnitude smaller.

This significant reduction in gradient variance suggests that the function approximation landscape for Pong is much smoother. The mapping from pixels to the optimal value function is likely less complex, allowing the optimizer to descend towards the minimum without the violent fluctuations seen in the more chaotic Breakout environment. This stability explains why most hyperparameter configurations in Pong were able to reach the maximum score, whereas Breakout required careful tuning.

Q-Value Analysis: Learning Dynamics in Zero-Sum Games

I tracked the evolution of the average predicted Q-values throughout the training process, as shown in Figure 13.

Figure 13: Evolution of average Q-values in Pong. Unlike Breakout, Q-values initially drop to negative values before rising, reflecting the penalty for losing points in the early stages.

Three observations characterize the learning dynamics in Pong relative to Breakout:

Overestimation Mitigation: Consistent with the Breakout results, Vanilla DQN exhibits the highest Q-values, while Double DQN and Rainbow maintain more conservative estimates. This reconfirms that the advanced algorithms effectively reduce maximization bias.
The “Lose-to-Win” Trajectory: Unlike Breakout, where Q-values rise monotonically, the Q-values in Pong exhibit a distinct “V-shape” trajectory:
- Phase 1 (The Dip): In the first 0.5M steps, Q-values drop significantly, reaching negative values (approx $-1.0$). This reflects the agent’s initial incompetence. In Pong, missing a ball results in a $-1$ reward. A random agent loses frequently, so the Q-function correctly learns to predict a negative expected return.
- Phase 2 (The Ascent): As the agent improves (correlated with the rise in evaluation rewards), the Q-values begin to recover, eventually becoming positive as the agent learns to consistently defeat the opponent.
Magnitude Disparity: A final observation is the scale of the Q-values. The estimates in Pong peak around $1.5$, which is significantly lower than the values observed in Breakout. This aligns with the reward structures: Breakout offers the potential for high cumulative scores by breaking many bricks in sequence, whereas Pong is a series of short rallies with rewards strictly bounded by $\{-1, +1\}$, resulting in lower discounted returns.

This trajectory highlights the contrast between Pong, a zero-sum game in which the agent can accumulate negative return, and Breakout, where rewards are never negative.

Visual Interpretability: Saliency Maps

Similar to the Breakout analysis, I computed the Saliency Map for the best-performing Rainbow agent in Pong to visualize which parts of the input it attends to. Figure 14 displays the raw observation and the corresponding gradient heatmap.

Figure 14: Saliency map for Rainbow DQN in Pong. The agent strongly attends to the ball (center) and the paddles, ignoring the background.

The heatmap reveals that the Rainbow DQN agent has learned to focus on the most game-critical elements:

Ball Tracking: The highest gradient intensity coincides with the ball’s trajectory. This is consistent with the agent monitoring the ball’s position and velocity in order to intercept it.
Paddle Awareness: There is significant activation around both the agent’s paddle and the opponent’s paddle. This suggests the agent is not playing “blindly” but is reactive to the opponent’s positioning, likely to predict return angles.
Background Suppression: The black background and the scoreboard receive negligible attention (Blue/Dark Blue), indicating effective noise filtering by the convolutional layers.

This visual evidence supports the conclusion that the high scores are a result of learning the correct physics-based strategy rather than exploiting bugs or noise in the emulator.

Summary

In this chapter, I conducted a comprehensive study of Deep Q-Networks on discrete control tasks, ranging from the classic Vanilla DQN to advanced variants like Rainbow. The experiments on Breakout and Pong yielded several critical insights:

Algorithmic Improvements: The published enhancements to the original DQN demonstrated clear benefits. Double DQN successfully mitigated Q-value overestimation, particularly in the early stages of learning. Dueling DQN improved sample efficiency and stability by decoupling state-value estimation from action advantages, leading to smoother optimization landscapes. Rainbow DQN, by combining these improvements, consistently achieved top-tier performance.
Task Complexity & Hyperparameters: The contrast between Breakout and Pong highlighted the importance of hyperparameter tuning relative to task difficulty. While Pong was robust to a wide range of configurations, Breakout proved highly sensitive to learning rates and update frequencies. This underscores that there is no “one-size-fits-all” configuration in RL; hyperparameters must be adapted to the specific dynamics of the environment.
Interpretability: Through Q-value analysis and Saliency Maps, I verified that the agents are not merely memorizing patterns but are learning meaningful representations of the game state. The “V-shape” Q-value trajectory in Pong further demonstrated the agent’s ability to correctly model the long-term discounted returns in a zero-sum setting.

These findings confirm the efficacy of value-based methods in high-dimensional visual control tasks while emphasizing the need for careful algorithmic selection and tuning based on the environment’s characteristics.